Implementing Arc Relationships with Virtual Columns? Or Not?

I wrote a post some time ago about implementing arc relationships using virtual columns.
Recently, Toon Koppelaars wrote a detailed and reasoned comment to that post. Since I admire Toon, getting his point of view on something that I wrote is a privilege for me, regardless if he agrees with me or disagrees (and just to be clear, it’s the latter this time). I think that having a public (and civilized) discussion – this time about principles of data modeling and implementation – is a good thing, even if at the end we don’t convince each other.
Therefore I thought Toon’s comment deserves a post of its own. So I’ll quote here everything that he wrote in the comment, with my response after every sentence or paragraph. And everyone is welcome to add their own points of view in the comments section.

I’d like to emphasize that basically I agree with Toon about the general guidelines. What makes life harder, as always, are the nuances, or as we know them well as “it depends”.

And just one general observation first. The starting point of the original post was that the design included an arc. This is the given fact, and my intent was to suggest an implementation for this given fact. I don’t want to put words in your mouth (or keyboard), Toon, but I think that the basis for your arguments is that the arc, as a design concept, is wrong. I think you would prefer to see several entities in the ERD instead of the one with the arc. Am I right in this interpretation?

Toon wrote:
Oren, I just cannot, not comment on this post 😉

And I really appreciate it. Really.

Toon wrote:
Re: Multiple Tables
1) What do you mean by “maintaining another table”? And why is that a disadvantage?

I’ll refer to it after the next paragraph.

Toon wrote:
In my opinion, you should always have a software factory in which the act of introducing a new table is cheap. Why? Because otherwise you end up with database designs with few tables, into which more than few “things” are stored. Which is bad. It’s bad for understandability, it’s bad for performance, it’s bad for future maintenance.

And I completely agree with you.

Toon wrote:
Apparently adding a table is not cheap in your factory?
But, yet, dropping and recreating a CHECK constraint (to involve a new single-char value) is cheap?

As you know me and my database development principles, beliefs and guidelines (for example, here and here), I assume this is a rhetorical question. Of course adding a table is cheap. Actually it is one of the cheapest tasks.
And dropping and recreating a CHECK constraint is not expensive either. It may take some time to VALIDATE it if the table is big, but this is not a reason not to do it. And anyway, if, in order to implement a good design, something is “expensive” (but still realistic), it should not be a reason not to do it (at least, not a good reason).

But, if we have, by design, several tables that are identical – in column names, data types, indexes, constraints – everything except for the table on the other side of the “owner” foreign key, then I start to feel uncomfortable. I get the same feeling that I get when I see two procedures that do the same (or almost the same) thing.
And although many times the arc appears for simple join tables, as in my example, this is not always the case. The entity with the arc may include many columns.
In addition, by “maintaining another table” I did not mean just the CREATE TABLE statement. When a new attribute is added to the designed entity, we need to add a new column – exactly the same column – to all the tables. And the same is true for any change – renaming a column, changing data types, adding or removing constraints, etc.
In my opinion, if I have to repeat the same work over and over instead of doing it just once, I probably do something wrong.
It also reminds me of denormalization. Usually when we talk about denormalization we mean that the same data is kept more than once. I think it’s the same here – just that the duplicate data is kept in the data dictionary tables and not in our schema. And one of the disadvantages of denormalization is that it increases the risk for bugs; in this case, due to a human error we may end up with PERSON_ADDRESSES and COMPANY_ADDRESSESS that are not exactly the same, although they should be.

Toon wrote:
2) Why is the implied code-difference for this solution, more difficult to maintain?

Here I was mainly thinking about lack of reusability.

Toon wrote:
Instead of having these two inserts:
insert into entity_addresses values(AID1,’P’,PID1), and insert into entity_addresses values(AID2,’C’,CID1).

You’d have these two:
insert into person_addresses(AID1,PID1), and insert into company_addresses(AID2,CID1).

Why is the former “cheaper”?

I was thinking of writing only one INSERT statement:

CREATE OR REPLACE PACKAGE BODY address_mgr AS

    PROCEDURE add_entity_address
    (
        i_address_id    IN entity_addresses.address_id%TYPE,
        i_owner_type_id IN entity_addresses.owner_type_id%TYPE,
        i_owner_id      IN entity_addresses.owner_id%TYPE
    ) IS
    BEGIN
        INSERT INTO entity_addresses
            (address_id,
             owner_type_id,
             owner_id)
        VALUES
            (i_address_id,
             i_owner_type_id,
             i_owner_id);
    END add_entity_address;
.
.
.
END address_mgr;
/

Toon wrote:
And this trickles down to all other kinds of SQL statements. Why is:

select *
from adresses a
entity_addresses ea
,persons p
where a.id = ea.address_id
and ea.object_type = ‘P’
and ea.object_id = p.person_id

apparently easier to write and/or maintain (i.e. cheaper), than:

select *
from adresses a
person_addresses pa
,persons p
where a.id = ea.person_id
and pa.person_id = p.person_id

?
I just don’t get that…

Again, I was thinking of a more generic code, rather than duplicating the same code pattern. Something like this:

create type address_t as object (
  id integer,
  street varchar2(30),
  house_number varchar2(10),
  city varchar2(30),
  country varchar2(30)
)
/

create type address_tt as table of address_t 
/

CREATE OR REPLACE PACKAGE BODY address_mgr AS
.
.
.
    PROCEDURE get_addresses
    (
        i_owner_type_id IN entity_addresses.owner_type_id%TYPE,
        i_owner_id      IN entity_addresses.owner_id%TYPE,
        o_addresses     OUT address_tt
    ) IS
    BEGIN
        SELECT address_t(a.id, a.street, a.house_number, a.city, a.country)
        BULK   COLLECT
        INTO   o_addresses
        FROM   addresses        a,
               entity_addresses ea
        WHERE  ea.owner_type_id = i_owner_type_id
        AND    ea.owner_id = i_owner_id
        AND    a.id = ea.address_id;
    END get_addresses;
.
.
.
END address_mgr;
/

CREATE OR REPLACE PACKAGE BODY person_mgr AS

    PROCEDURE get_person
    (
        i_person_id  IN people.id%TYPE,
        o_first_name OUT people.first_name%TYPE,
        o_last_name  OUT people.last_name%TYPE,
        o_addresses  OUT address_tt
    ) IS
    BEGIN
        SELECT p.first_name,
               p.last_name
        INTO   o_first_name,
               o_last_name
        FROM   people p
        WHERE  p.id = i_person_id;
    
        address_mgr.get_addresses(i_owner_type_id => 'P',
                                  i_owner_id      => i_person_id,
                                  o_addresses     => o_addresses);
    END get_person;
.
.
.
END person_mgr;
/

CREATE OR REPLACE PACKAGE BODY company_mgr AS

    PROCEDURE get_company
    (
        i_company_id          IN companies.id%TYPE,
        o_name                OUT companies.name%TYPE,
        o_number_of_employees OUT companies.number_of_employees%TYPE,
        o_description         OUT companies.description%TYPE,
        o_addresses           OUT address_tt
    ) IS
    BEGIN
        SELECT c.name,
               c.number_of_employees,
               c.description
        INTO   o_name,
               o_number_of_employees,
               o_description
        FROM   companies c
        WHERE  c.id = i_company_id;
    
        address_mgr.get_addresses(i_owner_type_id => 'C',
                                  i_owner_id      => i_company_id,
                                  o_addresses     => o_addresses);
    END get_company;
.
.
.
END company_mgr;
/

By the way, we can achieve the same level of code reusability with the “multiple tables” option, by using dynamic SQL instead of static SQL, but I think both of us do not want to go there, do we?

Toon wrote:
To me this alternative (Multiple Tables) *is* the way to deal with the information requirement.
Having entity_adresses, introduces a column in my design (owner_id) whose meaning depends upon the value of another column (owner_type), which is just bad in my opinion.

Isn’t this the case with every composite key, that we need the values of all the columns that compose the key in order to uniquely identify the entity?

Toon wrote:
This alternative also offers the best chances of being able to easily cater for possible future differences in information requirements at the relationship-level. Eg. we may want to track since when a person has had a particular address, but we don’t care for that information requirement in case of companies: you then just add the ‘as-of’ date column to the person_addresses table only, and be done with it.

In my opinion, this becomes a different case. If the entities are not identical in their attributes, I would consider them different entities, and therefore each one deserves its own table. And since I’m not afraid of changing my schema and my code, I would split the generic table into multiple tables only when the need arises.

Toon wrote:
It is the most simple, and easy to understand for others that come join the maintenance team.
Imagine coming in and seeing the table with invisible generated virtual columns that have FK’s on them. Hmmm…

I would probably say: “what a cool idea” 😉

Adding a Column with a Default Value and a Constraint

The Constraint Optimization series:

Part 1: Optimization of Check Constraint Creation
Part 2: Optimization that Violates Data Integrity
Part 3: Optimization of Foreign Key Constraint Creation
Part 4: (Lack of) Optimization of Unique Constraint Creation
Part 5: Adding a Column with a Default Value and a Constraint
Part 6: Fast but Offline, or Online but Slow?
Part 7: Adding a Unique Constraint in an Online Way
Part 8: Constraint Optimization Summary

In the previous parts of this series I showed that Oracle does a nice optimization – that may save plenty of time – when we add in a single ALTER TABLE statement a new nullable with no default value column and a corresponding:
– inline (a.k.a. “column-level”) check constraint
or
– foreign key constraint (either inline or out-of-line)

In these cases Oracle enables the constraint (for future DML statements) and marks the constraint as VALIDATED (stating that existing records respect the constraint) without actually performing any check – counting on the fact that all the existing records have NULL in the new column. This is a beautiful optimization – if the table already contains many records, then skipping the (unnecessary) validation phase can save lots of time (during which the table is locked in a highly restrictive mode [I’ll discuss locking in the next part of this series]).

What about adding a new column with a (constant) default value?

Since all the existing records will have the same value in the new column (the default value) – then the validation phase can be very simple and short. Oracle should only check that this single value respects the constraint; there is no reason to visit each and every record in the table for repeating the same validation over and over again. Unfortunately, this kind of optimization is not done.

I added this suggestion to the OTN Database Ideas section: http://community.oracle.com/ideas/17751. Please vote up if you think it’s a good idea.

With the latest releases of Oracle, such an optimization would be valuable in particular, because as of 11g adding a NOT NULLable column with a default value is a metadata-only operation, and as of 12c the same is true also for adding a NULLable column with a default value.

Here is an example with a check constraint:

> alter table t add (c number default 8 not null constraint c_chk check(c>0));

We can see in the SQL trace file the following:

=====================
PARSING IN CURSOR #407309168 len=110 dep=1 uid=194 oct=3 lid=0 tim=745434447778 hv=3957567910 ad='7ffb3434b250' sqlid='2z018fgpy7cd6'
 select /*+ all_rows ordered dynamic_sampling(2) */ A.rowid, :1, :2, :3 from "DEMO"."T" A where not ( c>0)
END OF STMT
.
.
.
FETCH #407309168:c=781250,e=3187349,p=142858,cr=142882,cu=0,mis=0,r=0,dep=1,og=1,plh=1601196873,tim=745437636908
STAT #407309168 id=1 cnt=0 pid=0 pos=1 obj=99064 op='TABLE ACCESS FULL T (cr=142882 pr=142858 pw=0 time=3187348 us cost=38895 size=1250000 card=50000)'
CLOSE #407309168:c=0,e=3,dep=1,type=0,tim=745437636980

And here is an example with a foreign key constraint:

> alter table c add (
  parent_id number default 1 not null constraint c_fk_p references p(id)
  );

PARSING IN CURSOR #386313832 len=207 dep=1 uid=0 oct=3 lid=0 tim=747182328580 hv=3249972061 ad='7ffb09624948' sqlid='83w6f8v0vd8ux'
 select /*+ all_rows ordered dynamic_sampling(2) */ A.rowid, :1, :2, :3 
 from "DEMO"."C" A , "DEMO"."P" B 
 where( "A"."PARENT_ID" is not null) and( "B"."ID" (+)= "A"."PARENT_ID") and( "B"."ID" is null)
END OF STMT
.
.
.
FETCH #386313832:c=906250,e=3396711,p=142857,cr=142883,cu=0,mis=0,r=0,dep=1,og=1,plh=1351743368,tim=747185726213
STAT #386313832 id=1 cnt=0 pid=0 pos=1 obj=0 op='HASH JOIN ANTI (cr=142883 pr=142857 pw=0 time=3396711 us cost=40650 size=38000000 card=1000000)'
STAT #386313832 id=2 cnt=1000000 pid=1 pos=1 obj=98387 op='TABLE ACCESS FULL C (cr=142882 pr=142857 pw=0 time=616429 us cost=38895 size=25000000 card=1000000)'
STAT #386313832 id=3 cnt=1 pid=1 pos=2 obj=98386 op='INDEX FULL SCAN P_PK (cr=1 pr=0 pw=0 time=8 us cost=0 size=13 card=1)'
CLOSE #386313832:c=0,e=3,dep=1,type=0,tim=747185726320

(Lack of) Optimization of Unique Constraint Creation

The Constraint Optimization series:

Part 1: Optimization of Check Constraint Creation
Part 2: Optimization that Violates Data Integrity
Part 3: Optimization of Foreign Key Constraint Creation
Part 4: (Lack of) Optimization of Unique Constraint Creation
Part 5: Adding a Column with a Default Value and a Constraint
Part 6: Fast but Offline, or Online but Slow?
Part 7: Adding a Unique Constraint in an Online Way
Part 8: Constraint Optimization Summary

In the previous parts of this series I showed that Oracle does a nice optimization – that may save plenty of time – when we add in a single ALTER TABLE statement a new (nullable with no default value) column and a corresponding:
– inline (a.k.a. “column-level”) check constraint
or
– foreign key constraint (either inline or out-of-line)

What about unique constraints?

When we create a unique constraint on some column, Oracle creates by default a corresponding unique B*Tree index on that column, in order to enforce the constraint.
When we add a new column with no default value to a non-empty table, and in the same ALTER TABLE statement we create a unique constraint on that column, then the corresponding index will be created, by definition, with no entries (as B*Tree indexes do not contain entirely NULL keys). It would be nice if Oracle would simply create an empty index in this case, without scanning all the table rows, but unfortunately this kind of optimization is not done.
Continue reading “(Lack of) Optimization of Unique Constraint Creation”

Optimization of Foreign Key Constraint Creation

The Constraint Optimization series:

Part 1: Optimization of Check Constraint Creation
Part 2: Optimization that Violates Data Integrity
Part 3: Optimization of Foreign Key Constraint Creation
Part 4: (Lack of) Optimization of Unique Constraint Creation
Part 5: Adding a Column with a Default Value and a Constraint
Part 6: Fast but Offline, or Online but Slow?
Part 7: Adding a Unique Constraint in an Online Way
Part 8: Constraint Optimization Summary

In a recent post I showed that Oracle does a nice optimization when we add a new (nullable with no default value) column with an inline (a.k.a. “column-level”) check constraint in a single ALTER TABLE statement. This optimization does not apply for out-of-line (“table-level”) check constraints.

So, what about foreign key constraints?

Clearly, when adding a new nullable with no default value column to a table which contains records, then, by definition, the new column has NULL in all the existing records. If in the same ALTER TABLE statement we also create a foreign key constraint for the new column, then, by definition, the constraint is valid for all the existing records, because NULLs always pass the foreign key constraint check. So in this case it is unnecessary to access each and every record of the table for the validation phase of the foreign key constraint creation, and indeed Oracle applies this optimization, in both inline and out-of-line foreign key constraints. Let’s see it.
Continue reading “Optimization of Foreign Key Constraint Creation”

Optimization of Check Constraint Creation

The Constraint Optimization series:

I have a table T with many records and I want to add a new column C to this table with some check constraint on C.

Does it matter if I use the following statement

ALTER TABLE T ADD (C NUMBER, CONSTRAINT C_CHK CHECK (C>0));

or this one

ALTER TABLE T ADD (C NUMBER CONSTRAINT C_CHK CHECK (C>0));

?
Note that the only difference is the comma that appears in the first option and not in the second one.

It certainly matters.
In the first option the constraint is an out-of-line (“table-level”) check constraint, while in the second option it is an inline (“column-level”) one.
With out-of-line check constraint Oracle scans the table, and validates that each and every record satisfies the constraint. This is similar to adding the column in one statement, and then adding the constraint in a second statement.
But when we add the column with an inline check constraint in a single statement, Oracle optimizes this process. Since the added column is NULL for all the existing records, it is actually unnecessary to access the records in order to perform the validation.

We can see a small example (tested in 11.2.0.4, 12.1.0.2, 12.2.0.1). First, adding the column with the out-of-line check constraint:

> create table t (x varchar2(1000));

Table created.

> insert /*+ append */ into t
> select lpad(rownum,1000,'x') from dual connect by level<=1e6;

1000000 rows created.

> commit;

Commit complete.

> col value new_value value_0 noprint
> select b.value from v$statname a, v$mystat b 
> where a.statistic# = b.statistic# and a.name = 'session logical reads';

> set timing on
> ALTER TABLE T ADD (C NUMBER, CONSTRAINT C_CHK CHECK (C>0));

Table altered.

Elapsed: 00:00:04.27

> set timing off
> select b.value-&VALUE_0 logical_reads from v$statname a, v$mystat b
> where a.statistic# = b.statistic# and a.name = 'session logical reads';

LOGICAL_READS
-------------
       143243

Note the elapsed time and the number of logical reads – most of them are due to accessing all the rows of T in order to validate the check constraint.
Now, let’s see the same example but with the inline check constraint instead of the out-of-line one:

> drop table t purge;

Table dropped.

> create table t (x varchar2(1000));

Table created.

> insert /*+ append */ into t
> select lpad(rownum,1000,'x') from dual connect by level<=1e6;

1000000 rows created.

> commit;

Commit complete.

> col value new_value value_0 noprint
> select b.value from v$statname a, v$mystat b
> where a.statistic# = b.statistic# and a.name = 'session logical reads';

> set timing on
> ALTER TABLE T ADD (C NUMBER CONSTRAINT C_CHK CHECK (C>0));

Table altered.

Elapsed: 00:00:00.01

> set timing off
> select b.value-&VALUE_0 logical_reads from v$statname a, v$mystat b
> where a.statistic# = b.statistic# and a.name = 'session logical reads';

LOGICAL_READS
-------------
          359

Much faster, with significantly less logical reads, since accessing the actual records is omitted.

A Recipe for Summoning the RBO Monster (even in Oracle 12c): On Delete Cascade, Function-Based Index and Missing Table Statistics

The last version of Oracle in which CHOOSE was officially supported as an OPTIMIZER_MODE parameter value was 9.2.
This is what the documentation of Oracle 9.2 says about it:

choose
The optimizer chooses between a cost-based approach and a rule-based approach based on whether statistics are available.
If the data dictionary contains statistics for at least one of the accessed tables, then the optimizer uses a cost-based approach and optimizes with a goal of best throughput. If the data dictionary contains only some statistics, then the cost-based approach is used, and the optimizer must guess the statistics for the subjects without any statistics. This can result in sub-optimal execution plans. If the data dictionary contains no statistics for any of the accessed tables, then the optimizer uses a rule-based approach.

In subsequent versions CHOOSE does not appear in the documentation as a valid value, and the “What’s New in Oracle Performance?” chapter in the documentation of Oracle 10.1 includes this sentence:

Some consequences of the desupport of RBO are:
• CHOOSE and RULE are no longer supported as OPTIMIZER_MODE initialization parameter values …

But the Oracle database keeps using CHOOSE, and therefore keeps potentially using the RBO, internally, even in version 12c. And last week I hit a performance issue that was caused because of this fact and some poor circumstances.

Here is a simplified example, running on Oracle 12.1.0.2.

We have two tables – PARENTS and CHILDREN – with an “on delete cascade” foreign key constraint.

> create table parents (
    id number not null primary key
  );

Table created.

> insert into parents select rownum from dual connect by level<=10000;

10000 rows created.

> create table children (
    id number not null primary key,
    parent_id number not null references parents (id) on delete cascade,
    filler varchar2(4000)
  );

Table created.

> insert into children
  select rownum,mod(rownum,10000)+1,lpad('x',4000,'x')
  from dual
  connect by level<=100000;

100000 rows created.

> commit;

Commit complete.

There is an index that supports the foreign key, and it is a function-based index. Note that the leading column of the index is simply the referencing column.

> create index children_idx on children (parent_id, substr(filler,1,10));

Index created.

Now let’s delete a record from the PARENTS table.

> alter system flush buffer_cache;

System altered.

> set timing on
> delete parents where id=1;

1 row deleted.

Elapsed: 00:00:27.80
> set timing off
> roll
Rollback complete.

Why did it take so much time?

When we delete a record from PARENTS, Oracle implicitly deletes the child records of this parent (because of the “on delete cascade” foreign key). This is the recursive SQL statement:

 delete from "DEMO"."CHILDREN" where "PARENT_ID" = :1

Oracle uses the CHOOSE optimizer mode for the implicit deletion from CHILDREN.

> select value,isdefault
  from v$sql_optimizer_env
  where sql_id='f7j1aq9z8k6r1'
  and child_number=0
  and name='optimizer_mode';

VALUE                ISDEFAULT
-------------------- ---------
choose               NO

1 row selected.

Unfortunately I forgot to gather statistics on the CHILDREN table. The lack of statistics means that Oracle chose to use the rule-based optimizer for the implicit DELETE statement.
And this means that it cannot use the CHILDREN_IDX index, because it is a function-based index.

> SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('f7j1aq9z8k6r1',0));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  f7j1aq9z8k6r1, child number 0
-------------------------------------
 delete from "DEMO"."CHILDREN" where "PARENT_ID" = :1

Plan hash value: 3379301338

---------------------------------------
| Id  | Operation          | Name     |
---------------------------------------
|   0 | DELETE STATEMENT   |          |
|   1 |  DELETE            | CHILDREN |
|*  2 |   TABLE ACCESS FULL| CHILDREN |
---------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("PARENT_ID"=:1)

Note
-----
   - rule based optimizer used (consider using cbo)


23 rows selected.

If I execute the DELETE FROM CHILDREN statement explicitly, the CBO chooses the good execution plan that uses the index, because the statistics on the index have been automatically gathered during the index creation. It’s the missing statistics on the table that caused the implicit (recursive) DELETE statement to use the RBO.

> alter system flush buffer_cache;

System altered.

> set timing on
> delete from "DEMO"."CHILDREN" where "PARENT_ID" = 1;

10 rows deleted.

Elapsed: 00:00:00.21
> set timing off
> select * from table(dbms_xplan.display_cursor());

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0cm664fx8b944, child number 0
-------------------------------------
delete from "DEMO"."CHILDREN" where "PARENT_ID" = 1

Plan hash value: 1030488021

----------------------------------------------------------------------------------
| Id  | Operation         | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | DELETE STATEMENT  |              |       |       |     2 (100)|          |
|   1 |  DELETE           | CHILDREN     |       |       |            |          |
|*  2 |   INDEX RANGE SCAN| CHILDREN_IDX |    10 | 40100 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PARENT_ID"=1)


19 rows selected.

Even if I didn’t have statistics on the index, the CBO would still choose the index for performing the explicit DELETE, because it would use dynamic sampling:

> exec dbms_stats.delete_index_stats (user,'CHILDREN_IDX')

PL/SQL procedure successfully completed.

> delete from "DEMO"."CHILDREN" where "PARENT_ID" = 3;

10 rows deleted.

> select * from table(dbms_xplan.display_cursor());

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  a8x8wszgx1g6r, child number 0
-------------------------------------
delete from "DEMO"."CHILDREN" where "PARENT_ID" = 3

Plan hash value: 1030488021

----------------------------------------------------------------------------------
| Id  | Operation         | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | DELETE STATEMENT  |              |       |       |     3 (100)|          |
|   1 |  DELETE           | CHILDREN     |       |       |            |          |
|*  2 |   INDEX RANGE SCAN| CHILDREN_IDX |  1066 |  2111K|     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PARENT_ID"=3)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


23 rows selected.

Finally, if the index were a regular one rather than function-based, then a good plan would have been chosen for the recursive SQL, because the RBO prefers an index range scan over a full table scan:

> drop index children_idx;

Index dropped.

> create index children_idx on children (parent_id, id);

Index created.

> alter system flush buffer_cache;

System altered.

> set timing on
> delete parents where id=4;

1 row deleted.

Elapsed: 00:00:00.14
> set timing off

> select * from table(DBMS_XPLAN.DISPLAY_CURSOR('f7j1aq9z8k6r1',0));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  f7j1aq9z8k6r1, child number 0
-------------------------------------
 delete from "DEMO"."CHILDREN" where "PARENT_ID" = :1

Plan hash value: 1030488021

------------------------------------------
| Id  | Operation         | Name         |
------------------------------------------
|   0 | DELETE STATEMENT  |              |
|   1 |  DELETE           | CHILDREN     |
|*  2 |   INDEX RANGE SCAN| CHILDREN_IDX |
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PARENT_ID"=:1)

Note
-----
   - rule based optimizer used (consider using cbo)


23 rows selected.

The “Guardian Trigger” Design Pattern

Suppose that every time we add records into the T1 table we have to do some additional stuff.
One option to implement this is by using an AFTER INSERT trigger that will perform this additional stuff, but I really dislike this option (mainly because the code becomes hidden in a way, and there may be a negative impact on performance).
I prefer writing a procedure that inserts the records into T1 and performs this additional stuff.

CREATE OR REPLACE PACKAGE my_package AS 
    PROCEDURE insert_t1 (...);
END my_package;
/
CREATE OR REPLACE PACKAGE BODY my_package AS 
    PROCEDURE insert_t1 (...) IS 
    BEGIN 
        INSERT INTO t1 ...;
        -- do some additional stuff, that must be done
        -- whenever inserting new records into T1 
    END insert_t1;
END my_package;
/

But what will happen if someone (perhaps even myself in one year from now) will insert records into T1 from another place (in another procedure or directly via an INSERT statement) and will not know (or maybe forget) about the mandatory additional stuff?
The best thing would be to make this procedure the one and only way to insert records into T1. But how to do it?
I like doing it in a way that I call “the guardian trigger”, using a combination of a global variable and a trigger (but this trigger doesn’t do actual stuff, it is only a gatekeeper).

I’m adding a Boolean global variable and initialize it to FALSE. I change it to TRUE at the beginning of the insert_t1 procedure, and make sure to set it back to FALSE when leaving the procedure.

CREATE OR REPLACE PACKAGE BODY my_package AS 
    g_is_insert_into_t1_allowed BOOLEAN := FALSE; 
    PROCEDURE insert_t1 (...) IS 
    BEGIN 
        g_is_insert_into_t1_allowed := TRUE; 
        INSERT INTO t1 ...;
        -- do some additional stuff, that must be done
        -- whenever inserting new records into T1 
        g_is_insert_into_t1_allowed := FALSE; 
    EXCEPTION 
        WHEN OTHERS THEN 
            g_is_insert_into_t1_allowed := FALSE; 
            RAISE; 
    END insert_t1;
END my_package;
/

Now I’m adding a function – a public function – to get the current value of the global variable:

CREATE OR REPLACE PACKAGE my_package AS 
    FUNCTION is_insert_into_t1_allowed RETURN BOOLEAN;  
    PROCEDURE insert_t1 (...);
END my_package;
/

CREATE OR REPLACE PACKAGE BODY my_package AS 
    g_is_insert_into_t1_allowed BOOLEAN := FALSE;
    FUNCTION is_insert_into_t1_allowed RETURN BOOLEAN IS 
    BEGIN 
        RETURN g_is_insert_into_t1_allowed; 
    END is_insert_into_t1_allowed;  
    PROCEDURE insert_t1 (...) IS 
    BEGIN 
        g_is_insert_into_t1_allowed := TRUE;
        INSERT INTO t1 ...;
        -- do some additional stuff, that must be done
        -- whenever inserting new records into T1 
        g_is_insert_into_t1_allowed := FALSE;
    EXCEPTION 
        WHEN OTHERS THEN 
            g_is_insert_into_t1_allowed := FALSE;
            RAISE;
    END insert_t1;
END my_package;
/

And as the final step I’m creating a statement level trigger – BEFORE INSERT ON T1; a trigger that doesn’t change anything, it just checks that the global variable is currently set to TRUE.

CREATE OR REPLACE TRIGGER t1_trig
    BEFORE INSERT ON t1
BEGIN 
    IF NOT my_package.is_insert_into_t1_allowed THEN 
        raise_application_error(-20000,
          'INSERTs into T1 are allowed only via my_package.insert_t1');
    END IF;
END t1_trig;
/

So if I’ll try to insert records into T1 not via the procedure the trigger will not allow it, it will raise an exception saying “INSERTs into T1 are allowed only via my_package.insert_t1”.
Obviously this pattern can be implemented for other operations, not only for INSERT.

Implementing Arc Relationships with Virtual Columns

Virtual columns were added in Oracle 11g Release 1. We can implement the same concept using views, but one of the advantages that I see in virtual columns is that we can define foreign key constraints on them. Well, we can define foreign key constraints on views as well, but only in DISABLE NOVALIDATE mode, which makes them a decoration rather than a data integrity protector. So to be more precise, we can define enabled foreign key constraints on virtual columns.

This post is about using virtual columns for implementing arc relationships.

In the following ERD, the arc represents the rule “each Entity Address must be owned by either a Person or a Company”:

There are three common ways to implement this logical data model.

Multiple Tables

We can split the ENTITY_ADDRESS entity into two tables – say, PERSON_ADDRESSES and COMPANY_ADDRESSES. Disadvantages of this solution are that we need to maintain two (or more) tables and the code parts that manipulate them, although they are very similar. In addition, if later on we need to add another entity to the relationship (for example, Store, in addition to Person and Company) we will have to add another join table – STORE_ADDRESSES – and the relevant code.

A Single Table with Dedicated Columns

We can implement the ENTITY_ADDRESS entity as a single table – say, ENTITY_ADDRESSES – with a dedicated column for each referenced entity – PERSON_ID and COMPANY_ID (in addition to ADDRESS_ID) – and a check constraint to make sure one and only one of the columns PERSON_ID and COMPANY_ID is not null. The disadvantage of this solution is that we still need to maintain the dedicated columns and the code parts that manipulate them, although they are very similar. And if we need to add another entity to the relationship we will have to add another column to the table and to add the relevant code to manipulate the new column.

A Generic Table

We can implement the ENTITY_ADDRESS entity as a single table – say, ENTITY_ADDRESSES – with a generic column for the referenced entity and a column that stores the entity type to which the generic column references – e.g., ADDRESS_ID, OWNER_TYPE_ID, OWNER_ID. The disadvantage of this solution is that we cannot define foreign key constraints from ENTITY_ADDRESSES to the PEOPLE and COMPANIES tables.

Using Virtual Columns

I’d like to suggest another solution, that, using virtual columns, overcomes the above disadvantages.
This solution is based on the generic table one, but we’ll add also a virtual column for each referenced entity. The purpose of the virtual columns is just for enforcing the referential integrity, and the application should not be aware of them. Therefore, if another entity should be added afterwards, we will add another virtual column, but we will have no code changes to make.

Here it is:

> create table people (
>   id integer not null primary key,
>   first_name varchar2(30) not null,
>   last_name varchar2(30) not null
> );

Table created.

> create table companies (
>   id integer not null primary key,
>   name varchar2(100) not null,
>   number_of_employees integer,
>   description varchar2(1000)
> );

Table created.

> create table addresses (
>   id integer not null primary key,
>   street varchar2(30),
>   house_number varchar2(10),
>   city varchar2(30),
>   country varchar2(30)
> );

Table created.

> create table entity_addresses (
>   address_id integer not null references addresses,
>   owner_type_id char(1) not null check (owner_type_id in ('P','C')),
>   owner_id integer not null,
>   --
>   primary key (owner_id,owner_type_id,address_id),
>   --
>   person_id generated always as (decode(owner_type_id,'P',owner_id)) virtual
>     constraint address_fk_people references people,
>   company_id generated always as (decode(owner_type_id,'C',owner_id)) virtual
>     constraint address_fk_companies references companies
> );

Table created.

> insert into people (id,first_name,last_name) values (1,'John','Doe');

1 row created.

> insert into companies (id,name) values (101,'DB Oriented');

1 row created.

> insert into addresses (id,street,city,country) values (1,'Dekel','Zurit','Israel');

1 row created.

> insert into entity_addresses (address_id,owner_type_id,owner_id) 
> values (1,'P',1);

1 row created.

> insert into entity_addresses (address_id,owner_type_id,owner_id) 
> values (1,'P',101);
insert into entity_addresses (address_id,owner_type_id,owner_id) values (1,'P',101)
*
ERROR at line 1:
ORA-02291: integrity constraint (DEMO5.ADDRESS_FK_PEOPLE) violated - parent key not found


> insert into entity_addresses (address_id,owner_type_id,owner_id) 
> values (1,'C',1);
insert into entity_addresses (address_id,owner_type_id,owner_id) values (1,'C',1)
*
ERROR at line 1:
ORA-02291: integrity constraint (DEMO5.ADDRESS_FK_COMPANIES) violated - parent key not found


> insert into entity_addresses (address_id,owner_type_id,owner_id) 
> values (1,'C',101);

1 row created.

Hiding the Virtual Columns

Since the virtual columns in this solution are just for enforcing the referential integrity, I prefer to hide them, so adding new entities to the arc (or removing entities from it) will be transparent to the application.

Invisible Columns

In Oracle 12c we can do it by making these columns invisible.

> alter table entity_addresses modify (
>   person_id invisible,
>   company_id invisible
> );

Table altered.

> select * from entity_addresses;

ADDRESS_ID OWN   OWNER_ID
---------- --- ----------
         1 P            1
         1 C          101

Editioning Views

Being an Edition-Based Redefinition (EBR) evangelist, my application code never refers to tables. Each table is covered by an editioning view, and the application code refers only to views. This practice enables to hide the virtual columns very easily, simply by not including them in the editioning view:

> create editioning view entity_addresses_v as
>   select address_id,
>          owner_type_id,
>          owner_id
>   from entity_addresses;

View created.

Unfortunately, EBR is not widely used 🙁
You can, however, use the same idea in non-EBR environment, by creating a regular view rather than an editioning view, and refer to this view from the application code.

3 Points about Indexes and Order

When dealing with indexes, order may be very significant, in several aspects.
Perhaps it’s not surprising after all when talking about a data structure that its purpose is to keep data in order…
We’ll refer to three different points:
1. Is the order of columns in a composite index important?
2. Which order is better: filling a table with data and then creating indexes, or creating indexes and then filling the table with data?
3. When creating several indexes, is the order of creation important?
Not always there is one correct answer that covers all the different cases, but it is always worth asking the questions before rushing to execute.

Is the order of columns in a composite index important?
Certainly yes. Let’s take a look at the following two indexes – they both contain the same two columns, but in a different order:

CREATE INDEX T_IDX_1 ON T (COL_A,COL_B);
CREATE INDEX T_IDX_2 ON T (COL_B,COL_A);

Let’s assume that the table T contains many records and that the following queries are highly selective (i.e., they return a relatively small number of records from the table), so it is likely that using an index is better than full scanning the table.

For the following query both indexes are equally good:

SELECT *
FROM T
WHERE COL_A = :VAL1
  AND COL_B = :VAL2;

But for the following query, the index T_IDX_1 is good, while T_IDX_2 is not:

SELECT *
FROM T
WHERE COL_A = :VAL1;

Although the index T_IDX_2 contains the column COL_A, it is not enough, as it does not contain it in its leading part. The order of columns in the index matters.
It’s easy to understand the difference when thinking of the following example: in the phone book the names are ordered first by last name, and then by first name. We can find very quickly all the subscribers whose last name is “Nakdimon”, but we must scan the whole phone book for finding all the subscribers whose first name is “Oren”.

Note: if the table contains a relatively small number of different unique values in the column COL_B, Oracle will still be able to use the index T_IDX_2 for answering the last query by using the Index Skip Scan access path, but still using T_IDX_1 with Index Range Scan will be better.

For the next two questions we’ll consider the following case: we need to create a new table with the following two indexes, and fill it with plenty of data:

CREATE TABLE T (
   COL_A NUMBER,
   COL_B DATE,
   COL_C VARCHAR2(4000),
   …
);
CREATE INDEX T_IDX_A ON T (COL_A);
CREATE INDEX T_IDX_BA ON T (COL_B,COL_A);

Which order is better: filling a table with data and then creating indexes, or creating indexes and then filling the table with data?
Usually working in the former order (creating the indexes when the table is already filled) will take less time than working in the latter order (filling the table when the indexes already exist), since in the latter case the indexes need to be maintained with the insertion of each and every record.

When creating several indexes, is the order of creation important?
Here the answer is positive in certain cases.
Suppose that we created the table T and filled it with many rows, and now it’s time to create the indexes. We can create T_IDX_A first and T_IDX_BA second, or vice versa. Let’s examine both options:

Option 1:

We’ll create T_IDX_A first. For that, Oracle will do a Full Table Scan of T (and will take the value of COL_A from every record, and of course the ROWID of every record to know where to point to from the index)
Now we’ll create T_IDX_BA. Once again, Oracle will do a Full Table Scan of T (and will take the values of COL_B and COL_A and the ROWID from every record)

Option 2:

We’ll create T_IDX_BA first. For that, Oracle will do a Full Table Scan of T (and will take the values of COL_B and COL_A and the ROWID from every record)
Now we’ll create T_IDX_A, and this is where the plot changes. Oracle can do a Full Table Scan of T here as well, but in this case it has another alternative, a better one in most cases. The only details that are needed in order to build the index are the values of COL_A and the ROWID of all the records in the table (where COL_A is not null), and these details already exist in the index T_IDX_BA. Therefore, Oracle can do an Index Fast Full Scan of T_IDX_BA, instead of Full Table Scan of the table.

So, if all the columns of one index are included in a second index, it is recommended to create the second index first, and only then the first index, and enable Oracle to consider more alternatives. The more the number of columns in the table that do not exist in the indexes, the more significant the improvement in the creation time of the second index (by doing Index Fast Full Scan instead of Full Table Scan) is.

You are most welcome to comment or to ask questions in this page, or to write me at oren@db-oriented.com.

Implementing One-to-One Relationships

In order to implement a one-to-many relationship we use a foreign key constraint.

To implement a many-to-many relationship Continue reading “Implementing One-to-One Relationships”