Sunday, June 26, 2011

Nested Loop Semi Join : Oracle Tuning Tip#25



TOPIC:
Nested Loop Semi Join

DEFINITION:

Before we discuss this topic in detail, we will try to understand what is meant by “semi-join”.

A “semi-join” between two tables returns rows from the first table where one or more matches is found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two or more matches for a row in the first table, only one copy of the row from the first table will be returned.

Semi-joins are written using the EXISTS or IN constructs.

The rule of thumb goes like this:
1.     If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table(subquery).
(In this case, it would be ideal to write the query in Correlated subquery format which is nothing but the column(s) from main query is referred in subquery)
2.     However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.
(In this case, it would be ideal to write the query in nested subquery format (otherwise called as un-correlated subquery format) which is nothing but the column(s) from main query is not referred in subquery so both main query and subquery are mutually exclusive standalone entities in the statement.)

As mentioned in Tip#18, Semi-join can be implemented by these 3 joining techniques,
·         Nested Loop Semi Join
·         Hash Semi Join
·         Sort Merge Semi Join

We will discuss about Nested Loop Semi Join in this forum.

In a conventional semi-join, the nested loops algorithm is desirable when the predicate(s) on the first table(main table) are very selective and the join column(s) in the second table(table in the subquery) are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not very selective in the first table(main table) or the join columns in the second table(table in the subquery) are not selectively indexed.

To perform nested loop semi join, Oracle follows these steps:
1.     The optimizer chooses the table in the main query as the outer table (or the driving table).
2.     Then, it chooses the table in the subquery as the inner table (or the driven table).
3.     For each row in the driving table, Oracle tries to find a matching record in the driven table that satisfy the join condition.
4.     If atleast one matching record exists in the driven table, then Oracle returns that record from the driving table in the output.

So, in this approach, finalizing the driving table & driven table is straightforward. As per the logic, 3rd & 4th steps are going to executed in a loop and this number of iteration has to be reduced as much as possible in order to improve the performance. So, this can be achieved only if the predicate(s) on the driving table is highly selective and the joining column(s) in the driven table is selectively indexed. Thus, in this kind of scenarios, nested loop semi join will perform better.

The logical activity diagram for this methodology will be like this,

Loop (for all the required records in the driving table)
Set Flag = false
Ø  Loop (for all the corresponding records of the driven table)
·         If (driving table.record matches driven table.record) then   è [Semi join optimization over the Conventional join]
·         Set Flag = true
·         Exit from this inner loop alone
·         End if
Ø  End inner loop
Return the record of the driving table only if flag = true
End outer loop

In the above logical activity diagram, the step highlighted in red is the main reason for semi join to perform better when compared to conventional join (natural or equi-join). The inner loop will be executed at most once if there is any matching record(s) in the driven table(table in the subquery). So, Oracle will move on to the next record in the driving table as soon as it finds the first matching record in the driven table, unlike conventional join wherein all the matching records of the driven table is found out. Since no column(s) from the driven table(table in the subquery) is going to be displayed in the output, there is no point in reading all the matching record(s) from the driven table and that’s what semi join does. Once it finds the first matching record from the driven table, it sets flag = true and exits the inner loop.

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ NL_SJ */ is the hint that can be used to impose this nested loop semi join. But this hint has to be applied to the subquery of the EXISTS or IN clause, not the main query itself.
·         No column(s) from the subquery can be returned in the output.
·         This is very successful when the outer table is smaller and the joining column(s) of the inner table is properly indexed.
·         If the query contains DISTINCT clause, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
·         If the query contains UNION set operator, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
·         If the EXISTS (or) IN clause is part of an OR branch in the query, then Oracle cannot impose this semi-join.
·         However this semi-join can be imposed in queries that contain OR clause(s) in the WHERE clause, just as long as the EXISTS (or) IN is not part of the OR.
·         Even the operator, “=ANY” can be used to impose this semi-join apart from IN & EXISTS clauses.
·         The statement must have a subquery in the IN (or) EXISTS clause in order to invoke this semi-join. It means, usual IN clause wherein hardcoded value(s) mentioned in R.H.S (Right Hand Side) of this clause cannot invoke this semi-join.

ADVANTAGE:
·         If a query is the best candidate for nested loop semi join, then this semi join can reduce logical reads, physical reads, CPU time & elapsed time dramatically.
·         If your requirement is to see the initial matching records as quick as possible, this is the best methodology to rely on. Reason is, as per the logic, though it is kinda of having the iterative structure, the matching records would be displayed in step-4 of each iteration.

DISADVANTAGE:
·         When this is opted for joining 2 big tables, you may see the initial matching records very fast but it would take a lot of time to display the final matching records.

HOW TO VERIFY:
How to verify whether Oracle follows nested loop semi join or not while executing SQL. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    NESTED LOOPS (SEMI)                           |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<outer table’s name>>        |            |
|   3 |       TABLE ACCESS (FULL)                          | <<inner table’s name>>         |            |
In the explain plan, whenever it chooses this methodology, it displays the keyword [NESTED LOOPS (SEMI)] in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving (outer) table and the next one is the driven (inner) table.

EXAMPLE:
Create an employee table and inserts 10 records. The table will look like this,

DATA TABLE:
ROWID
empid
empname
column3
column4
deptid
AAAAA1
5
CHARLES
..
..
200
AAAAA2
10
JOHN
..
..
100
AAAAA3
9
NADAL
..
..
100
AAAAA4
4
ERIN
..
..
220
AAAAA5
6
MATHEWS
..
..
200
AAAAA6
7
JAMAL
..
..
100
AAAAA7
1
PETER
..
..
220
AAAAA8
8
LUSY
..
..
200
AAAAA9
2
MARTHA
..
..
200
AAAAA10
3
ROSS
..
..
100

Create an index on this table for deptid column. (create index dept_id_indx on emp(deptid)).
Index table will logically look like this,

NON-UNIQUE INDEX TABLE:
INDEX
ROWID
100
AAAAA2
AAAAA3
AAAAA6
AAAAA10
200
AAAAA1
AAAAA5
AAAAA8
AAAAA9
220
AAAAA4
AAAAA7

First column (INDEX) : it stores all the distinct values of deptid column in the ascending order.
Second column (ROWID) : it stores the ROWID of the corresponding record(s) for that INDEX value.

Then, Create a dept table and insert 3 records. The table will look like this,

DATA TABLE:
ROWID
Deptid
Deptname
AAAAB1
100
Sales
AAAAB2
200
Delivery
AAAAB3
300
Legal


Fire this query against these table where the requirement is to display only deptid & deptname if atleast one employee is working on that department,
Select dept.deptid, dept.deptname
from dept  
where exists (select ‘X’ from emp where emp.deptid = dept.deptid);

When oracle executes this sql, since EXISTS clause is mentioned in the query, dept table is very small & deptid column of emp is indexed, Oracle will impose nested loop semi join.
As per the design of nested loop semi join, the following steps will be followed while executing this query,
1.     Oracle chooses “dept” table as the driving table and “emp” table as the driven table.
2.     It gets the first record of “dept” table, that is deptid:100.
3.     It hits the index table of “emp” table in order to find out if any index record is available with this value.
4.     Once it sees the first ROWID (AAAAA2) of the first index record (deptid:100), it stops referring this index table & returns both the columns of “dept” table in the output.
5.     It gets the second record of “dept” table, that is deptid:200.
6.     It hits the index table of “emp” table in order to find out if any index record is available with this value.
7.     Once it sees the first ROWID (AAAAA1) of the second index record (deptid:200), it stops referring this index table & returns both the columns of “dept” table in the output.
8.     It gets the third (and last) record of “dept” table, that is deptid:300.
9.     It hits the index table of “emp” table in order to find out if any index record is available with this value.
10.  Since no index records has this value (deptid:300), this record of “dept” table is not returned in the output.
11.  Since there are no more records to be processed in “dept” table, it exits.

So, the output will look like this,
deptid
deptname
100
Sales
200
Delivery

If you closely look at the output, you can see both these departments get displayed only once though each has 4 employees. This is what the expected outcome for this requirement & nested loop semi join does the same.

Now, explain plan will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                 | Name                                                       | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                               |                                                                    |     2       |
|   3 |   NESTED LOOPS (SEMI)                        |                                                                    |     2       |
|   1 |     TABLE ACCESS (FULL)                        | DEPT                                                         |     3       |
|   2 |        INDEX (RANGE SCAN)                    | DEPT_ID_INDX (NON-UNIQUE)    |     1       |

From the explain plan, we can say that both the tables are joined by nested loop semi join methodology.”Dept” table has been considered as the driving table and “Emp” table as the driven table. From this explain plan, we can also see that the actual data table of “Emp” has not been referred at all. The reason for this is, no column(s) from “Emp” table is needed to be displayed in the output and the joining column (deptid) is already in the index table so there is no need to refer its actual data table. So, this will improve the performance further. In addition to this, in “Rows” column of the explain plan, against “dept_id_indx” index table, the displayed value, “1” tells that, always the first ROWID of the index record is alone referred if there is a matching condition.

The same requirement can be met through the query with the usual conventional join like this,
Select distinct dept.deptid, dept.deptname
from dept, emp
where emp.deptid = dept.deptid;

Even this query will also return the same expected output but the explain plan will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                | Name                                                       | Rows  |
--------------------------------------------------------------------------------------------------------
|   5 | SELECT STATEMENT                              |                                                                    |     2       |
|   4 | UNIQUE                                                     |                                                                    |     2       |
|   3 |   NESTED LOOPS                                     |                                                                    |     8       |
|   1 |     TABLE ACCESS (FULL)                       | DEPT                                                        |     3       |
|   2 |        INDEX (RANGE SCAN)                   | DEPT_ID_INDX (NON-UNIQUE)   |     4       |

The difference is that, the value displayed against “dept_id_indx” index table in “Rows” column of the explain plan. The displayed value, “4” in the explain plan tells that, all the four ROWIDs of both the index records (deptid:100 & deptid:200) are referred but the duplicate values are suppressed by “distinct” clause that requires some additional work. “UNIQUE” operation in the explain plan tells that DISTINCT clause is mentioned in the query and it suppresses the duplicate records from 8 to 2. This is the reason we should opt for nested loop semi join for this type of scenarios rather than going for the usual conventional join.