Sunday, February 20, 2011

Hash Outer Join : Oracle Tuning Tip#22


TOPIC:
Hash Outer Join

DEFINITION:

As per the definition of hash outer join, the driving (or parent) table whose rows are being preserved is used to build the hash table, and the driven (or child) table is used to probe the hash table.

To perform this hash outer join, Oracle follows these steps (assume, child table has to be outer-joined with parent table):
  • 1.     Oracle chooses parent table as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table.
  • 2.     Oracle chooses child table as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
  • 3.     Oracle returns the output if a record from the driving table is already present in the same hash key and will flag this row as “MATCHED” in the parent hash table else no record will be returned. [ I used “MATCHED” flag term to explain this step in an easy way].
  • 4.     After all the records of driven table is processed, Oracle will traverse through parent driving table once again and the rows not marked as “MATCHED” will be returned along with NULL values for child probing table’s columns mentioned in the query.


As you aware, in hash equi-join, deciding which table is going to be the hash table is the deciding factor. Generally, the smallest of the two tables will be considered as the driving table so that cost will be low. [cost will be low because the smallest of two tables will fit perfectly in RAM, occupy less amount of memory space in RAM and child table will completely be traversed in a single pass]. However, in this hash outer join, this decision making process becomes immaterial since Parent table will be chosen as the hash table by default irrespective of whether it is a small or big table. Reason is, joining condition determines which table is going to be the hash table in this hash outer join instead of cost which is the metric used to identify the hash table in the hash equi-join.

The logical activity diagram for this methodology will be like this,

Read all the required data from the parent table and built the hash table in RAM after applying hash function
Loop (for all the required records in the child table)
  • ·         Apply the hash function for each record to get the address [hashed value] in RAM
  • ·         Hit RAM in the corresponding address
  • ·         If a record from the driving table is already present in the same hash key, flag this record of parent hash table as “MATCHED” and return the columns from both the tables mentioned in the query else go the next record in child table

End loop
Loop (for all the records in the parent hash table which are not flagged as “MATCHED”)
  • ·         Return the columns from the hash table along with NULL values for child probing table’s columns mentioned in the query

End loop

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ USE_HASH(<<driving table>>) */ is the hint that can be used to impose this hash outer join.
·         This methodology is opted by Oracle only if both the tables are joined by outer join “(+)” operator.
·         In RAM, hash outer join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
·         Only driving table will be inserted in the hash table in RAM and the driven table will never be written into the hash table in RAM. Hash function will be applied in the driven table only to locate the corresponding memory entry in the hash table to look for the matching entry from the driving table there.
·         This is very successful when one table is smaller and another table is bigger. In this case, it outplays the nested loop outer join.
·         If the driving table cannot be hashed in RAM in single pass (i.e., at one shot or one go), then a portion of the hash-table spills to disk(actually, TEMP tablespace will be used here to hold that spilled dataset from the driving table). When the hash table is probed by the driven table, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest [of the driven table] are also written into TEMP tablespace and joined in the second pass. The bigger driving table is, the smaller the proportion of the hash table that can fit in RAM, the remaining data will be spilled into TEMP tablespace and have to go through the subsequent passes till it processes all the records spilled into TEMP tablespace from both these driving and driven table. This slows the Hash Join down considerably and also makes the join non-scalable in this scenario.

ADVANTAGE:
·         If the driving table is small enough to fit in RAM, outer join “(+)” operator is used in the query & the driven table is very big, then the hash outer join outplays the nested loop outer join.
·         Still nested loop outer join will be the fastest way to retrieve the first matching record [since hash outer join takes some time to build the hash table in RAM for the driving table as it applies hash function] but hash outer join will be the fastest way to get all the matching records quickly when compared to the nested loop outer join. In general, hash function (in hash outer join) will perform better and more efficiently.

DISADVANTAGE:
·         When this is opted for joining 2 big tables, TEMP tablespace will be used extensively as the spilled dataset from both driving and driven tables will be written into TEMP tablespace
·         The above scenario makes hash outer join as more non-scalable as it has to go through subsequent passes to join these spilled dataset (available in TEMP tablespace) from both these tables

HOW TO VERIFY:
How to verify whether Oracle follows hash outer join or not while executing the sql query. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    HASH JOIN (OUTER)                               |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<parent table’s name>>     |            |
|   3 |       TABLE ACCESS (FULL)                          | <<child table’s name>>          |            |
In the explain plan, whenever it chooses this methodology, it displays the keyword “HASH JOIN (OUTER)” in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving table (otherwise called as hash table) and that’s why you will find parent table’s name in that position followed by the driven table’s name (otherwise called as probe table).


EXAMPLE:

In order to understand how the mechanism of hash outer join is different from hash equi-join, I am taking the same example which I used there (Tip#20) with a few modifications and imposing the outer join.

Create 2 tables.
One table is store the information about all the employees and inserts 6 records.
Another table is store the information of the sales done by the employees and inserts 3 records.
The tables will look like this,

DATA TABLE:
EMP:
ROWID
Empid
Empname
AAAAA1
3825
CHARLES
AAAAA2
9827
FERGUSON
AAAAA3
2389
NADAL
AAAAA4
1784
ERIN
AAAAA5
4556
ROONEY
AAAAA6
8711
TOMMY

SALES:
ROWID
Empid
Sales_Amt
AAAAA7
9827
6000
AAAAA8
2389
8500
AAAAA9
5642
4000


Fire this query against these tables where the requirement is to display the employee’s name & sales_amt (if they have done) of all the employess available in EMP table,
Select a.empname, b.sales_amt
from emp a, sales b
where a.empid = b.empid(+);

since outer-join “(+)” operator is used , hash join will be used to join these 2 tables.

As per this query, emp is the parent table and bonus is the child table since the latter[sales.empid(+)] is outer-joined with the former. So, oracle will consider emp as the driving able and bonus as the driven table though the latter is smaller than the former. That’s the reason cost is not the deciding factor and join condition is the deciding factor in hash outer join.

Assume, Oracle can hold 10 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume the memory address will be from 0 to 9 to hold the records in the hash table.
For easy understanding, pls assume MOD function to the base 10 will be used as hash function here. In real world, the hash function would be more complex and hard to decode it.

EMPTY HASH TABLE:
=================

As per the design of hash outer join, the following steps will be followed while executing this query,
1.     Since SALES is outer-joined with EMP, EMP is considered as the driving table and SALES is considered as the driven table
2.     For the first record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)èmod(3825/10)è5)
3.     So, this record will be inserted as the 6th record in hash table
4.     For the next record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
5.     So, this record will be inserted as the 8th record in hash table
6.     For the next record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
7.     So, this record will be inserted as the 10th record in hash table
8.     For the next record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)èmod(1784/10)è4)
9.     So, this record will be inserted as the 5th record in hash table
10.  For the next record in EMP, hash function is applied and it returns 6 as the hashed value (i.e.,hash(4556)èmod(4556/10)è6)
11.  So, this record will be inserted as the 7th record in hash table
12.  For the next record in EMP, hash function is applied and it returns 1 as the hashed value (i.e.,hash(8711)èmod(8711/10)è1)
13.  So, this record will be inserted as the 2nd record in hash table. So, the hash table will look like this at the end of this step,

14.  For the first record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
15.  So, it hits 8th record of hash table
16.  Since there is a record from EMP is available in 8th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
17.  For the second record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
18.  So, it hits 10th record of hash table
19.  Since there is a record from EMP is available in 10th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
20.  For the third(last) record in SALES, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5642)èmod(5642/10)è2)
21.  So, it hits 3rd record of hash table
22.  Since there is no record from EMP is available in 3rd record of hash table, no data will be returned in the output

23.  All the records from hash table is being processed once more those are not flagged as “MATCHED”, all those records (empids:8711,1784,3825,4556) from the hash table will be returned along with NULL for the child table’s columns mentioned [sales_amt column in this case] in the query.
24.  It exits.

So, the output will look like this,
Empname
Sales_amt
FERGUSON
6000
NADAL
8500
TOMMY
ERIN
CHARLES
ROONEY


If you close look at the output, you can see that the output is displayed in such a way that, matching records are getting displayed at first and then followed by the unmatching records available only in EMP table.

Now, explain table will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                   |                                                          |  6          |
|   3 |    HASH JOIN (OUTER)                               |                                                          |  6          |
|   1 |          TABLE ACCESS (FULL)                       | EMP                                               |  6          |
|   2 |          TABLE ACCESS (FULL)                       | SALES                                             |  3          |
From the explain plan, we can say that both the tables are joined by hash outer join methodology, EMP is considered as the driving table [hash table] and SALES is considered as the driven table [probe table].

Saturday, February 19, 2011

Nested Loop Outer Join : Oracle Tuning Tip#21



TOPIC:
Nested Loop Outer Join

DEFINITION:

Before we go into this topic in detail, first we will understand what outer join is meant by.
For an example, take 2 tables which have to be joined, say “Parent” table & “Child” table [I have taken these table names, only to discuss this topic in an easy way].

If Child table is outer joined with Parent table, then you can say, the matching records from both these tables will be retrieved and then the records which are available only in Parent table will also be retrieved though the corresponding matching records are not available in Child table. This is called outer join. [in natural or equi-join, only the matching records from both the tables will be retrieved.]

To perform this outer join, 3 different join techniques have been provided by oracle.
  • ·         Nested loop outer join
  • ·         Hash outer join
  • ·         Sort merge outer join

In this, we will discuss about nested loop outer join.

To perform this nested loop outer join, Oracle follows these steps (assume, child table has to be outer-joined with parent table):
  • 1.     The optimizer chooses parent table as the outer table, or the driving table. Child table is chosen as the inner table (or the driven table).
  • 2.     For each row in the driving table, if a matching record exists in the driven table, columns (mentioned in the query) from both the tables will be returned.
  • 3.     If a matching record doesn’t exist in the driven table, columns from parent table alone will be returned and NULL will be returned in the places of child table’s columns mentioned in the query. 
As you aware, in the nested loop equi-join, deciding which table is going to be the driving table is the deciding factor. Generally, the smallest of the two tables will be considered as the driving table so that cost will be low. [cost will be low because only minimal number of looping has to be done]. However, in this nested loop outer join, this decision making process becomes immaterial since Parent table will be chosen as the driving table by default irrespective of whether it is a small or big table. Reason is, joining condition determines which table is going to be the driving table in this nested loop outer join instead of cost which is the metric used to identify the driving table in the nested loop equi-join.

The logical activity diagram for this methodology will be like this,

LOOP (for all the required rows in the driving table)
  • ·         IF (a matching record from the driven table is available) THEN

Ø  output the matching record and display both the tables’ columns mentioned in the query;
  • ·         ELSE (a matching record from the driven table is not available)

Ø  Output this unmatched driving table’s record anyway and display NULL for the driven table’s columns mentioned in the query;
  • ·         END IF;

END LOOP;

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ USE_NL(<<inner table>>) */ is the hint that can be used to impose this nested loop outer join.
·         This methodology is opted by Oracle when the tables are joined by outer join “(+)” operator.
·         This is very successful when the parent table is smaller, child table is bigger and child table is outer-joined with parent table

ADVANTAGE:
·         If your requirement is to see the initial matching records (though outer-join is mentioned) as quick as possible, this is the best methodology to rely on. Reason is, as per the logic, though it is kinda having the iterative structure, the matching records would be displayed in each iteration.
·         If parent table is small, then looping has to be done for minimal number of times and this technique will work efficiently.

DISADVANTAGE:
·         When this is opted for joining 2 big tables, you may see the initial matching records very fast but it would take a lot of time to display the final matching records.
·         This is very resource intense process (especially CPU is utilized a lot) since it has the iterative structure in the process

HOW TO VERIFY:
How to verify whether Oracle follows nested loop outer join or not while executing the sql query. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    NESTED LOOPS (OUTER)                       |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<parent table’s name>>      |            |
|   3 |       TABLE ACCESS (FULL)                          | <<child table’s name>>          |            |
In the explain plan, whenever it chooses this methodology, it displays the keyword [NESTED LOOPS (OUTER)] in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving table and that’s why you will find parent table’s name in that position followed by child table’s name.

EXAMPLE:

In order to understand how the mechanism of nested loop outer join is different from nested loop equi-join, I am taking the same example which I used there (Tip#18) with a few modifications and imposing the outer join.
Create an employee table and inserts 10 records. The table will look like this,

DATA TABLE:
ROWID
Empid
column2
column3
salary
Empname
AAAAA1
5
..
..
1130
CHARLES
AAAAA2
10
..
..
1890
JOHN
AAAAA3
9
..
..
1800
NADAL
AAAAA4
4
..
..
1270
ERIN
AAAAA5
6
..
..
1900
FEGRUSON
AAAAA6
7
..
..
1740
MIKE
AAAAA7
1
..
..
1620
PETER
AAAAA8
8
..
..
1290
SILVA
AAAAA9
2
..
..
1580
MARTHA
AAAAA10
3
..
..
1200
BECKY

Create an unique index on this table for empid column. (create unique index emp_no_indx on emp(empid)).
Index table will logically look like this,

UNIQUE INDEX TABLE:
INDEX
ROWID
1
AAAAA7
2
AAAAA9
3
AAAAA10
4
AAAAA4
5
AAAAA1
6
AAAAA5
7
AAAAA6
8
AAAAA8
9
AAAAA3
10
AAAAA2

First column (INDEX) : it stores all the values of empid column in the ascending order.
Second column (ROWID) : it stores the ROWID of the corresponding record.

Then, Create an another table called, “bonus” which stores only the information about the employees who are getting the bonus and how much they get. The table will look like this,

DATA TABLE:
ROWID
Empid
Group
Bonus_amt
AAAAB1
6
SERVICE
90
AAAAB2
3
SERVICE
80
AAAAB3
9
HR
80


Fire this query against these tables where the requirement is to display the name, salary, bonus (if they are getting) of all the employess, so the query will be like this,
Select emp.empname,emp.salary,bonus.bonus_amt
from emp, bonus
where emp.empid = bonus.empid(+);

As per this query, emp is the parent table and bonus is the child table since the latter[bonus.empid(+)] is outer-joined with the former.
So, oracle will consider emp as the driving able and bonus as the driven table though no indexes are available in the latter. That’s the reason cost is not the deciding factor and join condition is the deciding factor in nested loop outer join.

As per the design of nested loop outer join, the following steps are followed while executing this query,
1.     Oracle chooses “emp” table as the driving table and “bonus” table as the driven table.
2.     It gets the first record of “emp” table, that is empid:5.
3.     It hits “bonus” table and try to find the matching record for empid:5.
4.     Since there is no empid:5 in “bonus” table, Empname & salary columns from “emp” table gets displayed and nothing gets displayed for bonus_amt column.
5.     It gets the next record of “emp” table, that is empid:10.
6.     It hits “bonus” table and try to find the matching record for empid:10.
7.     Since there is no empid:10 in “bonus” table, Empname & salary columns from “emp” table gets displayed and nothing gets displayed for bonus_amt column.
8.     It gets the next record of “emp” table, that is empid:9.
9.     It hits “bonus” table and try to find the matching record for empid:9.
10.  Since there is empid:9 in “bonus” table, Empname & salary columns from “emp” table gets displayed and bonus_amt column from “bonus” table gets displayed.
11.  It gets the next record of “emp” table, that is empid:4.
12.  It hits “bonus” table and try to find the matching record for empid:4.
13.  Since there is no empid:4 in “bonus” table, Empname & salary columns from “emp” table gets displayed and nothing gets displayed for bonus_amt column.
14.  It gets the next record of “emp” table, that is empid:6.
15.  It hits “bonus” table and try to find the matching record for empid:6.
16.  Since there is empid:6 in “bonus” table, Empname & salary columns from “emp” table gets displayed and bonus_amt column from “bonus” table gets displayed.
17.  It gets the next record of “emp” table, that is empid:7.
18.  It hits “bonus” table and try to find the matching record for empid:7.
19.  Since there is no empid:7 in “bonus” table, Empname & salary columns from “emp” table gets displayed and nothing gets displayed for bonus_amt column.
20.  It gets the next record of “emp” table, that is empid:1.
21.  It hits “bonus” table and try to find the matching record for empid:1.
22.  Since there is no empid:1 in “bonus” table, Empname & salary columns from “emp” table gets displayed and nothing gets displayed for bonus_amt column.
23.  It gets the next record of “emp” table, that is empid:8.
24.  It hits “bonus” table and try to find the matching record for empid:8.
25.  Since there is no empid:8 in “bonus” table, Empname & salary columns from “emp” table gets displayed and nothing gets displayed for bonus_amt column.
26.  It gets the next record of “emp” table, that is empid:2.
27.  It hits “bonus” table and try to find the matching record for empid:2.
28.  Since there is no empid:2 in “bonus” table, Empname & salary columns from “emp” table gets displayed and nothing gets displayed for bonus_amt column.
29.  It gets the next record of “emp” table, that is empid:3.
30.  It hits “bonus” table and try to find the matching record for empid:3.
31.  Since there is empid:3 in “bonus” table, Empname & salary columns from “emp” table gets displayed and bonus_amt column from “bonus” table gets displayed.
32.  Since there is no more records to be processed in “emp” table, it exits.

So, the output will look like this,
Empname
salary
Bonus_amt
CHARLES
1130
JOHN
1890
NADAL
1800
80
ERIN
1270
FEGRUSON
1900
90
MIKE
1740
PETER
1620
SILVA
1290
MARTHA
1580
BECKY
1200
80


Now, explain table will look like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                          | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                        |                                                          |     10     |
|   3 |   NESTED LOOPS (OUTER)                             |                                                          |     10     |
|   1 |     TABLE ACCESS (FULL)                                 | EMP                                                |     10     |
|   2 |     TABLE ACCESS (FULL)                                 | BONUS                                          |     4       |

From the explain plan, we can say that both the tables have been outer-joined by nested loop outer join methodology.”Emp” table has been considered as the driving table and “Bonus” table as the driven table.