Sunday, February 20, 2011

Hash Outer Join : Oracle Tuning Tip#22


TOPIC:
Hash Outer Join

DEFINITION:

As per the definition of hash outer join, the driving (or parent) table whose rows are being preserved is used to build the hash table, and the driven (or child) table is used to probe the hash table.

To perform this hash outer join, Oracle follows these steps (assume, child table has to be outer-joined with parent table):
  • 1.     Oracle chooses parent table as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table.
  • 2.     Oracle chooses child table as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
  • 3.     Oracle returns the output if a record from the driving table is already present in the same hash key and will flag this row as “MATCHED” in the parent hash table else no record will be returned. [ I used “MATCHED” flag term to explain this step in an easy way].
  • 4.     After all the records of driven table is processed, Oracle will traverse through parent driving table once again and the rows not marked as “MATCHED” will be returned along with NULL values for child probing table’s columns mentioned in the query.


As you aware, in hash equi-join, deciding which table is going to be the hash table is the deciding factor. Generally, the smallest of the two tables will be considered as the driving table so that cost will be low. [cost will be low because the smallest of two tables will fit perfectly in RAM, occupy less amount of memory space in RAM and child table will completely be traversed in a single pass]. However, in this hash outer join, this decision making process becomes immaterial since Parent table will be chosen as the hash table by default irrespective of whether it is a small or big table. Reason is, joining condition determines which table is going to be the hash table in this hash outer join instead of cost which is the metric used to identify the hash table in the hash equi-join.

The logical activity diagram for this methodology will be like this,

Read all the required data from the parent table and built the hash table in RAM after applying hash function
Loop (for all the required records in the child table)
  • ·         Apply the hash function for each record to get the address [hashed value] in RAM
  • ·         Hit RAM in the corresponding address
  • ·         If a record from the driving table is already present in the same hash key, flag this record of parent hash table as “MATCHED” and return the columns from both the tables mentioned in the query else go the next record in child table

End loop
Loop (for all the records in the parent hash table which are not flagged as “MATCHED”)
  • ·         Return the columns from the hash table along with NULL values for child probing table’s columns mentioned in the query

End loop

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ USE_HASH(<<driving table>>) */ is the hint that can be used to impose this hash outer join.
·         This methodology is opted by Oracle only if both the tables are joined by outer join “(+)” operator.
·         In RAM, hash outer join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
·         Only driving table will be inserted in the hash table in RAM and the driven table will never be written into the hash table in RAM. Hash function will be applied in the driven table only to locate the corresponding memory entry in the hash table to look for the matching entry from the driving table there.
·         This is very successful when one table is smaller and another table is bigger. In this case, it outplays the nested loop outer join.
·         If the driving table cannot be hashed in RAM in single pass (i.e., at one shot or one go), then a portion of the hash-table spills to disk(actually, TEMP tablespace will be used here to hold that spilled dataset from the driving table). When the hash table is probed by the driven table, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest [of the driven table] are also written into TEMP tablespace and joined in the second pass. The bigger driving table is, the smaller the proportion of the hash table that can fit in RAM, the remaining data will be spilled into TEMP tablespace and have to go through the subsequent passes till it processes all the records spilled into TEMP tablespace from both these driving and driven table. This slows the Hash Join down considerably and also makes the join non-scalable in this scenario.

ADVANTAGE:
·         If the driving table is small enough to fit in RAM, outer join “(+)” operator is used in the query & the driven table is very big, then the hash outer join outplays the nested loop outer join.
·         Still nested loop outer join will be the fastest way to retrieve the first matching record [since hash outer join takes some time to build the hash table in RAM for the driving table as it applies hash function] but hash outer join will be the fastest way to get all the matching records quickly when compared to the nested loop outer join. In general, hash function (in hash outer join) will perform better and more efficiently.

DISADVANTAGE:
·         When this is opted for joining 2 big tables, TEMP tablespace will be used extensively as the spilled dataset from both driving and driven tables will be written into TEMP tablespace
·         The above scenario makes hash outer join as more non-scalable as it has to go through subsequent passes to join these spilled dataset (available in TEMP tablespace) from both these tables

HOW TO VERIFY:
How to verify whether Oracle follows hash outer join or not while executing the sql query. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    HASH JOIN (OUTER)                               |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<parent table’s name>>     |            |
|   3 |       TABLE ACCESS (FULL)                          | <<child table’s name>>          |            |
In the explain plan, whenever it chooses this methodology, it displays the keyword “HASH JOIN (OUTER)” in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving table (otherwise called as hash table) and that’s why you will find parent table’s name in that position followed by the driven table’s name (otherwise called as probe table).


EXAMPLE:

In order to understand how the mechanism of hash outer join is different from hash equi-join, I am taking the same example which I used there (Tip#20) with a few modifications and imposing the outer join.

Create 2 tables.
One table is store the information about all the employees and inserts 6 records.
Another table is store the information of the sales done by the employees and inserts 3 records.
The tables will look like this,

DATA TABLE:
EMP:
ROWID
Empid
Empname
AAAAA1
3825
CHARLES
AAAAA2
9827
FERGUSON
AAAAA3
2389
NADAL
AAAAA4
1784
ERIN
AAAAA5
4556
ROONEY
AAAAA6
8711
TOMMY

SALES:
ROWID
Empid
Sales_Amt
AAAAA7
9827
6000
AAAAA8
2389
8500
AAAAA9
5642
4000


Fire this query against these tables where the requirement is to display the employee’s name & sales_amt (if they have done) of all the employess available in EMP table,
Select a.empname, b.sales_amt
from emp a, sales b
where a.empid = b.empid(+);

since outer-join “(+)” operator is used , hash join will be used to join these 2 tables.

As per this query, emp is the parent table and bonus is the child table since the latter[sales.empid(+)] is outer-joined with the former. So, oracle will consider emp as the driving able and bonus as the driven table though the latter is smaller than the former. That’s the reason cost is not the deciding factor and join condition is the deciding factor in hash outer join.

Assume, Oracle can hold 10 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume the memory address will be from 0 to 9 to hold the records in the hash table.
For easy understanding, pls assume MOD function to the base 10 will be used as hash function here. In real world, the hash function would be more complex and hard to decode it.

EMPTY HASH TABLE:
=================

As per the design of hash outer join, the following steps will be followed while executing this query,
1.     Since SALES is outer-joined with EMP, EMP is considered as the driving table and SALES is considered as the driven table
2.     For the first record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)èmod(3825/10)è5)
3.     So, this record will be inserted as the 6th record in hash table
4.     For the next record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
5.     So, this record will be inserted as the 8th record in hash table
6.     For the next record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
7.     So, this record will be inserted as the 10th record in hash table
8.     For the next record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)èmod(1784/10)è4)
9.     So, this record will be inserted as the 5th record in hash table
10.  For the next record in EMP, hash function is applied and it returns 6 as the hashed value (i.e.,hash(4556)èmod(4556/10)è6)
11.  So, this record will be inserted as the 7th record in hash table
12.  For the next record in EMP, hash function is applied and it returns 1 as the hashed value (i.e.,hash(8711)èmod(8711/10)è1)
13.  So, this record will be inserted as the 2nd record in hash table. So, the hash table will look like this at the end of this step,

14.  For the first record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
15.  So, it hits 8th record of hash table
16.  Since there is a record from EMP is available in 8th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
17.  For the second record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
18.  So, it hits 10th record of hash table
19.  Since there is a record from EMP is available in 10th record of hash table, this record will be flagged as “MATCHED” and data will be returned in the output
20.  For the third(last) record in SALES, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5642)èmod(5642/10)è2)
21.  So, it hits 3rd record of hash table
22.  Since there is no record from EMP is available in 3rd record of hash table, no data will be returned in the output

23.  All the records from hash table is being processed once more those are not flagged as “MATCHED”, all those records (empids:8711,1784,3825,4556) from the hash table will be returned along with NULL for the child table’s columns mentioned [sales_amt column in this case] in the query.
24.  It exits.

So, the output will look like this,
Empname
Sales_amt
FERGUSON
6000
NADAL
8500
TOMMY
ERIN
CHARLES
ROONEY


If you close look at the output, you can see that the output is displayed in such a way that, matching records are getting displayed at first and then followed by the unmatching records available only in EMP table.

Now, explain table will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                   |                                                          |  6          |
|   3 |    HASH JOIN (OUTER)                               |                                                          |  6          |
|   1 |          TABLE ACCESS (FULL)                       | EMP                                               |  6          |
|   2 |          TABLE ACCESS (FULL)                       | SALES                                             |  3          |
From the explain plan, we can say that both the tables are joined by hash outer join methodology, EMP is considered as the driving table [hash table] and SALES is considered as the driven table [probe table].

5 comments:

  1. Thanks! The little-known-fact about ansi-outer-join not doing hash join outer helped me out!

    ReplyDelete
  2. Thank's a lot for the clear explanation!

    ReplyDelete
  3. Good explanation Kali ;-)

    Appreciate your efforts

    ReplyDelete
  4. can we have index on table while doing hash right outer join to avoid full table acess?

    ReplyDelete