Sunday, March 4, 2012

Hash Semi Join : Oracle Tuning Tip#26



TOPIC:
Hash Semi Join

DEFINITION:

As described semi-join in previous forum (Tip#25), a semi-join between two tables returns rows from the outer table where one or more matches are found in the inner table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two or more matches for a row in the first table, only one copy of the row from the first table will be returned. Semi-joins are written using the EXISTS or IN constructs.

We will discuss about Hash Semi Join in this forum.

In hash semi join, Oracle uses RAM memory to speed up the join. In this, Oracle does a scan of the driving table, builds a RAM hash table, and then probes for matching rows in the driven table. For certain types of SQL, the hash semi join will execute faster than the nested loop semi join, but the former uses more RAM resources. Invoking of hash semi join by the SQL optimizer is heavily favored by the setting of hash_area_size Oracle parameter. Larger the value for hash_area_size, more hash semi joins the optimizer will invoke. 

In a conventional semi-join, the nested loops algorithm is desirable when the predicate(s) on the first table(main table) are very selective and the join column(s) in the second table(table in the subquery) are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not very selective in the first table(main table) or the join columns in the second table(table in the subquery) are not selectively indexed.

First table, main table, outer table & driving table are just different terminologies but represent the same entity (hash table). Likewise, second table, subquery table, inner table & driven table are just different terminologies but represent the same entity (probe table).

To perform hash semi join, Oracle follows these steps:
1.     Oracle chooses the table in the main query as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table.
2.     Oracle chooses the table in the subquery as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
3.     If a record from the probe table hits the hash table where a record from the driving table is already available, then Oracle will flag this hash row as “MATCHED NEW” and returns the record of the driving table. [ I used “MATCHED NEW” flag term to explain this step in an easy way].
4.     If an another record from the probe table hits the same row in the hash table again, then Oracle will flag this hash row as “MATCHED ALREADY” but that driving table record will not be returned again. By this logic, it meets the requirement of semi join.

As you aware, in hash equi-join, deciding which table is going to be the hash table is the deciding factor. Generally, the smallest of the two tables will be considered as the driving table so that cost will be low. [cost will be low because the smallest of two tables will fit perfectly in RAM, occupy less amount of memory space in RAM and child table will completely be traversed in a single pass]. However, in this hash semi join, this decision making process becomes immaterial since the outer table will be chosen as the hash table by default irrespective of whether it is a small or big table.

The logical activity diagram for this methodology will be like this,

Read all the required data from the outer table and built the hash table in RAM after applying hash function
Loop (for all the required records in the inner table)
·         Apply the hash function for each record to get the address [hashed value] in RAM
·         Hit RAM in the corresponding address
·         If this probe table record hits the hash table where a record from the driving table is already available, then flag this hash row as “MATCHED NEW” if it is the first time. Flag this same hash row as “MATCHED ALREADY” from the second and subsequent hits (from other matching records of the probing table) onwards.
·         If it is a “hit”, then return the record of the driving table only if flag = MATCHED NEW   è [Semi join requirement is met]
End loop

In the above logical activity diagram, the step highlighted in red is the important step where the hash semi join meets the semi join requirement. In hash semi join, if there are more than 1 matching record in the probe table, all those matching records will be probed and will hit the same row in the hash table but the record of hash table will be returned only once when the flag is set as MATCHED NEW. By this, displaying the same hash table record (in case of hits) again & again is being avoided and thus meet the requirement of semi join.

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ HASH_SJ */ is the hint that can be used to impose this hash semi join. But this hint has to be applied inside the subquery of the EXISTS or IN clause, not in the main query itself.
·         From 10g version onwards, oracle can switch the driving table while imposing hash semi join based on the sizes of the row source. The hash table and probe table was fixed in hash semi join till 9i. From 10g onwards, the hash table and probe table can be switched based on the size of the outer & inner tables. Because of this new logic, an extended version of hash semi join is introduced which is named “hash right semi join” from 10g version onwards. We will be discussing this new join in the next forum.
·         An in-line view with NO_MERGE hint could be used to isolate the ill effect of DISTINCT keyword so the semi-join access paths would not be defeated. This is applicable for all type of semi-joins.
·         Oracle transforms a subquery with semi-join into a conventional equi-join if at all possible. Oracle does not consider cost when deciding whether or not to do this transformation. So Oracle can execute a query with EXISTS (or) IN clause either by opting for any semi-join access path or conventional join access path followed by a sort to remove duplicate rows if the transformation happens. Implementation of Hash Semi join has been getting improved in each Oracle version. So higher the version, Oracle may not be opting to do this transformation and just go with any semi-join access path if EXISTS (or) IN clause is used in the query. This is applicable for all type of semi-joins.
·         In RAM, hash semi join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
·         Only the driving table will be inserted as the hash table in RAM and the driven table will never be written into RAM. Hash function will be applied in the driven table only to locate the corresponding memory entry in the hash table to look for the matching entry from the driving table there.
·         No column(s) from the subquery can be returned in the output.
·         If the query contains DISTINCT clause, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
·         If the query contains UNION set operator, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
·         If the EXISTS (or) IN clause is part of an OR branch in the query, then Oracle cannot impose this semi-join.
·         However this semi-join can be imposed in queries that contain OR clause(s) in the WHERE clause, just as long as the EXISTS (or) IN is not part of the OR.
·         Even the operator, “=ANY” can be used to impose this semi-join apart from IN & EXISTS clauses.
·         The statement must have a subquery in the IN (or) EXISTS clause in order to invoke this semi-join. It means, usual IN clause wherein hardcoded value(s) mentioned in R.H.S (Right Hand Side) of this clause cannot invoke this semi-join.

ADVANTAGE:
·         This is very successful when each record in the outer table has 1 or few matching record(s) in the inner table. In this scenario, it outperforms nested loop semi join even if the joining columns in the inner table are properly indexed.
·         If a query is the best candidate for hash semi join, then this semi join can reduce logical reads, CPU time & elapsed time dramatically but will increase physical reads (with no harm).
·         This is very successful when the joining column(s) of the inner table is not properly indexed.
·         Still nested loop semi join will be the fastest way to retrieve the first matching record [since hash semi join takes some time to build the hash table in RAM for the driving table as it applies hash function] but hash semi join will be the fastest way to get all the matching records quickly when compared to nested loop semi join. In general, hash function (in hash semi join) will perform better and more efficiently.

DISADVANTAGE:
·         If hash semi join is imposed in a query wherein if each record in the outer table has lots of matching records in the inner table, then it will behave very badly. The reason is, as per hash semi join, all the required records in the inner table will be probed and thus unnecessarily waste time there instead of stopping once the first matching record is seen. In this scenario, it will be better if nested loop semi join is used.
·         If the allocated space in RAM memory is not large enough to hold the entire hash table, Oracle will use the temporary tablespace, and then complete the join operation. If this join uses the temporary tablespace, then the execution time of the query will get affected badly.

HOW TO VERIFY:
How to verify whether Oracle follows hash semi join or not while executing SQL. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    HASH JOIN (SEMI)                                   |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<outer table’s name>>        |            |
|   3 |       TABLE ACCESS (FULL)                          | <<inner table’s name>>         |            |
In the explain plan, whenever it chooses this methodology, it displays the keyword [HASH JOIN (SEMI)] in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving (hash) table and the next one is the driven (probe) table.

EXAMPLE:

Create 2 tables.
One table is store the information about all the employees and inserts 4 records.
Another table is store the information of the sales done by the employees and inserts 4 records.
The tables will look like this,

DATA TABLE:
EMP:
ROWID
Empid
Empname
AAAAA1
3825
CHARLES
AAAAA2
9827
FERGUSON
AAAAA3
2389
NADAL
AAAAA4
1784
ERIN

SALES:
ROWID
Empid
Sales_Amt
AAAAB1
9827
8500
AAAAB2
2389
8500
AAAAB3
5642
6000
AAAAB4
9827
2000

Fire this query against these tables where the requirement is to display the employee’s id & name of all the employess who has done atleast single sales transaction,
Select a.empid, a.empname  
from emp a  
where exists (select 1 from sales b where b.empid = a.empid);

since “exists” operator is used and there is no index available in both the tables, hash semi join will be used to join these 2 tables. As per this query, emp is the driving table and sales is the driven table since the latter is inside the subquery. So, oracle will consider emp as the hash able and sales as the probe table.

Assume, Oracle can hold 10 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume the memory address will be from 0 to 9 to hold the records in the hash table.
For easy understanding, pls assume MOD function to the base 10 will be used as hash function here. In real world, the hash function would be much more complex and hard to decode it.

EMPTY HASH TABLE:
=================
As per the design of hash semi join, the following steps will be followed while executing this query,
1.     EMP is considered as the driving table and SALES is considered as the driven table
2.     For the first record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)èmod(3825/10)è5)
3.     So, this record will be inserted as the 6th record in hash table
4.     For the next record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
5.     So, this record will be inserted as the 8th record in hash table
6.     For the next record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
7.     So, this record will be inserted as the 10th record in hash table
8.     For the next record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)èmod(1784/10)è4)
9.     So, this record will be inserted as the 5th record in hash table. So, the hash table will look like this at the end of this step,

10.  For the first record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
11.  So, it hits 8th record of hash table
12.  Since there is a record from EMP is available in 8th record of hash table, this hash record will be flagged as “MATCHED NEW” and the same will be returned in the output
13.  For the second record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
14.  So, it hits 10th record of hash table
15.  Since there is a record from EMP is available in 10th record of hash table, this hash record will be flagged as “MATCHED NEW” and the same will be returned in the output
16.  For the third record in SALES, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5642)èmod(5642/10)è2)
17.  So, it hits 3rd record of hash table
18.  Since there is no record from EMP is available in 3rd record of hash table, no data will be returned in the output

19.  For the fourth(last) record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
20.  So, it hits 8th record of hash table
21.  Since there is a record from EMP is available in 8th record of hash table and already flagged as “MATCHED NEW”, this hash record will be re-flagged as “MATCHED ALREADY” but the same will not be returned in the output
22.  It exits

So, the output will look like this,
Empid
Empname
9827
FERGUSON
2389
NADAL

If you closely look at the output, you can see that “9827” employee gets displayed only once though it has 2 matching records in sales table. This is what the expected outcome for this requirement & hash semi join does the same.

Now, explain plan will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                   |                                                          |  2          |
|   3 |    HASH JOIN (SEMI)                                  |                                                          |  2          |
|   1 |          TABLE ACCESS (FULL)                       | EMP                                                |  4          |
|   2 |          TABLE ACCESS (FULL)                       | SALES                                             |  4          |
From the explain plan, we can say that both the tables are joined by hash semi join, EMP is considered as the driving table [hash table] and SALES is considered as the driven table [probe table].


9 comments:

  1. Well done, never have seen such a detailed explanation

    ReplyDelete
  2. Thanks , Nicely written. Easy to understand.

    ReplyDelete
  3. The content of information is very informative oracle training in chennai

    ReplyDelete
  4. Thanks for sharing valuable and informative content. Keep it up.

    We also provide same services such as MySQL database and sql and oracle sql free download etc. if you want to take any related services please visit our official website tosska.com.

    ReplyDelete
  5. Thanks for sharing valuable and informative content. Keep it up.

    We also provide same services such as oracle query optimizer tooL and sql query optimization tool etc. if you want to take any related services please visit our official website tosska.com.

    ReplyDelete
  6. Reach to the best
    Science Training institute in Chennai
    for skyrocketing your career, Infycle Technologies. It is the best Software Training & Placement institutes in and around Chennai. that also gives the best placement training for personality tests, interview preparation, and mock interviews for leveling up the candidate's grades to a professional level.

    ReplyDelete
  7. Don’t follow your role model. Be the Role model person for others. But it's so simple by getting Hadoop training in Chennai. Because it is an assurance course to bounce back from a double salary. For joining call 7502633633.

    ReplyDelete