Monday, December 31, 2012

Hash Semi Right Join : Oracle Tuning Tip#27

TOPIC:
Hash Semi Right Join

DEFINITION:
As described semi-join in previous forum (Tip#25), a semi-join between two tables returns rows from the outer table where one or more matches are found in the inner table. The difference between a semi-join and a conventional join is that rows in the outer table will be returned at most once. Even if the outer table contains one or more matches for a row in the inner table, only one copy of the row from the outer table will be returned. Outer table is nothing but the table mentioned in the outer query of the statement and inner table is nothing but the table mentioned in the inner query of the statement. Semi-joins are written using the EXISTS or IN constructs (or operators).

As mentioned in the earlier tips, semi join can be achieved by 3 different joining techniques. But, from Oracle 10g onwards, Oracle has started to provide the fourth joining technique to handle the semi join.
•    Nested loop semi join
•    Hash semi join
•    Hash semi right join (from Oracle 10g onwards)
•    Merge semi join

We will discuss about Hash Semi Right Join in this forum.

Hash semi right join is preferred over hash semi join only if the inner table is semi-joined with the outer table and the latter is bigger than the former either in terms of number of records or the number of distinct values in the joining column(s)) . In this case, hash semi right join would perform more efficiently than hash semi join and that’s the reason oracle will opt for the former instead of the latter from 10g onwards.

As you aware, in hash semi join, deciding which table is going to be the hash table is immaterial and the outer table is always selected as the hash table irrespective of its size. This shows the design flaw in hash semi join as the outer table would be used as the hash table, if it is big, all the records from the big outer table can’t be processed in a single pass, allocated memory space in RAM might not be big enough to hold all the required records of the big outer table, so it has to do multiple passes and takes a lot of time. This design flaw has been overcome in hash semi right join. If that small inner table was selected as the hash table, none of these issues won’t need to be addressed and that’s what hash semi right join exactly does.

In a conventional semi-join, the nested loops algorithm is desirable when the predicate(s) on the first table(outer table) are very selective and the join column(s) in the second table(inner table) are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not very selective in the first table(outer table) or the join columns in the second table(inner table) are not selectively indexed. In hash semi right join, Oracle uses RAM memory to speed up the join. In this, Oracle does a scan of the inner table, builds a RAM hash table, and then probes for matching rows from the outer table. For certain types of SQL, the hash semi right join will execute faster than the hash semi join, and the former also uses less RAM resources in that scenario.

To perform this hash semi right join, Oracle follows these steps (assume, a small inner table has to be semi-joined with a big outer table):
1.    Oracle chooses the small inner table as the hash table.
2.    Oracle build the hash table in RAM after applying the hash function on the joining column(s) of the inner table but make sure that only one record gets inserted in memory for each distinct value of joining column(s).
3.    Oracle chooses the big outer table as the probe table (or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
4.    Oracle returns the probe table’s record in the output if a record from the inner table is already present with the same hash key else the probe table’s record will be skipped.

The logical activity diagram for this methodology will be like this,

Loop (for all the required records in the inner table)
•    Apply the hash function for each record to get the address [hashed value] in RAM
•    Hit RAM in the corresponding address
•    Insert the record only if that hash key location is empty else skip it => [Semi join requirement is met]
End loop
Loop (for all the required records in the outer table)
•    Apply the hash function for each record to get the address [hashed value] in RAM
•    Hit RAM in the corresponding address
•    If this probe table record hits the hash table where a record from the inner table is already available, then the probe table’s record is returned in the output else skip to the next probe table’s record.
End loop


In the above logical activity diagram, the step highlighted in red is the important step where the hash semi right join meets the semi join requirement. In hash semi right join, if there are more than 1 record is available for the same value of the joining column(s) in the inner table, all those records will be skipped except anyone record getting inserted into the hash table. By this, displaying the same outer table record (in case of hits) again & again is being avoided and thus meet the requirement of semi join.

LITTLE-KNOWN FACTS TO BE REMEMBERED:
•    /*+ HASH_SJ */ is the hint that can be used to impose this hash semi right join. But this hint has to be applied inside the subquery of the EXISTS or IN clause, not in the main query itself.
•    From 10g version onwards, oracle can consider inner table as the hash table in imposing hash semi right join if either the number of records or the number of distinct values on the joining column(s) is lesser in inner table. The hash table and probe table was fixed in hash semi join till 9i. This has been overcome in an extended version of hash semi join which is named “hash right semi join” from 10g version onwards.
•    An in-line view with NO_MERGE hint could be used to isolate the ill effect of DISTINCT keyword so the semi-join access paths would not be defeated. This is applicable for all type of semi-joins.
•    In general (or atleast till Oracle9i), Oracle transforms a subquery with semi-join into a conventional equi-join if at all possible. Oracle does not consider “cost” when deciding whether or not to do this transformation. So Oracle can execute a query with EXISTS (or) IN clause either by opting for any semi-join access path or conventional join access path followed by an unique operation to remove duplicate rows if the transformation happens. Implementation of Hash Semi Right join has been getting improved in each Oracle version. So higher the version, Oracle may not be opting to do this transformation and just go with any semi-join access path if EXISTS (or) IN clause is used in the query. This is applicable for all type of semi-joins.
•    In RAM, hash semi right join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
•    Only the inner table records will be inserted as the hash table in RAM and the outer table records will never be written into RAM. Hash function will be applied in the probe table (outer table here) only to locate the corresponding memory entry in the hash table to look for the matching entry from the inner table there.
•    No column(s) from the subquery can be returned in the output.

•    If the query contains DISTINCT clause, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
•    If the query contains UNION set operator, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
•    If the EXISTS (or) IN clause is part of an OR branch in the query, then Oracle cannot impose this semi-join.
•    However this semi-join can be imposed in queries that contain OR clause(s) in the WHERE clause, just as long as the EXISTS (or) IN is not part of the OR.
•    Even the operator, “=ANY” can be used to impose this semi-join apart from IN & EXISTS constructs.
•    The statement must have a subquery in the IN (or) EXISTS clause in order to invoke this semi-join. It means, usual IN clause wherein hardcoded value(s) mentioned in R.H.S (Right Hand Side) of this clause cannot invoke this semi-join.
•    This methodology is opted by Oracle only if a small inner table is semi-joined with a big outer table. So this is very successful when inner table is smaller and outer table is bigger. In this case, it outplays the hash semi join.
•    If the inner table cannot be hashed in RAM in single pass (i.e., at one shot or one go), then a portion of the hash-table spills to disk(actually, TEMP tablespace will be used here to hold that spilled dataset from the inner table). When the hash table is probed by the outer table, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest [of the outer table] are also written into TEMP tablespace and joined in the subsequent passes. The bigger inner table is, the smaller the proportion of it in RAM, the remaining data will be spilled into TEMP tablespace and have to go through the subsequent passes till it processes all the records spilled into TEMP tablespace from both these inner and outer tables. This slows down Hash Semi Right Join considerably and also makes the join non-scalable in this scenario.

 

ADVANTAGE:
•    This is very successful when many records of the outer table doesn’t have the matching records in the inner table, only some records in the outer table has lots of matching records in the inner table, and both the tables are not properly indexed. In this scenario, it outperforms both nested loop semi join and hash semi join.
•    If a query is the best candidate for hash semi right join, then this semi join can reduce logical reads, CPU time & elapsed time dramatically but will increase physical reads (with no harm).
•    Still nested loop semi join will be the fastest way to retrieve the first matching record [since hash semi right join takes some time to build the hash table in RAM for the inner table as it applies hash function] but hash semi right join will be the fastest way to get all the matching records quickly when compared to nested loop semi join. In general, hash function (in hash semi right join) will perform better and more efficiently.

DISADVANTAGE:
•    If hash semi right join is imposed in a query wherein if all the records in the outer table has lots of matching records in the inner table and the inner table is properly indexed, then it will behave badly than the nested loop semi join. The reason is, as per hash semi right join, all the records in the inner table will be hashed and thus unnecessarily waste time there instead of stopping once the first matching record is seen. In this scenario, it will be better if nested loop semi join is used.
•    If the allocated space in RAM memory is not large enough to hold the entire hash table, Oracle will use the temporary tablespace, and then complete the join operation. If this join uses the temporary tablespace, then the execution time of the query will get affected. This scenario makes hash semi right join more non-scalable as it has to go through subsequent passes to join these spilled dataset (available in TEMP tablespace) from both these tables

HOW TO VERIFY:
How to verify whether Oracle follows hash semi right join or not while executing SQL. If a query follows this, then you will find similar execution plan like this,

In the explain plan, whenever it chooses this methodology, it displays the keyword [HASH JOIN (RIGHT SEMI)] in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the inner table and the next one is the outer table.

EXAMPLE:
In order to understand how the design flaw of hash semi join has been overcome by hash semi right join, we will first discuss the example with the hash semi right join. Then we would discuss how it would behave badly if the same was done with hash semi join.

Create 2 tables.
One table is store the information about all the employees and inserts 6 records.
Another table is store the information of the sales done by the employees and inserts 9 records.
The tables will look like this,

DATA TABLE:

EMP:
SALES:
Fire this query against these tables where the requirement is to display the employee’s id & name of all the employess who has done atleast single sales transaction,
Select a.empid, a.empname
from emp a
where exists (select 1 from sales b where b.empid = a.empid);


Assume, Oracle can hold 3 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume MOD function to the base 10 will be used as hash function here. In real world, the hash function would be much more complex and hard to decode it.

Case-I: (for hash semi right join)
========================


We will discuss what would happen if hash semi right join was imposed. You would see this behavior only from Oracle 10g as this semi join has been implemented from this version only. As per this join, sales is the hash table and emp is the probe table since the number of distinct values in joining column (empid) is lesser in sales table.


EMPTY HASH TABLE:
=================
As per the design of hash semi right join, the following steps will be followed while executing this query,
1.    Since SALES has less number of distinct values (on joining column) than EMP, SALES is considered as the hash table and EMP is considered as the probing table.
2.    Since the memory can hold 3 records at a time, only 3 distinct values (on joining column) is available in SALES table, everything will get processed in a single pass.
3.    For the first record in SALES, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1).
4.    Since there is no record gets inserted in hash:1 location, this record will be inserted there.
5.    For the second record in SALES, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1).
6.    Since a record is already available in hash:1 in the memory, this record will be skipped.
7.    For the third record in SALES, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1).
8.    Since a record is already available in hash:1 in the memory, this record will be skipped. So, the hash table will intermediately look like this at the end of this step

9.    For the fourth record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7).
10.    Since there is no record gets inserted in hash:7 location, this record will be inserted there.
11.    For the fifth record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7).
12.    Since a record is already available in hash:7 in the memory, this record will be skipped.
13.    For the sixth record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7).
14.    Since a record is already available in hash:7 in the memory, this record will be skipped.
15.    For the seventh record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9).
16.    Since there is no record gets inserted in hash:9 location, this record will be inserted there.
17.    For the eighth record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9).
18.    Since a record is already available in hash:9 in the memory, this record will be skipped.
19.    For the ninth (& final) record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9).
20.    Since a record is already available in hash:9 in the memory, this record will be skipped. So, the hash table will look like this at the end of this step,

21.    For the first record in EMP, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1)
22.    So, it hits hash:1 location in the hash table. Since a record from SALES is already available in the hast table with the same hash key, this data will be returned in the output
23.    For the second record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)=>mod(3825/10)=>5)
24.    So, it hits hash:5 location in the hash table. Since there is no hash key with this value, this data will not be returned in the output
25.    For the third record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7)
26.    So, it hits hash:7 location in the hash table. Since a record from SALES is already available in the hast table with the same hash key, this data will be returned in the output
27.    For the fourth record in EMP, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5742)=>mod(5742/10)=>2)
28.    So, it hits hash:2 location in the hash table. Since there is no hash key with this value, this data will not be returned in the output
29.    For the fifth record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)=>mod(1784/10)=>4)
30.    So, it hits hash:4 location in the hash table. Since there is no hash key with this value, this data will not be returned in the output
31.    For the sixth (& last) record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9)
32.    So, it hits hash:9 location in the hash table. Since a record from SALES is already available in the hast table with the same hash key, this data will be returned in the output
33.    It exits

So, the output will look like this,
If you closely look at the output, you can see that “4871”, “9827” & “2389” employees get displayed only once though all have 3 matching records in the sales table. This is what the expected outcome for this requirement & hash semi [right] join does the same.

Now, explain plan will look like this,

From the explain plan, we can say that both the tables are joined by hash semi [right] join, SALES is considered as the hash table and EMP is considered as the probe table.

If you closely look at the above executed steps, in a single pass, all the records from both the tables have been processed. Unlike hash semi join, since SALES table is considered as the hash table, all its records have been inserted into the memory in a single pass. This is how the performance has been improved as all the records have been processed in a single pass & TEMP tablespace has never been used. This is how it outplays hash semi join. If you closely look at the values mentioned in the last column “Rows” of the explain plan, all the values have been lesser than of hash semi join (which is explained next in this forum). So, the design flaw of hash semi join has finally been overcome by hash semi right join.

Case-II: (for hash semi join)
=====================


We will discuss what would happen if hash semi join was imposed.

As per this join, emp is the hash table and sales is the probe table since the latter is the inner table and the former is the outer table. That’s the reason cost is not the deciding factor and join condition is the deciding factor in deciding the hash table for the hash semi join. But this join is not the right join for this SQL and we will see how it will badly affect the query below.

EMPTY HASH TABLE:
=================
 
As per the design of hash semi join, the following steps will be followed while executing this query,
1.    Since EMP is outer table and SALES is inner table, EMP is considered as the hash table and SALES is considered as the probe table.
2.    Since the memory can hold only 3 records at a time, only the first 3 records of EMP will be processed in the first pass and the remaining 3 records will temporarily be kept at TEMP tablespace.
3.    For the first record in EMP, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1).
4.    So, this record will be inserted in hash:1 location hash table.
5.    For the second record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)=>mod(3825/10)=>5).
6.    So, this record will be inserted in hash:5 location hash table.
7.    For the third record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7).
8.    So, this record will be inserted in hash:7 location hash table. So, the hash table will look like this at the end of this step,
 
9.    For the first record in SALES, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1)
10.    So, it hits 1st of hash table
11.    Since there is a record from EMP is available in 1st record of hash table, this hash record will be flagged as “MATCHED NEW” and the same will be returned in the output
12.    For the second record in SALES, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1)
13.    So, it hits 1st of hash table
14.    Since there is a record from EMP is available in 1st record of hash table and already flagged as “MATCHED NEW”, this hash record will be re-flagged as “MATCHED ALREADY” but the same will not be returned in the output
15.    For the third record in SALES, hash function is applied and it returns 1 as the hashed value (i.e.,hash(4871)=>mod(4871/10)=>1)
16.    So, it hits 1st of hash table
17.    Since there is a record from EMP is available in 1st record of hash table and already flagged as “MATCHED ALREADY”, this hash record will be left it like that but the same will not be returned in the output
18.    For the fourth record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7)
19.    So, it hits 3rd of hash table
20.    Since there is a record from EMP is available in 3rd record of hash table, this hash record will be flagged as “MATCHED NEW” and the same will be returned in the output
21.    For the fifth record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7)
22.    So, it hits 3rd of hash table
23.    Since there is a record from EMP is available in 3rd record of hash table and already flagged as “MATCHED NEW”, this hash record will be re-flagged as “MATCHED ALREADY” but the same will not be returned in the output
24.    For the sixth record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)=>mod(9827/10)=>7)
25.    So, it hits 3rd of hash table
26.    Since there is a record from EMP is available in 3rd record of hash table and already flagged as “MATCHED ALREADY”, this hash record will be left it like that but the same will not be returned in the output
27.    For the seventh record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9)
28.    Since, this hashed value (9) is out of range for the currently constructed hash table, it will temporarily be kept in TEMP tablespace
29.    For the eighth record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9)
30.    Since, this hashed value (9) is out of range for the currently constructed hash table, it will temporarily be kept in TEMP tablespace
31.    For the ninth record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9)
32.    Since, this hashed value (9) is out of range for the currently constructed hash table, it will temporarily be kept in TEMP tablespace

33.    Since the memory can hold only 3 records at a time, then the next (& final) 3 EMP records from TEMP tablespace will be processed in this second pass.
34.    For the fourth record in EMP, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5742)=>mod(5742/10)=>2).
35.    So, this record will be inserted in hash:2 location hash table.
36.    For the fifth record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)=>mod(1784/10)=>4).
37.    So, this record will be inserted in hash:4 location hash table.
38.    For the sixth (& last) record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9).
39.    So, this record will be inserted in hash:9 location hash table. So, the hash table will look like this at the end of this step,
 
40.    Since only 3 records from SALES table is available in TEMP tablespace, only these 3 records will be used for probing the hash table.
41.    For the seventh record of SALES in TEMP tablespace, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9)
42.    So, it hits 3rd of hash table
43.    Since there is a record from EMP is available in 3rd record of hash table, this hash record will be flagged as “MATCHED NEW” and the same will be returned in the output
44.    For the eighth record of SALES in TEMP tablespace, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9)
45.    So, it hits 3rd of hash table
46.    Since there is a record from EMP is available in 3rd record of hash table and already flagged as “MATCHED NEW”, this hash record will be re-flagged as “MATCHED ALREADY” but the same will not be returned in the output
47.    For the ninth (& last) record of SALES in TEMP tablespace, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)=>mod(2389/10)=>9)
48.    So, it hits 3rd of hash table
49.    Since there is a record from EMP is available in 3rd record of hash table and already flagged as “MATCHED ALREADY”, this hash record will be left it like that but the same will not be returned in the output

50.    Since no more records from both EMP & SALES tables are available in TEMP tablespace, it exits.

So, the output will look like this,
 
Now, explain table will look like this,
From the explain plan, we can say that both the tables are joined by hash semi join methodology, EMP is considered as the hash table and SALES is considered as the probe table. 

If you closely look at the above executed steps, TEMP tablespace has been used extensively to store the data of both EMP & SALES temporarily. 2 passes were required to process all the records from both the tables. This (performance issue) happened since the big outer table has been used to build the hash table and the small inner table (in terms of number of distinct values of empid joining column) has been used to probe the hash table (as per hash semi join). If you closely look at the values mentioned in the last column “Rows” of the explain plan, all the values have been shoot up (when compared to the numbers of hash semi right join as mentioned in the first case) since TEMP tablespace has been used extensively and multiple passes went through. This performance issue has been overcome by hash semi right join as discussed above in the first case.

Sunday, March 4, 2012

Hash Semi Join : Oracle Tuning Tip#26



TOPIC:
Hash Semi Join

DEFINITION:

As described semi-join in previous forum (Tip#25), a semi-join between two tables returns rows from the outer table where one or more matches are found in the inner table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two or more matches for a row in the first table, only one copy of the row from the first table will be returned. Semi-joins are written using the EXISTS or IN constructs.

We will discuss about Hash Semi Join in this forum.

In hash semi join, Oracle uses RAM memory to speed up the join. In this, Oracle does a scan of the driving table, builds a RAM hash table, and then probes for matching rows in the driven table. For certain types of SQL, the hash semi join will execute faster than the nested loop semi join, but the former uses more RAM resources. Invoking of hash semi join by the SQL optimizer is heavily favored by the setting of hash_area_size Oracle parameter. Larger the value for hash_area_size, more hash semi joins the optimizer will invoke. 

In a conventional semi-join, the nested loops algorithm is desirable when the predicate(s) on the first table(main table) are very selective and the join column(s) in the second table(table in the subquery) are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not very selective in the first table(main table) or the join columns in the second table(table in the subquery) are not selectively indexed.

First table, main table, outer table & driving table are just different terminologies but represent the same entity (hash table). Likewise, second table, subquery table, inner table & driven table are just different terminologies but represent the same entity (probe table).

To perform hash semi join, Oracle follows these steps:
1.     Oracle chooses the table in the main query as the hash table (otherwise called as driving table). Oracle built the hash table in RAM after applying the hash function on the joining column(s) of the driving table.
2.     Oracle chooses the table in the subquery as the probe table (otherwise called as driven table or probing table). It traverse through all the records of this probe table, applies the same hash function on the joining column(s) [column(s) used to join these two tables] and will hit the corresponding entry in the hash table.
3.     If a record from the probe table hits the hash table where a record from the driving table is already available, then Oracle will flag this hash row as “MATCHED NEW” and returns the record of the driving table. [ I used “MATCHED NEW” flag term to explain this step in an easy way].
4.     If an another record from the probe table hits the same row in the hash table again, then Oracle will flag this hash row as “MATCHED ALREADY” but that driving table record will not be returned again. By this logic, it meets the requirement of semi join.

As you aware, in hash equi-join, deciding which table is going to be the hash table is the deciding factor. Generally, the smallest of the two tables will be considered as the driving table so that cost will be low. [cost will be low because the smallest of two tables will fit perfectly in RAM, occupy less amount of memory space in RAM and child table will completely be traversed in a single pass]. However, in this hash semi join, this decision making process becomes immaterial since the outer table will be chosen as the hash table by default irrespective of whether it is a small or big table.

The logical activity diagram for this methodology will be like this,

Read all the required data from the outer table and built the hash table in RAM after applying hash function
Loop (for all the required records in the inner table)
·         Apply the hash function for each record to get the address [hashed value] in RAM
·         Hit RAM in the corresponding address
·         If this probe table record hits the hash table where a record from the driving table is already available, then flag this hash row as “MATCHED NEW” if it is the first time. Flag this same hash row as “MATCHED ALREADY” from the second and subsequent hits (from other matching records of the probing table) onwards.
·         If it is a “hit”, then return the record of the driving table only if flag = MATCHED NEW   è [Semi join requirement is met]
End loop

In the above logical activity diagram, the step highlighted in red is the important step where the hash semi join meets the semi join requirement. In hash semi join, if there are more than 1 matching record in the probe table, all those matching records will be probed and will hit the same row in the hash table but the record of hash table will be returned only once when the flag is set as MATCHED NEW. By this, displaying the same hash table record (in case of hits) again & again is being avoided and thus meet the requirement of semi join.

LITTLE-KNOWN FACTS TO BE REMEMBERED:
·         /*+ HASH_SJ */ is the hint that can be used to impose this hash semi join. But this hint has to be applied inside the subquery of the EXISTS or IN clause, not in the main query itself.
·         From 10g version onwards, oracle can switch the driving table while imposing hash semi join based on the sizes of the row source. The hash table and probe table was fixed in hash semi join till 9i. From 10g onwards, the hash table and probe table can be switched based on the size of the outer & inner tables. Because of this new logic, an extended version of hash semi join is introduced which is named “hash right semi join” from 10g version onwards. We will be discussing this new join in the next forum.
·         An in-line view with NO_MERGE hint could be used to isolate the ill effect of DISTINCT keyword so the semi-join access paths would not be defeated. This is applicable for all type of semi-joins.
·         Oracle transforms a subquery with semi-join into a conventional equi-join if at all possible. Oracle does not consider cost when deciding whether or not to do this transformation. So Oracle can execute a query with EXISTS (or) IN clause either by opting for any semi-join access path or conventional join access path followed by a sort to remove duplicate rows if the transformation happens. Implementation of Hash Semi join has been getting improved in each Oracle version. So higher the version, Oracle may not be opting to do this transformation and just go with any semi-join access path if EXISTS (or) IN clause is used in the query. This is applicable for all type of semi-joins.
·         In RAM, hash semi join will be carried out only in the allocated memory space which is nothing but HASH_AREA_SIZE component of PGA.
·         Only the driving table will be inserted as the hash table in RAM and the driven table will never be written into RAM. Hash function will be applied in the driven table only to locate the corresponding memory entry in the hash table to look for the matching entry from the driving table there.
·         No column(s) from the subquery can be returned in the output.
·         If the query contains DISTINCT clause, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
·         If the query contains UNION set operator, then Oracle cannot impose this semi-join though EXISTS (or) IN clause is mentioned in the query.
·         If the EXISTS (or) IN clause is part of an OR branch in the query, then Oracle cannot impose this semi-join.
·         However this semi-join can be imposed in queries that contain OR clause(s) in the WHERE clause, just as long as the EXISTS (or) IN is not part of the OR.
·         Even the operator, “=ANY” can be used to impose this semi-join apart from IN & EXISTS clauses.
·         The statement must have a subquery in the IN (or) EXISTS clause in order to invoke this semi-join. It means, usual IN clause wherein hardcoded value(s) mentioned in R.H.S (Right Hand Side) of this clause cannot invoke this semi-join.

ADVANTAGE:
·         This is very successful when each record in the outer table has 1 or few matching record(s) in the inner table. In this scenario, it outperforms nested loop semi join even if the joining columns in the inner table are properly indexed.
·         If a query is the best candidate for hash semi join, then this semi join can reduce logical reads, CPU time & elapsed time dramatically but will increase physical reads (with no harm).
·         This is very successful when the joining column(s) of the inner table is not properly indexed.
·         Still nested loop semi join will be the fastest way to retrieve the first matching record [since hash semi join takes some time to build the hash table in RAM for the driving table as it applies hash function] but hash semi join will be the fastest way to get all the matching records quickly when compared to nested loop semi join. In general, hash function (in hash semi join) will perform better and more efficiently.

DISADVANTAGE:
·         If hash semi join is imposed in a query wherein if each record in the outer table has lots of matching records in the inner table, then it will behave very badly. The reason is, as per hash semi join, all the required records in the inner table will be probed and thus unnecessarily waste time there instead of stopping once the first matching record is seen. In this scenario, it will be better if nested loop semi join is used.
·         If the allocated space in RAM memory is not large enough to hold the entire hash table, Oracle will use the temporary tablespace, and then complete the join operation. If this join uses the temporary tablespace, then the execution time of the query will get affected badly.

HOW TO VERIFY:
How to verify whether Oracle follows hash semi join or not while executing SQL. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    HASH JOIN (SEMI)                                   |                                                          |            |
|   2 |       TABLE ACCESS (FULL)                          | <<outer table’s name>>        |            |
|   3 |       TABLE ACCESS (FULL)                          | <<inner table’s name>>         |            |
In the explain plan, whenever it chooses this methodology, it displays the keyword [HASH JOIN (SEMI)] in the operation column. Whichever the table name that is getting displayed immediately after this word is nothing but the driving (hash) table and the next one is the driven (probe) table.

EXAMPLE:

Create 2 tables.
One table is store the information about all the employees and inserts 4 records.
Another table is store the information of the sales done by the employees and inserts 4 records.
The tables will look like this,

DATA TABLE:
EMP:
ROWID
Empid
Empname
AAAAA1
3825
CHARLES
AAAAA2
9827
FERGUSON
AAAAA3
2389
NADAL
AAAAA4
1784
ERIN

SALES:
ROWID
Empid
Sales_Amt
AAAAB1
9827
8500
AAAAB2
2389
8500
AAAAB3
5642
6000
AAAAB4
9827
2000

Fire this query against these tables where the requirement is to display the employee’s id & name of all the employess who has done atleast single sales transaction,
Select a.empid, a.empname  
from emp a  
where exists (select 1 from sales b where b.empid = a.empid);

since “exists” operator is used and there is no index available in both the tables, hash semi join will be used to join these 2 tables. As per this query, emp is the driving table and sales is the driven table since the latter is inside the subquery. So, oracle will consider emp as the hash able and sales as the probe table.

Assume, Oracle can hold 10 records at a time in the allocated memory space(defined by HASH_AREA_SIZE global parameter) in RAM.
For easy understanding, pls assume the memory address will be from 0 to 9 to hold the records in the hash table.
For easy understanding, pls assume MOD function to the base 10 will be used as hash function here. In real world, the hash function would be much more complex and hard to decode it.

EMPTY HASH TABLE:
=================
As per the design of hash semi join, the following steps will be followed while executing this query,
1.     EMP is considered as the driving table and SALES is considered as the driven table
2.     For the first record in EMP, hash function is applied and it returns 5 as the hashed value (i.e.,hash(3825)èmod(3825/10)è5)
3.     So, this record will be inserted as the 6th record in hash table
4.     For the next record in EMP, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
5.     So, this record will be inserted as the 8th record in hash table
6.     For the next record in EMP, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
7.     So, this record will be inserted as the 10th record in hash table
8.     For the next record in EMP, hash function is applied and it returns 4 as the hashed value (i.e.,hash(1784)èmod(1784/10)è4)
9.     So, this record will be inserted as the 5th record in hash table. So, the hash table will look like this at the end of this step,

10.  For the first record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
11.  So, it hits 8th record of hash table
12.  Since there is a record from EMP is available in 8th record of hash table, this hash record will be flagged as “MATCHED NEW” and the same will be returned in the output
13.  For the second record in SALES, hash function is applied and it returns 9 as the hashed value (i.e.,hash(2389)èmod(2389/10)è9)
14.  So, it hits 10th record of hash table
15.  Since there is a record from EMP is available in 10th record of hash table, this hash record will be flagged as “MATCHED NEW” and the same will be returned in the output
16.  For the third record in SALES, hash function is applied and it returns 2 as the hashed value (i.e.,hash(5642)èmod(5642/10)è2)
17.  So, it hits 3rd record of hash table
18.  Since there is no record from EMP is available in 3rd record of hash table, no data will be returned in the output

19.  For the fourth(last) record in SALES, hash function is applied and it returns 7 as the hashed value (i.e.,hash(9827)èmod(9827/10)è7)
20.  So, it hits 8th record of hash table
21.  Since there is a record from EMP is available in 8th record of hash table and already flagged as “MATCHED NEW”, this hash record will be re-flagged as “MATCHED ALREADY” but the same will not be returned in the output
22.  It exits

So, the output will look like this,
Empid
Empname
9827
FERGUSON
2389
NADAL

If you closely look at the output, you can see that “9827” employee gets displayed only once though it has 2 matching records in sales table. This is what the expected outcome for this requirement & hash semi join does the same.

Now, explain plan will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   4 | SELECT STATEMENT                                   |                                                          |  2          |
|   3 |    HASH JOIN (SEMI)                                  |                                                          |  2          |
|   1 |          TABLE ACCESS (FULL)                       | EMP                                                |  4          |
|   2 |          TABLE ACCESS (FULL)                       | SALES                                             |  4          |
From the explain plan, we can say that both the tables are joined by hash semi join, EMP is considered as the driving table [hash table] and SALES is considered as the driven table [probe table].