Friday, December 31, 2010

Oracle Tuning Tip#19: Sort Merge Join



TOPIC:
Sort Merge Join

DEFINITION:
To perform sort merge join, Oracle follow these steps[assume table-a & table-b have to be joined]:
  1. 1.     Oracle read all the required records from table-a and sort the data by the joining column(s)
  2. 2.     Oracle read all the required records from table-b and sort the data by the same joining column(s)
  3. 3.     Oracle then compare all these sorted data from both these tables record-by-record based on the joining column(s) and returns the output if it gets matched

In this join, there is no concept of driving & driven table like nested loop join. Important point to understand is, unlike nested loop join where driven(inner) table is read as many number of times as the input from outer table, in sort merge join each of the tables involved are accessed at most once. Actually sort merge join has two phases, Sorting & Merging. Step-1 & Step-2 are of Sorting phase and Step-3 is of Merging phase. In the sorting phase, all the required records from both the tables are read and then get sorted. In the merging phase, records from both these tables are joined and only the matching records will be returned in the output.

The logical flow diagram for this methodology will be like this,

PHASE-I (Sorting)
   Read the records from table-A
   Sort the retrieved records of table-A by the join_key (which is nothing but the column(s) used for joining these two tables)
   Dump into temp_a (assume logically that temp_a is just a temporary memory space to hold this sorted data of table-A)

   Read the records from table-B
   Sort the retrieved records of table-B by the same join_key
   Dump into temp_b

PHASE-II (Merging)
   read 1st record from temp_a
   read 1st record from temp_b
   while [NOT End Of Record] on temp_a and temp_b
   loop
        if ( temp_a.join_key = temp_b.join_key )
                 then output joined record
        else if ( temp_a.join_key <> temp_b.join_key )
                 then don’t output the joined record
        end if
        if ( temp_a.join_key <= temp_b.join_key )
                 then read the next record in temp_a
        else if ( temp_a.join_key > temp_b.join_key )
                 then read the next record in temp_b
        end if
   end loop

LITTLE-KNOWN FACTS TO BE REMEMBERED:
  • ·         /*+ USE_MERGE(<<table-1>> <<table-2>>) */ is the hint that can be used to impose this sort merge join.
  • ·         It is completely untrue that Sort Merge join does only TABLE ACCESS FULL on both the tables because in some cases, it will access the index table also but only if that cost is less.
  • ·         In sort merge join, Sort phase will be skipped if the data is read from the index because it doesn’t require the sorting as the data is being read from the index is already in the sorted order. Sort phase will be imposed only if the data is read from the data table directly(TABLE ACCESS FULL) because the sorting is required here as the data is not coming out in the sorted order.
  • ·         This methodology is especially opted by Oracle when both the tables are joined by inequality operators like <,>,<=,>=. This is because Hash Join can’t be imposed when inequality operators are used and Nested Loop Join is definitely not an option if both these tables are too big.
  • ·         This is very successful if both the tables are bigger and outplays nested loop join in this case.

ADVANTAGE:
  • ·         If the columns mentioned in the ORDER BY clause of the sql statement is same as the joining columns, Optimizer prefers sort merge join over hash join as it is cheaper. Reason is, sorting doesn’t need to be done explicitly again as the data coming out from sort merge join would already be in the sorted order and that is not the case in hash join. In this case, you will not find this operation, SORT BY or ORDER BY in the explain plan which will confirm that explicit sorting is not done.

DISADVANTAGE:
  • ·         TEMP tablespace will be used extensively if the data read from both the tables don’t fit into the allocated memory space [which is exactly SORT_AREA_SIZE component of PGA] and query will start to throw errors if even TEMP tablespace is not big enough to hold the data from both the tables
  • ·         If you closely look at the logic of sort merge join, matching records start to get displayed only in the merging phase as none will be displayed till the sorting phase is completed. Due to this reason, this join has to be avoided in some OLTP applications if the initial matching records need to be displayed as quick as possible.
  • ·         This is very resource intense process (especially CPU is utilized a lot) since both the bigger tables need to be sorted and merged in the memory

HOW TO VERIFY:
How to verify whether Oracle follows sort merge join or not while executing the sql query. If a query follows this, then you will find similar execution plan like this,

--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                   |                                                          |            |
|   1 |    MERGE JOIN                                              |                                                          |            |
|   2 |       SORT JOIN                                               |                                                          |            |
|   3 |          TABLE ACCESS (FULL)                       | <<table-a>>                                |            |
|   4 |       SORT JOIN                                               |                                                          |            |
|   5 |          TABLE ACCESS (FULL)                       | <<table-b>>                               |            |

In the explain plan, whenever it chooses this methodology, it displays the keyword (MERGE JOIN and SORT JOIN) in the operation column.

EXAMPLE:
Create 2 tables.
One table is store the information of the sales done by the employees for January Month and inserts 5 records.
Another table is store the information of the sales done by the employees for February Month and inserts 5 records.
The tables will look like this,

DATA TABLE:
EMP_JAN:
ROWID
Empid
empname
Sales_Amt
AAAAA1
5
CHARLES
2500
AAAAA2
7
FERGUSON
1000
AAAAA3
9
NADAL
3000
AAAAA4
4
ERIN
7000
AAAAA5
6
ROONEY
5500

EMP_FEB:
ROWID
Empid
empname
Sales_Amt
AAAAA6
9
NADAL
6000
AAAAA7
6
ROONEY
8500
AAAAA8
8
JOHN
4000
AAAAA9
7
FERGUSON
9000
AAAAA10
2
MIKE
4500

Fire this query against these tables where the requirement is to display all the employee name(s) along with their sales information whoever able to do the sales on both the months,
Select a.empname, a.sales_amt january_amt, b.sales_amt february_amt
from emp_jan a, emp_feb b
where a.empid = b.empid;

As per the design of sort merge join, the following steps will be followed while executing this query,
1.     Oracle reads all the records from EMP_JAN table and sorts it based on empid column. The intermediate result set will be like this, [as per sorting phase of logical flow diagram]
Empid
empname
Sales_Amt
4
ERIN
7000
5
CHARLES
2500
6
ROONEY
5500
7
FERGUSON
1000
9
NADAL
3000
2.     Oracle reads all the records from EMP_FEB table and sorts it based on empid column. The intermediate result set will be like this, [as per sorting phase of logical flow diagram]
Empid
Empname
Sales_Amt
2
MIKE
4500
6
ROONEY
8500
7
FERGUSON
9000
8
JOHN
4000
9
NADAL
6000
3.     It reads the first record from both these intermediate result sets (empid:4 from 1st & empid:2 from 2nd) [as per initial steps of merging phase in logical flow diagram]
4.     Since 4 is not equal to 2, the joined record will not be displayed [as per 1st IF clause of merging phase in logical flow diagram]
5.     It reads the next record from 2nd table since 4 is greater than 2 (empid:4 from 1st & empid:6 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
6.     Since 4 is not equal to 6, the joined record will not be displayed [as per 1st IF clause of merging phase in logical flow diagram]
7.     It reads the next record from 1st table since 4 is less than 6  (empid:5 from 1st & empid:6 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
8.     Since 5 is not equal to 6, the joined record will not be displayed [as per 1st IF clause of merging phase in logical flow diagram]
9.     It reads the next record from 1st table since 5 is less than 6 (empid:6 from 1st & empid:6 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
10.  Since 6 is equal to 6, the joined record will be displayed [as per 1st IF clause of merging phase in logical flow diagram]
11.  It reads the next record from 1st table since 6 is equal to 6 (empid:7 from 1st & empid:6 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
12.  Since 7 is not equal to 6, the joined record will not be displayed [as per 1st IF clause of merging phase in logical flow diagram]
13.  It reads the next record from 2nd table since 7 is greater than 6 (empid:7 from 1st & empid:7 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
14.  Since 7 is equal to 7, the joined record will be displayed [as per 1st IF clause of merging phase in logical flow diagram]
15.  It reads the next record from 1st table since 7 is equal to 7 (empid:9 from 1st & empid:7 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
16.  Since 9 is greater than 7, the joined record will not be displayed [as per 1st IF clause of merging phase in logical flow diagram]
17.  It reads the next record from 2nd table since 9 is greater than 7 (empid:9 from 1st & empid:8 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
18.  Since 9 is greater than 8, the joined record will not be displayed [as per 1st IF clause of merging phase in logical flow diagram]
19.  It reads the next record from 2nd table since 9 is greater than 8 (empid:9 from 1st & empid:9 from 2nd) [as per 2nd IF clause of merging phase in logical flow diagram]
20.  Since 9 is equal to 9, the joined record will be displayed [as per 1st IF clause of merging phase in logical flow diagram]
21.  It exits as all the records from both the tables are processed [as per WHILE condition of merging phase in logical flow diagram]

So, the output will look like this,
Empname
January_amt
February_amt
ROONEY
5500
8500
FERGUSON
1000
9000
NADAL
3000
6000

If you close look at the output, you can see that the output is displayed in the order of empid column (empids, 6,7 &9 of the employees, ROONEY, FERGUSON & NADAL accordingly). This is because of the outcome of sorting phase of sort merge join.

Now, explain table will look like this,
--------------------------------------------------------------------------------------------------------
| Id  | Operation                                                     | Name                                             | Rows  |
--------------------------------------------------------------------------------------------------------
|   6 | SELECT STATEMENT                                   |                                                          |  3          |
|   5 |    MERGE JOIN                                              |                                                          |  3          |
|   2 |       SORT JOIN                                               |                                                          |  5          |
|   1 |          TABLE ACCESS (FULL)                       | EMP_JAN                                     |  5          |
|   4 |       SORT JOIN                                               |                                                          |  5          |
|   3 |          TABLE ACCESS (FULL)                       | EMP_FEB                                      |  5          |

From the explain plan, we can say that both the tables are joined by sort merge natural join methodology.