This document discusses different techniques for performing joins in data warehousing, including nested loop joins, sort-merge joins, and hash joins. It provides code examples and diagrams to illustrate how each type of join works. Specifically, it explains that nested loop joins examine each row in one table to find matching rows in the other table, while sort-merge joins first sort both tables on the join key before merging to find matches. Hash joins hash one table and probe it with the other to identify matches. The document also discusses factors that affect the performance of each join technique, such as table order and skew.
1. Data Warehousing
Lecture-28
Need for Speed: Join Techniques
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan101@yahoo.com
5. Nested-Loop Join: Code
FOR i = 1 TO N DO BEGIN                /* N rows in T1 */
    IF the ith row of T1 qualifies THEN BEGIN
        FOR j = 1 TO M DO BEGIN        /* M rows in T2 */
            IF the ith row of T1 matches the jth row of T2 on join key THEN BEGIN
                IF the jth row of T2 qualifies THEN BEGIN
                    produce output row
                END
            END
        END
    END
END
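The pseudocode above can be sketched in Python roughly as follows. This is a minimal in-memory illustration, not the lecture's implementation: the function name `nested_loop_join`, the two filter parameters, and the sample rows (loosely modelled on the student example) are all assumptions made for the sketch.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator

Row = dict


def nested_loop_join(
    t1: Iterable[Row],
    t2: list[Row],
    key: str,
    t1_qualifies: Callable[[Row], bool] = lambda r: True,
    t2_qualifies: Callable[[Row], bool] = lambda r: True,
) -> Iterator[tuple[Row, Row]]:
    """Yield (t1_row, t2_row) pairs that match on `key` and pass both filters."""
    for r1 in t1:                      # outer loop: N rows in T1
        if not t1_qualifies(r1):
            continue                   # skip non-qualifying outer rows
        for r2 in t2:                  # inner loop: M rows in T2
            if r1[key] == r2[key] and t2_qualifies(r2):
                yield r1, r2           # produce output row


# Hypothetical sample data, invented for this sketch.
personal = [
    {"id": 1, "gender": "M"},
    {"id": 2, "gender": "F"},
    {"id": 3, "gender": "M"},
]
academic = [
    {"id": 1, "level": "UG", "gpa": 3.0},
    {"id": 2, "level": "UG", "gpa": 3.6},
    {"id": 3, "level": "PG", "gpa": 3.9},
]

pairs = list(
    nested_loop_join(
        personal,
        academic,
        key="id",
        t1_qualifies=lambda r: r["gender"] == "M",
        t2_qualifies=lambda r: r["level"] == "UG",
    )
)
```

The inner table is scanned in full once for every qualifying outer row, which is why filtering the outer table early matters for cost.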
6. Nested-Loop Join: Working Example
Query: "What is the average GPA of undergraduate male students?"
For each qualifying row of the Student Personal table, the Student Academic table is examined for matching rows.
[Figure: Student Personal Table and Student Academic Table, with entries labelled 298, 62 and 440 and three Search/Results steps.]
8. Nested-Loop Join: Cost Formula
Join cost = Cost of accessing Table_A + (# of qualifying rows in Table_A × Blocks of Table_B to be scanned for each qualifying row)
OR
Join cost = Blocks accessed for Table_A + (Blocks accessed for Table_A × Blocks accessed for Table_B)
9. Nested-Loop Join: Cost of Reorder
Table_A = 500 blocks and Table_B = 700 blocks.
Qualifying blocks for Table_A: QB(A) = 50
Qualifying blocks for Table_B: QB(B) = 100
Join cost A&B = 500 + 50 × 700 = 35,500 I/Os
Join cost B&A = 700 + 100 × 500 = 50,700 I/Os
i.e. an increase in I/O of about 43%.
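The arithmetic above can be checked with a few lines of Python. The helper name `nested_loop_cost` and its parameter names are assumptions introduced for this sketch; the numbers are the ones given on the slide.

```python
def nested_loop_cost(outer_blocks: int, outer_qualifying: int, inner_blocks: int) -> int:
    """I/O cost = one scan of the outer table
    + one scan of the inner table per qualifying outer block."""
    return outer_blocks + outer_qualifying * inner_blocks


# Table_A = 500 blocks with QB(A) = 50; Table_B = 700 blocks with QB(B) = 100.
cost_ab = nested_loop_cost(500, 50, 700)    # Table_A as the outer table
cost_ba = nested_loop_cost(700, 100, 500)   # Table_B as the outer table

# Relative increase when the join order is reversed, in percent.
increase_pct = (cost_ba - cost_ab) / cost_ab * 100

print(cost_ab, cost_ba, round(increase_pct, 1))  # 35500 50700 42.8
```

Choosing the table with fewer qualifying blocks as the outer table gives the lower cost in this example.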
17. Hash-Based Join: Example
[Figure: Table_A (the original relation) is read from disk and passed through hash function h so that it is held in main memory. Table_B remains on disk and is matched against the in-memory Table_A to produce the join result. The figure also carries the labels M and N and block numbers 1, 2, ..., N.]
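The build-and-probe idea in the figure can be sketched in Python as below. This is an illustrative in-memory version under stated assumptions: the function name `hash_join`, the column name `k`, and the sample rows are invented here, Python's built-in dictionary hashing stands in for the hash function h, and Table_A is assumed to fit in main memory.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Hash the build table on `key`, then probe it with each row of the other table."""
    # Build phase: load the (assumed smaller) table into an in-memory hash table.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[key]].append(row)

    # Probe phase: stream the other table and look up matches by key.
    for row in probe_rows:
        for match in buckets.get(row[key], []):
            yield match, row


# Hypothetical sample data: table_a is the build side, table_b the probe side.
table_a = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
table_b = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}, {"k": 2, "b": "r"}]

result = list(hash_join(table_a, table_b, "k"))
```

Each table is read once, so the inner table is not rescanned per outer row as in the nested-loop join. If many rows share one key value (skew), a single bucket grows long and the probe for that key slows down.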
Pictorial representation of the nested loop join algorithm.
Implementation Strategies [3]
Top Down approach: It is generally useful for projects where the technology is mature and well understood, as well as where the business problems that must be solved are clear and well understood.
Bottom Up approach: It is useful, on the other hand, for making technology assessments and is a good technique for organizations that are not leading-edge technology implementers. This approach is used when the business objectives to be met by the data warehouse are unclear, or when the current or proposed business process will be affected by the data warehouse.
Development Methodologies
A Development Methodology describes the expected evolution and management of the engineering system.
Waterfall Model: The model is a linear sequence of stages such as requirements definition, system design, detailed design, integration and testing, and finally operations and maintenance. This model is used when the system requirements and objectives are known and clearly specified.
Spiral Model: The model is a sequence of waterfall models that corresponds to risk-oriented iterative enhancement, and it recognizes that requirements are not always available and clear when the system is first implemented [3].
RAD: Rapid Application Development (RAD) is an iterative model consisting of steps such as scope, analyze, design, construct, test, implement, and review.
Since designing and building a data warehouse is an iterative process, the spiral method is the best development methodology [3].
The iterative RAD process is much better suited to the development of a data warehouse. Development and delivery of early prototypes will drive future requirements as business users are given direct access to information and the ability to manipulate it. Management of expectations requires that the content of the data warehouse be clearly communicated for each iteration [4].
While one can use the traditional waterfall approach to developing a data warehouse, there are several drawbacks. First and foremost, the project is likely to occur over an extended period of time, during which the users may not have had an opportunity to review what will be delivered. Second, in today's demanding competitive environment there is a need to produce results in a much shorter timeframe [4].