With the advent of big data, the enterprise analytics landscape has dramatically changed. HDFS has become an important data repository for all business analytics. Enterprises are using various big data technologies to process data and drive actionable insights. HDFS serves as the storage where other distributed processing frameworks, such as Hadoop and Spark, access and operate on large volumes of data. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. EDWs are usually shared-nothing parallel databases that support complex SQL processing, updates, and transactions. As a result, they manage up-to-date data and support various business analytics tools, such as reporting and dashboards. A new generation of applications has emerged, requiring access to and correlation of data stored in HDFS and EDWs. This has created the need for a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. In this talk, we identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables.
Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse (EDW)
1. Building A Hybrid Warehouse:
Efficient Joins between Data Stored in HDFS and Enterprise Data Warehouse
YUANYUAN TIAN (YTIAN@US.IBM.COM)
IBM RESEARCH -- ALMADEN
Publications: Tian et al., EDBT 2015; Tian et al., TODS 2016 (invited as Best of EDBT 2015)
2. Big Data in The Enterprise
[Diagram: a big data platform (Hadoop + Spark) running ETL/ELT, graph, ML, streaming, and SQL analytics over data in HDFS (e.g. social data), alongside an EDW whose SQL engine answers SQL queries]
3. Example Scenario
SELECT L.url_prefix, COUNT(*)
FROM Transaction T, Logs L
WHERE T.category = 'Canon Camera'
AND region(L.ip) = 'East Coast'
AND T.uid = L.uid
AND T.date >= L.date AND T.date <= L.date + 1
GROUP BY L.url_prefix
Find the number of views of the URLs visited by customers with East Coast IP addresses who bought a Canon camera within one day of their online visits
[Diagram: Logs table L stored on HDFS under Hadoop + Spark, Transactions table T stored in the EDW, joined via SQL]
Correlate customers’ online behaviors with sales
4. Hybrid Warehouse
What is a Hybrid Warehouse?
A special federation between Hadoop-like big data platforms and EDWs
Two asymmetric, heterogeneous, and independent distributed systems.
Existing federation solutions are inadequate
Client-server model to access remote databases and move data
Single connection for data transmission
EDW vs SQL-on-Hadoop:
Data Ownership: EDW owns its data and controls data organization and partitioning; SQL-on-Hadoop works with existing files on HDFS and cannot dictate data layout
Index Support: EDW builds and exploits indexes; SQL-on-Hadoop is scan-based only, with no index support
Update Support: EDW supports update-in-place; HDFS files are append only
Capacity: EDW uses high-end servers in smaller clusters; SQL-on-Hadoop uses commodity machines in larger clusters (up to 10,000s of nodes)
5. Joins in Hybrid Warehouse
Focus on an equi-join between two big tables in the hybrid warehouse
Table T in an EDW (a shared-nothing full-fledged parallel database)
Table L on HDFS, with a scan-based distributed data processing engine (HQP)
Both tables are large, but generally |L|>>|T|
Data not distributed/partitioned by join key at either side
Queries are issued and results are returned at EDW side
Final result is relatively small due to aggregation
SELECT L.url_prefix, COUNT(*)
FROM Transaction T, Logs L
WHERE T.category = 'Canon Camera'
AND region(L.ip) = 'East Coast'
AND T.uid = L.uid
AND T.date >= L.date AND T.date <= L.date + 1
GROUP BY L.url_prefix
6. Existing Hybrid Solutions
Data of one system is entirely loaded into the other:
DB → HDFS: DB data gets updated frequently, and HDFS doesn't support updates properly
HDFS → DB: HDFS data is often too big to be moved into the DB
Dynamically ingest needed data from HDFS into DB
e.g. Microsoft Polybase, Pivotal Hawq, Teradata SQL-H, Oracle Big Data SQL
Selection and projection pushdown to the HDFS side
Joins executed in the DB side only
Heavy burden on the DB side
Assume that SQL-on-Hadoop systems are not efficient at join processing
NOT TRUE ANYMORE! (IBM Big SQL, Impala, Presto, etc.)
Split query processing between DB and HDFS
Microsoft Polybase
Joins executed in Hadoop, only when both tables are on HDFS
7. Goals and Contributions
Goals:
Fully utilize the processing power and massive parallelism of both systems
Minimize data movement across the network
Exploit the use of Bloom filters
Consider performing joins both at the DB side and the HDFS side
Contributions:
Adapt and extend well-known distributed join algorithms to work in the hybrid warehouse
Propose a new zigzag join algorithm that is shown to work well in most cases
Implement the algorithms in a prototype of the hybrid warehouse architecture with DB2 DPF and our
join engine on HDFS
Empirically compare all join algorithms in different selectivity settings
Develop a sophisticated cost model for all the join algorithms
8. DB-Side Join
Move the HDFS data after selection & projection to DB
Used in most existing hybrid systems: Polybase, Pivotal Hawq, etc.
HDFS table after selection & projection can still be big
Bloom filters to exploit join selectivity
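To make the Bloom-filter step concrete, here is a minimal single-process sketch (not the paper's DB2 C UDF implementation): the DB side builds a filter on the join keys of the filtered table T', the HDFS side applies it to L before shipping, so only likely-matching log tuples cross the network. The toy rows, `m_bits`, and `k_hashes` values are all hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch; illustrative only."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # derive k bit positions from salted SHA-256 digests of the key
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # no false negatives; false positives possible
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# DB side: T' = T after local predicates and projection (toy rows: (uid, category))
t_rows = [(1, "Canon Camera"), (4, "Canon Camera")]
bf_t = BloomFilter(m_bits=1024, k_hashes=2)
for uid, _ in t_rows:
    bf_t.add(uid)

# HDFS side: after local predicates on L, keep only tuples that pass bf_t;
# only l_shipped is transferred to the database, where the join executes
l_rows = [(1, "a.com/x"), (2, "b.com/y"), (3, "c.com/z"), (4, "d.com/w")]
l_shipped = [row for row in l_rows if bf_t.might_contain(row[0])]
```

Every truly matching L tuple survives the filter; a small fraction of non-matching tuples may slip through as false positives, which the database's join discards.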
9. HDFS-Side Broadcast Join
If the DB table after selection & projection is very small
Broadcast the DB table to HDFS side to avoid shuffling the HDFS table.
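A single-process sketch of the broadcast idea, with toy tuples keyed on their first column (the real system ships T' over the interconnect to every JEN worker): each worker builds a hash table on the small T' and probes only its local L partition, so L is never shuffled.

```python
def broadcast_join(t_rows, l_partitions):
    """Broadcast-join sketch: every worker gets the full (small) filtered
    DB table T' and joins it against its local HDFS partition of L."""
    hash_table = {}
    for t in t_rows:                 # in reality built once per worker
        hash_table.setdefault(t[0], []).append(t)
    results = []
    for local_l in l_partitions:     # one list per worker's local L data
        for l in local_l:
            for t in hash_table.get(l[0], []):
                results.append((t, l))
    return results

t_rows = [(1, "Canon Camera")]
l_partitions = [[(1, "a.com"), (2, "b.com")],
                [(1, "c.com"), (3, "d.com")]]
matches = broadcast_join(t_rows, l_partitions)
```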
10. HDFS-Side Repartition Join
When the DB table after selection & projection is still too large to broadcast
Both sides agree on a hash function for data shuffling
Bloom filter to exploit join selectivity
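The key requirement above is that both systems use the same partitioning function, so that matching tuples from T and L land on the same worker. A single-process sketch of the shuffle plus local hash joins (toy tuples keyed on their first column; worker count hypothetical):

```python
def repartition_join(t_rows, l_rows, n_workers=4):
    """Repartition-join sketch: both sides hash the join key with the
    SAME agreed function, then each worker joins only its partitions."""
    def worker_of(key):
        return hash(key) % n_workers   # the agreed partitioning function

    t_parts = [[] for _ in range(n_workers)]
    l_parts = [[] for _ in range(n_workers)]
    for t in t_rows:                   # DB side sends each tuple to one worker
        t_parts[worker_of(t[0])].append(t)
    for l in l_rows:                   # HDFS side shuffles L the same way
        l_parts[worker_of(l[0])].append(l)

    out = []
    for w in range(n_workers):         # local hash join on each worker
        ht = {}
        for t in t_parts[w]:
            ht.setdefault(t[0], []).append(t)
        for l in l_parts[w]:
            for t in ht.get(l[0], []):
                out.append((t, l))
    return out

pairs = repartition_join([(1, "cam"), (2, "tv")],
                         [(1, "a.com"), (2, "b.com"), (3, "c.com")])
```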
12. Implementation
EDW: DB2 DPF extended with unfenced C UDFs
Computing & applying Bloom filters
Different ways of transferring data between DB2 and JEN
HQP: Our own C++ join execution engine, called JEN
Sophisticated HDFS-side join engine using multi-threading, pipelining, hash-based aggregations, etc
Coordination between DB2 and the HDFS-side engine
Parallel communication layer between DB2 and the HDFS-side engine
HCatalog: For storing the metadata of HDFS tables
Each join algorithm is invoked by issuing a single query to DB2
13. JEN Overview
Built with a prototype of the IO layer and the scheduler from an early version of IBM Big SQL 3.0
A JEN cluster consists of one coordinator and n workers
Each JEN worker:
Multi-threaded, run on each HDFS DataNode
Read parts of HDFS tables (leveraging IO layer of IBM Big SQL 3.0)
Execute local query plans
Communicate in parallel with other JEN workers (MPI-based)
Communicate in parallel with DB2 agents : through TCP/IP sockets.
JEN coordinator:
Manage JEN workers
Orchestrate connection and communication between JEN workers and DB2 agents
Retrieve meta data for HDFS tables
Assign HDFS blocks to JEN workers (leveraging the scheduler of IBM Big SQL 3.0)
14. Experimental Setup
HDFS cluster:
30 DataNodes, each runs 1 JEN worker
Each server: 8 cores, 32 GB RAM, 1 Gbit Ethernet, 4 disks for HDFS
DB2 DPF:
5 servers, each runs 6 DB2 agents
Each server: 12 cores, 96 GB RAM, 10 Gbit Ethernet, 11 disks for DB2 data storage
Interconnection: 20Gbit switch
Dataset:
Log table L on HDFS (15 billion records)
1TB in text format
421GB in Parquet format (default)
Transaction table T in DB2 (1.6 billion records, 97GB)
Bloom filter: 128 million bits with 2 hash functions
# join keys: 16 million
false positive rate: 5%
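The ~5% figure is consistent with the standard Bloom-filter false-positive approximation; a quick check, assuming "128 million" means 128 x 10^6 bits:

```python
import math

# Experimental setup: m bits, k hash functions, n distinct join keys inserted
m, k, n = 128_000_000, 2, 16_000_000

# Standard approximation: fpr ≈ (1 - e^(-k*n/m))^k
fpr = (1 - math.exp(-k * n / m)) ** k
print(round(fpr, 3))  # ≈ 0.049, matching the slide's ~5%
```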
15. DB-Side Joins vs HDFS-Side Joins
DB-side joins work well only when selectivity on L is small (σL <= 0.01); beyond that, DB-side joins deteriorate fast!
HDFS-side joins show very steady performance with increasing L'.
HDFS-side joins (especially the zigzag join) are a very reliable choice for joins in the hybrid warehouse!
(Charts: join times at transaction table selectivity = 0.1)
16. Broadcast Join vs Repartition Join
Broadcast join only works for very limited cases, e.g. when σT <= 0.001 (T' <= 25MB).
Tradeoff: broadcasting T' (30*T') via the interconnect vs sending T' via the interconnect + shuffling L' within the HDFS cluster
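The tradeoff can be sketched as a crude byte-count comparison (it deliberately ignores that the interconnect and the intra-HDFS links have different speeds; the 25 MB figure is from the slide, the L' size is hypothetical):

```python
N_WORKERS = 30  # HDFS DataNodes in the experimental setup

def broadcast_bytes(t_prime_mb):
    # broadcast ships one full copy of T' over the interconnect per worker
    return N_WORKERS * t_prime_mb

def repartition_bytes(t_prime_mb, l_prime_mb):
    # repartition ships T' once over the interconnect, then shuffles L'
    # inside the HDFS cluster (different, typically cheaper, links)
    return t_prime_mb + l_prime_mb

# With the slide's 25 MB cut-off for T' and a hypothetical 4 GB L':
print(broadcast_bytes(25))          # 750 MB moved by broadcast
print(repartition_bytes(25, 4000))  # 4025 MB moved by repartition
```

This is why broadcast only wins when T' is tiny: its cost scales with the worker count, while repartition's cost is dominated by shuffling L'.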
0
50
100
150
200
250
0.001 0.01 0.1 0.2
Time(sec)
Log Table Selectivity
broadcast repartition
0
50
100
150
200
250
0.001 0.01 0.1 0.2
Time(sec)
Log Table Selectivity
broadcast repartition
Transaction table selectivity = 0.001 Transaction table selectivity = 0.01
17. Zigzag Join vs Repartition Joins
Transaction Table Selectivity = 0.1
Algorithm           HDFS tuples shuffled   DB tuples sent
Repartition         5,854 million          165 million
Repartition (BF)    591 million            165 million
Zigzag              591 million            30 million
Zigzag join is most efficient
Up to 2.1x faster than repartition join, up to 1.8x faster than repartition join with BF
Zigzag join significantly reduces data movement
9.9x less HDFS data shuffled, 5.5x less DB data sent
Zigzag join is the best HDFS-side join algorithm!
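The data-movement reductions above come from exchanging filters in both directions. A minimal single-process sketch of that flow, using exact key sets as a deterministic stand-in for the two Bloom filters (real filters are compact bitmaps that admit false positives); toy tuples are keyed on their first column:

```python
def zigzag_join(t_rows, l_rows):
    """Zigzag-join sketch with exact key sets in place of Bloom filters."""
    # 1. HDFS side: after local predicates on L, send a filter on L's
    #    join keys to the DB
    l_key_filter = {l[0] for l in l_rows}
    # 2. DB side: drop T tuples that cannot join, and send back a filter
    #    on the surviving join keys
    t_surviving = [t for t in t_rows if t[0] in l_key_filter]
    t_key_filter = {t[0] for t in t_surviving}
    # 3. HDFS side: drop L tuples that cannot join BEFORE any shuffling
    l_surviving = [l for l in l_rows if l[0] in t_key_filter]
    # 4. Join the two reduced inputs (a repartition join in the real system)
    ht = {}
    for t in t_surviving:
        ht.setdefault(t[0], []).append(t)
    return [(t, l) for l in l_surviving for t in ht.get(l[0], [])]

out = zigzag_join([(1, "cam"), (5, "tv")],
                  [(1, "a.com"), (1, "b.com"), (9, "c.com")])
```

Because each side is filtered by the other's keys, both the HDFS shuffle volume and the DB-to-HDFS transfer shrink, matching the table's 591 million / 30 million figures in spirit.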
18. Cost Model of Join Algorithms
Goal:
Capture the relative performance of the join algorithms
Enable a query optimizer in the hybrid warehouse to choose the right join strategy
Estimate total resource time (disk IO, network IO, CPU) in milliseconds
Parameters used in cost formulas:
System parameters: only related to the system environment (common to all queries)
# DB nodes, # HDFS nodes, DB buffer pool size, disk IO speeds (DB, HDFS), network IO speeds (DB, HDFS, in-between), etc
Estimated through a learning suite which runs a number of test programs
Query parameters: query-specific parameters
Table cardinalities, table sizes, local predicate selectivity, join selectivity, Bloom filter size, Bloom filter false-positive rate, etc
DB table: leverage DB stats
HDFS table: estimate through sampling or Hive Analyze Table command if possible
Join selectivity: estimate through sampling
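The overall shape of such a cost formula can be illustrated as follows; this is a simplified sketch, not the paper's model, and the default speeds are made-up system parameters (the real model learns separate speeds for DB, HDFS, and the interconnect per algorithm):

```python
def resource_time_ms(disk_bytes, net_bytes, cpu_tuples,
                     disk_mb_per_s=100.0, net_mb_per_s=100.0,
                     tuples_per_ms=1000.0):
    """Illustrative total-resource-time estimate in milliseconds:
    disk IO time + network IO time + CPU time."""
    mb = 1024 * 1024
    disk_ms = disk_bytes / mb / disk_mb_per_s * 1000
    net_ms = net_bytes / mb / net_mb_per_s * 1000
    cpu_ms = cpu_tuples / tuples_per_ms
    return disk_ms + net_ms + cpu_ms

# e.g. scanning 100 MB from disk at 100 MB/s costs about 1000 ms
print(resource_time_ms(100 * 1024 * 1024, 0, 0))
```

An optimizer would evaluate such a formula for each join strategy with the query parameters plugged in and pick the cheapest.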
19. Validation of Cost Models
Sel. on T   Sel. on L   Join Sel. on T   Join Sel. on L   Best from Cost Model   Best from Experiment   Intersection Metric
0.05        0.001       0.0005           0.05             db(BF)                 db(BF)                 0
0.05        0.01        0.005            0.05             db(BF)                 db(BF)                 0.18
0.05        0.1         0.05             0.05             zigzag                 zigzag                 0.08
0.05        0.2         0.1              0.05             zigzag                 zigzag                 0
0.1         0.001       0.0005           0.1              db(BF)                 db(BF)                 0
0.1         0.01        0.005            0.1              db(BF)                 db(BF)                 0.18
0.1         0.1         0.05             0.1              zigzag                 zigzag                 0.14
0.1         0.2         0.1              0.1              zigzag                 zigzag                 0.06
Cost model correctly finds the best algorithm in every case!
Even the ranking of the algorithms is similar or identical to that of empirical observation!
20. Concluding Remarks
Emerging need for hybrid warehouse: enterprise warehouses will co-exist with big data systems
Bloom filters are a good way to filter data, and can be used in both directions
Powerful SQL processing capability on the HDFS side
IBM Big SQL, Impala, Hive 14, …
Existing SQL-on-Hadoop systems can be augmented with the capabilities of JEN
More capacity and investment on the big data side
Exploit capacity without moving data
It is better to do the joins on the Hadoop side
More complex usage patterns are emerging
EDW on premise, Hadoop on cloud
Speaker notes
For example, a company running an ad campaign may want to evaluate the effectiveness of its campaign by correlating click stream data stored in HDFS with actual sales data stored in the database. This requires joining the transaction table T in the parallel database with the log table L on HDFS. Such analysis can be expressed as the following SQL query.
These applications, together with the coexistence of HDFS and EDWs, have created the need for a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. It is very important to highlight the unique challenges of the hybrid warehouse. First of all, we are dealing with two asymmetric, heterogeneous, and independent distributed systems: a full-fledged database and a SQL-on-Hadoop processor have very different characteristics.
In this work, we envision an architecture for the hybrid warehouse by studying the important problem of efficiently executing joins between HDFS and EDW data.
Many database/HDFS hybrid systems fetch the HDFS table and execute the join in the database. We first explore this approach, which we call DB-side join. Note that the HDFS table L is usually much larger than the database table T. Even if the local predicates predL are highly selective, the filtered HDFS table L' can still be quite big. In order to further reduce the amount of data transferred from HDFS to the parallel database, we introduce a Bloom filter bfT on the join key of T', the database table after applying local predicates and projection, and send this Bloom filter to the HDFS side.
We now consider executing the join at the HDFS side. If the predicates predT on the database table T are highly selective, the filtered database data T' is small enough to be sent to every HQP node, so that only local joins are needed without any shuffling of the HDFS data.
If the local predicates predT over the database table T are not highly selective, then broadcasting the filtered data T' to all HQP nodes is not a good strategy. In this case, we need a robust join algorithm.
When local predicates on neither the HDFS table nor the database table are selective, we need to fully exploit the join selectivity to perform the join efficiently. We can further reduce the amount of data movement between the two systems by exploiting Bloom filters in both directions.