JackHare is a framework that allows users to use ANSI-SQL queries to manipulate large-scale data in Apache HBase. It translates SQL queries to MapReduce jobs which are then executed on HBase. The framework takes a SQL query as input, parses it, looks up HBase metadata, generates corresponding MapReduce code, executes the job on HBase and returns results back in SQL format. Experimental results show that for analytical queries, JackHare performs competitively with traditional SQL databases on structured data and enables SQL-style queries on unstructured data in HBase.
2. Outline
• Introduction
• Related work
• The JackHare framework architecture
• Unstructured data processing in HBase
• Experimental results
• Conclusions
3. Introduction
• Big data problems (massive data)
– Data access speed
– Data merging: keeping data timely and correct during parallel processing
• Hadoop MapReduce
– to process the massive data in parallel.
• Hadoop distributed file system
– difficult to update data frequently
4. Introduction
• HBase
– places the data on a scale-out storage system
– manipulates changeable data in a transparent way
– the HBase interface is not user-friendly
• JackHare
– an API compliant with the ANSI-SQL and JDBC 4.0 specifications, used to operate Apache HBase
– uses the MapReduce framework to process the unstructured data in HBase
5. Introduction
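• Data access speed
– 1990: a hard disk held 1,370 MB, with a transfer speed of 4.4 MB/s
– Today: 1 TB, with a transfer speed of 100 MB/s
– Reading and writing data in parallel speeds up access
• Hadoop Distributed File System
– difficult to update data frequently in such a file system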
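The gap between capacity and transfer speed is easier to see as the time needed to read a whole disk. The short calculation below uses only the figures on this slide; the decimal unit conversion (1 TB = 1,000,000 MB) and the 100-drive parallel case are assumptions added for illustration.

```python
# Time to read a whole disk sequentially, using the figures from the slide.
MB_PER_TB = 1_000_000  # assumption: decimal units (1 TB = 10^6 MB)


def full_read_seconds(capacity_mb, rate_mb_per_s, disks=1):
    """Seconds to read capacity_mb split evenly over `disks` drives read in parallel."""
    return capacity_mb / (rate_mb_per_s * disks)


t_1990 = full_read_seconds(1_370, 4.4)                          # 1990 drive
t_now = full_read_seconds(1 * MB_PER_TB, 100)                   # 1 TB drive
t_parallel = full_read_seconds(1 * MB_PER_TB, 100, disks=100)   # hypothetical 100 drives

print(f"1990 drive: {t_1990 / 60:.1f} min")          # 1990 drive: 5.2 min
print(f"1 TB drive: {t_now / 3600:.1f} h")           # 1 TB drive: 2.8 h
print(f"1 TB over 100 drives: {t_parallel:.0f} s")   # 1 TB over 100 drives: 100 s
```

Capacity grew far faster than transfer speed, so a full read went from minutes to hours; spreading the data over many drives and reading them in parallel brings it back down.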
11. Introduction
• JackHare
– allows users to use ANSI-SQL queries to manipulate large-scale data
– an API compliant with the ANSI-SQL and JDBC 4.0 specifications, used to operate Apache HBase
– uses the MapReduce framework to process the unstructured data in HBase
12. Related work
• Pig
– Runs on HDFS and MapReduce cluster environments
– Pig Latin: a simpler procedural language
– http://pig.apache.org/docs/r0.12.0/basic.html#nestedblock
• Hive
– Provides an SQL-like query language (HiveQL) for querying data
– Can manage data stored in HDFS
– https://cwiki.apache.org/confluence/display/Hive/Tutorial
13. Related work
• YSmart
– An SQL-to-MapReduce Translator
– http://ysmart.cse.ohio-state.edu/
• S2MART
– Smart Sql to Map-Reduce Translators
14. Related work
• HadoopDB
– An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
– HadoopDB supports SQL queries through a translator called SQL-MR-SQL (SMS), which is based on Hive.
– http://db.cs.yale.edu/hadoopdb/hadoopdb.html
• Clydesdale
– structured data processing on MapReduce
– focuses on processing data that fits a star schema
17. The JackHare framework architecture
• The user submits an ANSI-SQL query through a SQL client application.
• The compiler scans and parses the ANSI-SQL query.
• The framework looks up the related HBase table name, column families and column qualifiers.
• It generates MapReduce code according to the query commands and metadata.
18. The JackHare framework architecture
• Access HBase and execute the MapReduce job.
• The results are wrapped and returned from the back end.
• The returned results are shown in the SQL client application according to the RDB schema.
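The parse, metadata lookup and code generation steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not JackHare's actual compiler: the catalog layout, the `wafer` table with its columns, and the `parse` and `plan` helpers are all hypothetical, the parser only accepts a single-table SELECT with an optional equality WHERE, and the output is a plain job description rather than generated MapReduce code.

```python
import re

# Hypothetical metadata catalog: SQL column -> (HBase column family, qualifier).
CATALOG = {
    "wafer": {
        "hbase_table": "wafer",
        "columns": {
            "wafer_id": ("info", "id"),
            "lot_id": ("info", "lot"),
            "yield": ("meas", "yield"),
        },
    }
}

# Toy grammar: SELECT a, b FROM t [WHERE c = 'v']
QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<col>\w+)\s*=\s*'(?P<val>[^']*)')?\s*;?\s*$",
    re.IGNORECASE,
)


def parse(sql):
    """Scan and parse the query into table, column list and optional filter."""
    m = QUERY_RE.match(sql)
    if m is None:
        raise ValueError(f"unsupported query: {sql!r}")
    cols = [c.strip().lower() for c in m.group("cols").split(",")]
    where = (m.group("col").lower(), m.group("val")) if m.group("col") else None
    return {"table": m.group("table").lower(), "columns": cols, "where": where}


def plan(sql, catalog=CATALOG):
    """Resolve SQL names against the catalog and describe the job to run."""
    query = parse(sql)
    meta = catalog[query["table"]]

    def resolve(column):
        family, qualifier = meta["columns"][column]
        return f"{family}:{qualifier}"

    job = {
        "hbase_table": meta["hbase_table"],
        "project": [resolve(c) for c in query["columns"]],
        "filter": None,
    }
    if query["where"]:
        column, value = query["where"]
        job["filter"] = (resolve(column), value)
    return job


print(plan("SELECT wafer_id, yield FROM wafer WHERE lot_id = 'L01'"))
```

The point of the sketch is the role of the metadata lookup: relational column names only become usable on HBase once each one is resolved to a column family and qualifier.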
22. Unstructured data processing in HBase
• Analysis of SQL clauses
– SELECT, FROM and WHERE clauses
– Extended clauses
  • GROUP BY
  • HAVING
  • ORDER BY
  • JOIN
  • Aggregate functions
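As a rough picture of how the extended clauses fit the MapReduce model, the sketch below simulates the map, shuffle and reduce phases in plain Python for a query of the form `SELECT lot, COUNT(*), AVG(defects) FROM die GROUP BY lot HAVING COUNT(*) >= 2 ORDER BY lot`. It is a generic illustration of the technique, not the code JackHare generates; the rows, the column names `lot` and `defects`, and the function names are invented for the example.

```python
from collections import defaultdict


def map_phase(rows, key_col, value_col):
    """Map: emit (GROUP BY key, value to aggregate) for every input row."""
    for row in rows:
        yield row[key_col], row[value_col]


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, min_count):
    """Reduce: compute COUNT and AVG per key, then apply the HAVING filter."""
    for key, values in groups.items():
        if len(values) >= min_count:  # HAVING COUNT(*) >= min_count
            yield key, len(values), sum(values) / len(values)


# Hypothetical input rows.
rows = [
    {"lot": "C", "defects": 1},
    {"lot": "A", "defects": 4},
    {"lot": "B", "defects": 3},
    {"lot": "A", "defects": 6},
    {"lot": "C", "defects": 3},
]

pairs = map_phase(rows, "lot", "defects")
result = sorted(reduce_phase(shuffle(pairs), min_count=2))  # ORDER BY lot
print(result)  # [('A', 2, 5.0), ('C', 2, 2.0)]
```

GROUP BY corresponds to the choice of map output key, aggregate functions and HAVING run in the reducer, and ORDER BY is a final sort over the reducer output. JOIN is not covered here.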
23. Experimental results
• Experimental environment
– Two Intel Xeon L5640 CPUs, 24 GB RAM and 3 TB hard disk
– 16-node virtual machine cluster on four physical machines
– Hadoop 0.20.203 (as of 15 October 2013, release 2.2.0 is available)
– HBase 0.92.0 (as of 2013-09-20, version 0.97.0-SNAPSHOT)
– Hive 0.9.0
– Java 1.6.0, maximum heap size 512 MB
24. Experimental results
• Experimental environment
– Node: two cores at 2 GHz, 4 GB RAM and 400 GB storage space
– MySQL: two cores at 2 GHz, 4 GB RAM and 800 GB hard disk
– Three tables: LOT, WAFER and DIE