2. Agenda Hive Overview Version 0.6 (released!) Version 0.7 (under development) Hive is now a TLP! Roadmaps
3. What is Hive? A Hadoop-based system for querying and managing structured data Uses Map/Reduce for execution Uses Hadoop Distributed File System (HDFS) for storage
4. Hive Origins Data explosion at Facebook Traditional DBMS technology could not keep up with the growth Hadoop to the rescue! Incubation with ASF, then became a Hadoop sub-project Now a top-level ASF project
5. SQL vs MapReduce
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
6. Hive Evolution Originally: a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs Now more and more: A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
7. Intended Usage Web-scale Big Data 100’s of terabytes Large Hadoop cluster 100’s of nodes (heterogeneous OK) Data has a schema Batch jobs for both loads and queries
8. So Don’t Use Hive If… Your data is measured in GB You don’t want to impose a schema You need responses in seconds A “conventional” analytic DBMS can already do the job (and you can afford it) You don’t have a lot of time and smart people
9. Scaling Up Facebook warehouse, Jan 2011: 2750 nodes 30 petabytes disk space Data access per day: ~40 terabytes added (compressed) 25000 map/reduce jobs 300-400 users/month
10. Facebook Deployment Web Servers Scribe MidTier Scribe-Hadoop Clusters Hive Replication Production Hive-Hadoop Cluster Archival Hive-Hadoop Cluster Adhoc Hive-Hadoop Cluster Sharded MySQL
13. Column Data Types Primitive Types integer types, float, string, boolean Nest-able Collections array<any-type> map<primitive-type, any-type> User-defined types structures with attributes which can be of any-type
14. Hive Query Language DDL {create/alter/drop} {table/view/partition} create table as select DML Insert overwrite QL Sub-queries in from clause Equi-joins (including Outer joins) Multi-table Insert Sampling Lateral Views Interfaces JDBC/ODBC/Thrift
15. Query Translation Example
SELECT url, count(*) FROM page_views GROUP BY url
Map tasks compute partial counts for each URL in a hash table (“map side” pre-aggregation)
Map outputs are partitioned by URL and shipped to corresponding reducers
Reduce tasks tally up partial counts to produce final results
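The three steps above can be sketched in plain Python. This is only an illustration of the dataflow, not Hive's generated plan: the function names, the in-memory "splits", and the use of crc32 as the partitioning hash are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from zlib import crc32


def map_task(urls):
    """Map side: pre-aggregate one input split into a hash table of partial counts."""
    return Counter(urls)


def shuffle(partials, num_reducers):
    """Partition (url, partial_count) pairs by a hash of the URL (crc32 is an assumption)."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for url, count in partial.items():
            reducer = crc32(url.encode("utf-8")) % num_reducers
            buckets[reducer][url].append(count)
    return buckets


def reduce_task(bucket):
    """Reduce side: tally the partial counts for every URL routed to this reducer."""
    return {url: sum(counts) for url, counts in bucket.items()}


def count_by_url(splits, num_reducers=2):
    partials = [map_task(split) for split in splits]
    result = {}
    for bucket in shuffle(partials, num_reducers):
        result.update(reduce_task(bucket))
    return result


# Three hypothetical input splits of the page_views table (url column only).
splits = [["/a", "/b", "/a"], ["/b", "/c"], ["/a"]]
print(count_by_url(splits))
```

Because every occurrence of a URL hashes to the same reducer, each reducer can finish its URLs without seeing any other reducer's data.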
16.
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school
18. Behavior Extensibility TRANSFORM scripts (any language) Serialization+IPC overhead User defined functions (Java) In-process, lazy object evaluation Pre/Post Hooks (Java) Statement validation/execution Example uses: auditing, replication, authorization, multiple clusters
19. Map/Reduce Scripts Examples
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM (SELECT TRANSFORM(user_id, page_url, unix_time)
      USING 'page_url_to_id.py' AS (user_id, page_id, unix_time)
      FROM mylog
      DISTRIBUTE BY user_id
      SORT BY user_id, unix_time) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py' AS (user_id, session_info);
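The slides do not show the scripts themselves. A minimal sketch of what a session cutter might look like follows, assuming rows reach the script as tab-separated lines already sorted by user_id and unix_time (which is what DISTRIBUTE BY and SORT BY arrange). The 30-minute gap and the start:end:page_count layout of session_info are invented for illustration.

```python
SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold, not from the slides


def format_session(session):
    """Render one session as 'user_id<TAB>start:end:page_count' (hypothetical layout)."""
    user_id, start, end, pages = session
    return "%s\t%d:%d:%d" % (user_id, start, end, pages)


def cut_sessions(lines, gap=SESSION_GAP_SECONDS):
    """Yield one output line per session from rows sorted by user and time."""
    current = None  # [user_id, start_time, last_time, page_count]
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        user_id, _page_id, unix_time = line.split("\t")
        ts = int(unix_time)
        if current and current[0] == user_id and ts - current[2] <= gap:
            # Same user, still within the gap: extend the open session.
            current[2] = ts
            current[3] += 1
        else:
            # New user or long pause: emit the open session and start another.
            if current:
                yield format_session(current)
            current = [user_id, ts, ts, 1]
    if current:
        yield format_session(current)


# Stand-in for the rows Hive would stream to the script.
rows = ["u1\t10\t1000", "u1\t11\t1600", "u1\t12\t9000", "u2\t10\t1200"]
for out in cut_sessions(rows):
    print(out)
```

Run as a TRANSFORM script, the final loop would iterate over sys.stdin instead of the in-memory rows list. The script only needs to remember the one open session because the input is already grouped and ordered.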
20. UDF vs UDAF vs UDTF
User Defined Function: one-to-one row mapping, e.g. Concat('foo', 'bar')
User Defined Aggregate Function: many-to-one row mapping, e.g. Sum(num_ads)
User Defined Table Function: one-to-many row mapping, e.g. Explode([1,2,3])
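The difference between the three kinds is only in how many rows go in and come out. A small Python analogy, using plain functions rather than Hive's Java interfaces (the function names and sample rows are made up):

```python
def concat(a, b):
    """UDF: one input row produces exactly one output value."""
    return a + b


def sum_agg(values):
    """UDAF: many input rows collapse into one output value."""
    total = 0
    for v in values:
        total += v
    return total


def explode(items):
    """UDTF: one input row expands into zero or more output rows."""
    for item in items:
        yield item


# Two hypothetical input rows.
rows = [
    {"first": "foo", "second": "bar", "num_ads": 3, "tags": [1, 2, 3]},
    {"first": "a", "second": "b", "num_ads": 4, "tags": []},
]

per_row = [concat(r["first"], r["second"]) for r in rows]  # 2 rows in, 2 values out
aggregated = sum_agg(r["num_ads"] for r in rows)           # 2 rows in, 1 value out
exploded = [t for r in rows for t in explode(r["tags"])]   # 2 rows in, 3 rows out
print(per_row, aggregated, exploded)
```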
21. UDF Example
add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
SELECT testlength(src.value) FROM src;
DROP TEMPORARY FUNCTION testlength;

UDFTestLength.java:
package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTestLength extends UDF {
  // Returns the length of s, or null for a null input.
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}
22. Storage Extensibility Input/OutputFormat: file formats SequenceFile, RCFile, TextFile, … SerDe: row formats Thrift, JSON, ProtocolBuffer, … Storage Handlers (new in 0.6) Integrate foreign metadata, e.g. HBase Indexing Under development in 0.7
23. Release 0.6 October 2010 Views Multiple Databases Dynamic Partitioning Automatic Merge New Join Strategies Storage Handlers
24. Dynamic Partitions
Automatically create partitions based on distinct values in columns
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
FROM page_view_stg pvs
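As a rough model of what the statement above does: dt is fixed by the query, while country is read from each row, so one partition appears per distinct country value. The sketch below mimics that routing in memory. The function name and sample rows are hypothetical; only the key=value directory naming follows Hive's partition layout.

```python
from collections import defaultdict


def dynamic_partition(rows, static_spec, dynamic_col):
    """Group rows into partition paths: static keys first, then the per-row value."""
    partitions = defaultdict(list)
    for row in rows:
        row = dict(row)                 # copy so the caller's rows are untouched
        value = row.pop(dynamic_col)    # the partition column is not stored in the data
        spec = list(static_spec) + [(dynamic_col, value)]
        path = "/".join(f"{key}={val}" for key, val in spec)
        partitions[path].append(row)
    return dict(partitions)


# Hypothetical staging rows; only two columns are kept for brevity.
staged = [
    {"userid": 1, "country": "US"},
    {"userid": 2, "country": "IN"},
    {"userid": 3, "country": "US"},
]
for path, part_rows in sorted(dynamic_partition(staged, [("dt", "2008-06-08")], "country").items()):
    print(path, part_rows)
```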
25. Automatic merge Jobs can produce many files Why is this bad? Namenode pressure Downstream jobs have to deal with file processing overhead So, clean up by merging results into a few large files (configurable) Use conditional map-only task to do this
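The "conditional" part can be pictured as a check on the job's output files followed by a packing step. The sketch below is an assumption about the general shape of such a decision, not Hive's actual logic; both thresholds and the function name are hypothetical parameters.

```python
def plan_merge(file_sizes, small_avg_threshold, target_size):
    """Return groups of file sizes to concatenate, or [] when no merge is needed."""
    if not file_sizes:
        return []
    average = sum(file_sizes) / len(file_sizes)
    if average >= small_avg_threshold:
        return []  # output files are already large enough: skip the extra task
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        # Close the current group once adding this file would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Sizes in arbitrary units; four small files become two larger ones.
print(plan_merge([10, 20, 30, 40], small_avg_threshold=50, target_size=60))
print(plan_merge([100, 200], small_avg_threshold=50, target_size=60))
```

Skipping the merge when files are already large avoids paying for a second pass over data that would not benefit from it.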
26. Join Strategies
Old Join Strategies: Map-reduce and Map Join
Bucketed map-join: allows “small” table to be much bigger
Sort Merge Map Join
Deal with skew in map/reduce join: conditional plan step for skewed keys
27. Storage Handler Syntax HBase Example
CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
28. Release 0.7 Deployed in Facebook Stats Functions Indexes Local Mode Automatic Map Join Multiple DISTINCTs Archiving In development Concurrency Control Stats Collection J/ODBC Enhancements Authorization RCFile2 Partitioned Views Security Enhancements
29. Statistical Functions Stats 101 Stddev, var, covar Percentile_approx Data Mining Ngrams, sentences (text analysis) Histogram_numeric SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
36. Local Mode Execution
Avoids map/reduce cluster job latency
Good for jobs which process small amounts of data
Let Hive decide when to use it: set hive.exec.mode.local.auto=true;
Or force its usage: set mapred.job.tracker=local;
37. Automatic Map Join Map-Join if small table fits in memory If it can’t, fall back to reduce join Optimize hash table data structures Use distributed cache to push out pre-filtered lookup table Avoid swamping HDFS with reads from thousands of mappers
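A minimal sketch of the decision described above, assuming the "fits in memory" test is a simple row-count limit (the real criterion is not given on the slide). Both strategies are modeled in memory and return the same rows; the function names are made up.

```python
from collections import defaultdict


def hash_join(big, small):
    """Map join: build a hash table from the small side, then stream the big side."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    return [(key, bv, sv) for key, bv in big for sv in table.get(key, [])]


def reduce_join(big, small):
    """Reduce-side join: group both inputs by key, then pair rows within each group."""
    groups = defaultdict(lambda: ([], []))
    for key, value in big:
        groups[key][0].append(value)
    for key, value in small:
        groups[key][1].append(value)
    return [(key, bv, sv)
            for key, (bvs, svs) in groups.items()
            for bv in bvs for sv in svs]


def auto_join(big, small, max_small_rows):
    """Pick the map join when the small side is under the (assumed) size limit."""
    if len(small) <= max_small_rows:
        return "map-join", hash_join(big, small)
    return "reduce-join", reduce_join(big, small)


views = [(1, "/a"), (2, "/b"), (1, "/c"), (3, "/d")]   # (userid, page_url)
users = [(1, "alice"), (2, "bob")]                     # (userid, name)
print(auto_join(views, users, max_small_rows=10))
print(auto_join(views, users, max_small_rows=1))
```

The map join never shuffles the big table, which is where the savings come from. The fallback exists because a hash table that does not fit in memory would fail the task outright.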
38. Multiple DISTINCT Aggs Example
SELECT view_date, COUNT(DISTINCT userid), COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date
39. Archiving
Use HAR (Hadoop archive format) to combine many files into a few
Relieves namenode memory
ALTER TABLE page_views {ARCHIVE|UNARCHIVE} PARTITION (ds='2010-10-30')
40. Concurrency Control Pluggable distributed lock manager Default is Zookeeper-based Simple read/write locking Table-level and partition-level Implicit locking (statement level) Deadlock-free via lock ordering Explicit LOCK TABLE (global)
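"Deadlock-free via lock ordering" means every statement acquires its locks in one global order, so two statements can never each hold a lock the other is waiting for. A minimal single-process sketch of the idea follows, using exclusive locks only and sorting by object name. The class, the ordering key, and the table names are assumptions; Hive's lock manager is distributed and also has shared (read) locks.

```python
import threading


class LockManager:
    """Toy lock manager: one exclusive lock per named table or partition."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def acquire_all(self, names):
        """Acquire every lock in sorted name order, whatever order was requested."""
        ordered = sorted(set(names))
        for name in ordered:
            self._lock_for(name).acquire()
        return ordered

    def release_all(self, ordered):
        for name in reversed(ordered):
            self._lock_for(name).release()


manager = LockManager()
completed = []


def statement(label, objects):
    """Implicit statement-level locking: lock everything, run, then unlock."""
    held = manager.acquire_all(objects)
    try:
        completed.append(label)
    finally:
        manager.release_all(held)


# The two statements name the same tables in opposite order.
t1 = threading.Thread(target=statement, args=("q1", ["page_views", "users"]))
t2 = threading.Thread(target=statement, args=("q2", ["users", "page_views"]))
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join(timeout=5)
print(sorted(completed))
```

Without the sort, q1 could hold page_views while q2 holds users, and each would wait on the other forever.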
41. Statistics Collection Implicit metastore update during load Or explicit via ANALYZE TABLE Table/partition-level Number of rows Number of files Size in bytes
42. Hive is now a TLP PMC Namit Jain (chair) John Sichi Zheng Shao Edward Capriolo Raghotham Murthy Committers Amareshwari Sriramadasu Carl Steinbach Paul Yang He Yongqiang Prasad Chakka Joydeep Sen Sarma Ashish Thusoo Ning Zhang
43. Developer Diversity Recent Contributors Facebook, Yahoo, Cloudera Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems Numerous research projects Many many more… Monthly San Francisco bay area contributor meetups India meetups ?
44. Roadmap: Heavy-Duty Tests Unit tests are insufficient What is needed: Real-world schemas/queries Non-toy data scales Scripted setup; configuration matrix Correctness/performance verification Automatic reports: throughput, latency, profiles, coverage, perf counters…
45. Roadmap: Shared Test Site Nightly runs, regression alerting Performance trending Synthetic workload (e.g. TPC-H) Real-world workload (anonymized?) This is critical for Non-subjective commit criteria Release quality
46. Roadmap: New Features Hive Server Stability/Deployment File Concatenation Reduce Number of Files Performance Bloom Filters Push Down Filters Cost Based Optimizer Column Level Statistics Plan should be based on Statistics