Benchmarking Hive at Yahoo Scale
Presented by Mithun Radhakrishnan | June 18, 2014
Hadoop User Group
About myself
▪ HCatalog Committer, Hive contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, Data Ingestion
▪ Other odds and ends
› DistCp
▪ mithun@apache.org
Hadoop User Group, 201406181830, Yahoo Sunnyvale
About this talk
▪ Introduction to “Yahoo Scale”
▪ The use-case in Yahoo
▪ The Benchmark
▪ The Setup
▪ The Observations (and, possibly, lessons)
▪ Fisticuffs
The Y!Grid
▪ 16 Hadoop clusters in YGrid
› 32,500 nodes
› 750K jobs a day
▪ Hadoop 0.23.10.x, 2.4.x
▪ Large datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance
▪ Pig 0.11
▪ Hive 0.10 / HCatalog 0.5
› => Hive 0.12
Data Processing Use cases
▪ Pig for data pipelines
› Imperative paradigm
› ~45% of Hadoop jobs on production clusters
• M/R + Oozie = 41%
▪ Hive for ad hoc queries
› SQL
› Relatively smaller number of jobs
• *Major* uptick
▪ Use HCatalog for inter-op
Hive is Currently the Fastest Growing Product on the Grid
[Chart: monthly totals, Mar-13 through May-14. Series: all grid jobs (in millions, axis 0 to 30) and Hive jobs (% of all jobs, axis 0.0% to 10.0%). Callout: 2.4 million Hive jobs.]
Business Intelligence Tools
▪ {Tableau, MicroStrategy, Excel, … }
▪ Challenges:
› Security
• ACLs, authentication, encryption over the wire, full-disk encryption
› Bandwidth
• Transporting results over ODBC
› Query latency
• Query execution time
• Cost of query “optimizations”
• “Bad” queries
The Benchmark
▪ TPC-h
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen -s 1000 -S 3
• Parallelizable
▪ Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9
Relational Diagram
[Diagram: TPC-h schema. Each entry gives the table, its column prefix, its row count (SF is the scale factor), and its columns.]
PART (P_), SF*200,000 rows: PARTKEY, NAME, MFGR, BRAND, TYPE, SIZE, CONTAINER, RETAILPRICE, COMMENT
PARTSUPP (PS_), SF*800,000 rows: PARTKEY, SUPPKEY, AVAILQTY, SUPPLYCOST, COMMENT
LINEITEM (L_), SF*6,000,000 rows: ORDERKEY, PARTKEY, SUPPKEY, LINENUMBER, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT
ORDERS (O_), SF*1,500,000 rows: ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDER-PRIORITY, CLERK, SHIP-PRIORITY, COMMENT
CUSTOMER (C_), SF*150,000 rows: CUSTKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, MKTSEGMENT, COMMENT
SUPPLIER (S_), SF*10,000 rows: SUPPKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, COMMENT
NATION (N_), 25 rows: NATIONKEY, NAME, REGIONKEY, COMMENT
REGION (R_), 5 rows: REGIONKEY, NAME, COMMENT
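To get a feel for what the scale factor means in rows, here is a minimal sketch that multiplies out the per-table cardinalities from the schema slide. It assumes that the 100GB, 1TB and 10TB runs correspond to scale factors 100, 1000 and 10000 (the deck only shows `-s 1000` explicitly); the function name `table_rows` is made up for illustration.

```python
# Row-count multipliers per unit of scale factor (SF), as listed on the schema slide.
ROWS_PER_SF = {
    "PART": 200_000,
    "PARTSUPP": 800_000,
    "LINEITEM": 6_000_000,
    "ORDERS": 1_500_000,
    "CUSTOMER": 150_000,
    "SUPPLIER": 10_000,
}

# NATION and REGION have a fixed size that does not depend on SF.
FIXED_ROWS = {"NATION": 25, "REGION": 5}


def table_rows(scale_factor):
    """Return the approximate row count of every TPC-h table at a given SF."""
    rows = {table: per_sf * scale_factor for table, per_sf in ROWS_PER_SF.items()}
    rows.update(FIXED_ROWS)
    return rows


if __name__ == "__main__":
    # Assumption: SF 100 / 1000 / 10000 map to the 100GB / 1TB / 10TB datasets.
    for sf in (100, 1000, 10000):
        rows = table_rows(sf)
        print(f"SF={sf}: LINEITEM={rows['LINEITEM']:,} rows, "
              f"all tables={sum(rows.values()):,} rows")
```

LINEITEM dominates at every scale, which is why queries that scan it tend to set the pace of the benchmark.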
The Setup
› 350 Node cluster
• Xeon boxen: 2 Slots with E5530s => 16 CPUs
• 24GB memory
– NUMA enabled
• 6 SATA drives, 2TB, 7200 RPM Seagates
• RHEL 6.4
• JRE 1.7 (-d64)
• Hadoop 0.23.7+/2.3+, Security turned off
• Tez 0.3.x
• 128MB HDFS block-size
› Downscale tests: 100 Node cluster
• hdfs-balancer.sh
The Prep
▪ Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
• insert overwrite table orc_table partition( … ) select * from text_table;
› Partitioning:
• Only for 1TB, 10TB cases
• Perils of dynamic partitioning
› ORC File:
• 64MB stripes, ZLIB Compression
Observations
[Chart: TPC-h 100GB. Time (in seconds, axis 0 to 2500) per query. Series: Hive 0.10 (Text), Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez. Queries: q1_pricing_summary_report, q2_minimum_cost_supplier, q3_shipping_priority, q4_order_priority, q5_local_supplier_volume, q6_forecast_revenue_change, q7_volume_shipping, q8_national_market_share, q9_product_type_profit, q10_returned_item, q11_important_stock, q12_shipping, q13_customer_distribution, q14_promotion_effect, q15_top_supplier, q16_parts_supplier_relationship, q17_small_quantity_order_revenue, q18_large_volume_customer, q19_discounted_revenue, q20_potential_part_promotion, q21_suppliers_who_kept_orders_waiting, q22_global_sales_opportunity.]
100 GB
› 18x speedup over Hive 0.10 (Textfile)
• Between 6x and 50x
› 11.8x speedup over Hive 0.10 (RCFile)
• Between 5x and 30x
› Average query time: 28 seconds
• Down from 530 seconds (Hive 0.10 Text)
› 85% of queries completed in under a minute
[Chart: TPC-h 1TB. Time (in seconds, axis 0 to 2500) for each of the 22 TPC-h queries (q1 to q22). Series: Hive 0.10 RC File, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
1 TB
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5x and 17x
› Average query time: 172 seconds
• Between 5 and 947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes
[Chart: TPC-h 10TB. Time (in seconds, axis 0 to 12000) for each of the 22 TPC-h queries (q1 to q22). Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
10 TB
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 1.6x and 10x
› Average query time: 908 seconds (426 seconds excluding outliers)
• Down from 2129 seconds with Hive 0.10 RCFile
– (1712 seconds excluding outliers)
› 61% of queries completed in under 5 minutes
› 71% of queries completed in under 10 minutes
› Q6 still completes in 12 seconds!
Explaining the speed-ups
▪ Hadoop 2.x, et al.
▪ Tez
› (Arbitrary DAG)-based execution engine
› “Playing the gaps” between M&R
• Temporary data and the HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up
▪ Hive
› Statistics
› “Vector-ized” execution
▪ ORC
› PPD (predicate push-down)
[Chart: Vectorization. Time (in seconds, axis 0 to 1000) for each of the 22 TPC-h queries (q1 to q22). Series: Hive 0.13 Tez ORC, Hive 0.13 Tez ORC Vec.]
ORC File Layout
▪ Data is composed of multiple streams per column
▪ Index allows for skipping rows (defaults to every 10,000 rows), keeping position in each stream, and min-max for each column
▪ Footer contains directory of stream locations, and the encoding for each column
▪ Integer columns are serialized using run-length encoding
▪ String columns are serialized using a dictionary for column values, and the same run-length encoding
▪ Stripe footer is used to find the requested column’s data streams, and adjacent stream reads are merged
[Diagram: an ORC file as a sequence of 256MB stripes, each holding Index Data, Row Data and a Stripe Footer, followed by the File Footer and the Postscript. Index Data and Row Data are broken out into Columns 1 to 8, and Column 2 into Streams 2.1 to 2.4.]
ORC Usage
CREATE TABLE addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
)
STORED AS orc
LOCATION '/path/to/addresses'
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Switch an existing table or partition to ORC
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc;

SET hive.default.fileformat = orc;
-- ORC writer is allowed 50% of JVM heap size by default
SET hive.exec.orc.memory.pool = 0.50;

-- Long form of STORED AS orc
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
Key | Default | Comments
orc.compress | ZLIB | High-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size | 262,144 (256 KB) | Number of bytes in each compression chunk
orc.stripe.size | 67,108,864 (64 MB) | Number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
orc.row.index.stride | 10,000 | Number of rows between index entries (must be >= 1,000). A larger stride makes it less likely that a predicate can skip the stride.
orc.create.index | true | Whether to create row indexes, which are used for predicate push-down. If data is frequently accessed or filtered on a certain column, sorting on that column and using index filters makes column filters work faster.
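The comments on orc.row.index.stride and orc.create.index can be illustrated with a toy model. The sketch below is not ORC's reader; it only assumes what the slides state, namely that the index keeps a min and max per stride and that a stride can be skipped when the predicate's range cannot overlap it. The names `build_index` and `strides_to_read` are hypothetical.

```python
def build_index(values, stride):
    """Record (min, max) for each run of `stride` consecutive rows."""
    index = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        index.append((min(chunk), max(chunk)))
    return index


def strides_to_read(index, lo, hi):
    """Count strides that may hold a value in [lo, hi] and so cannot be skipped."""
    return sum(1 for mn, mx in index if mx >= lo and mn <= hi)


if __name__ == "__main__":
    sorted_col = list(range(100))
    # Same 100 values, interleaved so every run of 10 rows spans almost the whole range.
    interleaved_col = [(i % 10) * 10 + i // 10 for i in range(100)]

    # Predicate: 25 <= x <= 34
    for stride in (10, 50):
        for name, col in (("sorted", sorted_col), ("interleaved", interleaved_col)):
            index = build_index(col, stride)
            hits = strides_to_read(index, 25, 34)
            print(f"stride={stride:>2} {name:<11}: read {hits} of {len(index)} strides "
                  f"({hits * stride} rows)")
```

In this toy data the sorted column lets the reader skip most strides, and the smaller stride touches fewer rows (20 rather than 50), while the interleaved column forces every stride to be read whatever the stride size. That is the reasoning behind sorting on a frequently filtered column.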
[Chart: Effects of Compression (1TB). Time (in seconds, axis 0 to 1000) for each of the 22 TPC-h queries (q1 to q22). Series: Hive 0.13 Uncompressed ORC, Hive 0.13 ZLIB Compressed.]
[Chart: Effects of Compression (10TB). Time (in seconds, axis 0 to 3000) for each of the 22 TPC-h queries (q1 to q22). Series: Hive 0.13 Uncompressed, Hive 0.13 Compressed.]
Configuring ORC
▪ set hive.merge.mapredfiles=true
▪ set hive.merge.mapfiles=true
▪ set orc.stripe.size=67108864
› 64 MB: half the HDFS block-size
• Tangent: nStripes vs nBlocks
• Tangent: DistCp
▪ set orc.compress=???
› Depends on size and distribution
› Snappy compression hasn’t been explored
▪ YMMV
› Experiment
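The "nStripes vs nBlocks" tangent comes down to simple arithmetic. This is a rough sketch that assumes every stripe and block is filled to its configured size, which real ORC files will not do exactly; the 1 GB file size and the helper names are made up for illustration.

```python
MB = 1024 * 1024

STRIPE_SIZE = 64 * MB    # orc.stripe.size from the slide (67,108,864 bytes)
BLOCK_SIZE = 128 * MB    # HDFS block size from the setup slide


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def stripe_count(file_bytes, stripe_bytes=STRIPE_SIZE):
    """Number of stripes, assuming every stripe is filled to its configured size."""
    return ceil_div(file_bytes, stripe_bytes)


def block_count(file_bytes, block_bytes=BLOCK_SIZE):
    """Number of HDFS blocks the file occupies."""
    return ceil_div(file_bytes, block_bytes)


if __name__ == "__main__":
    file_bytes = 1024 * MB  # hypothetical 1 GB ORC file
    print(f"stripes per block: {BLOCK_SIZE // STRIPE_SIZE}")
    print(f"nStripes={stripe_count(file_bytes)}, nBlocks={block_count(file_bytes)}")
```

With these settings a stripe is half a block, so a file has about twice as many stripes as blocks. Since each stripe is processed in one map task, stripe size rather than block size drives the task count.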
[Chart: 100 vs 350 Nodes. Time (in seconds, axis 0 to 1000) for each of the 22 TPC-h queries (q1 to q22). Series: Hive 0.13 100 Nodes, Hive 0.13 350 Nodes.]
Conclusions
Y!Grid sticking with Hive
▪ Familiarity
› Existing ecosystem
▪ Community
▪ Scale
▪ Multitenant
▪ Coming down the pike
› CBO
› In-memory caching solutions atop HDFS
• RAMfs a la Tachyon?
We’re not done yet
▪ SQL compliance
▪ Scaling up the metastore performance
▪ Better BI Tool integration
▪ Faster transport
› HiveServer2 result-sets
References
▪ The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
▪ Code:
› https://github.com/mythrocks/hivebench (TPC-h scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-h gen)
› https://github.com/rxin/TPC-H-Hive (TPC-h scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-h JIRA)
Thank You
@mithunrk
mithun@apache.org
We are hiring!
Reach out to us at
bigdata@yahoo-inc.com.
I’m glad you asked.
Sharky comments
▪ Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets: Admirable performance
› 1TB/10TB: Tests did not run completely
• Failures, especially in 10TB cases
• Hangs while shuffling data
• Scaled back to 100 nodes -> More tests ran through, but not completely
› nReducers: Not inferred
▪ Miscellany
› Security
› Multi-tenancy
› Compatibility

Contenu connexe

Similaire à June 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale

Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
Risk Analysis in the Financial Services Industry
Risk Analysis in the Financial Services IndustryRisk Analysis in the Financial Services Industry
Risk Analysis in the Financial Services IndustryRevolution Analytics
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Hadoop Webinar 28July15
Hadoop Webinar 28July15Hadoop Webinar 28July15
Hadoop Webinar 28July15Edureka!
 
Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Edureka!
 
Apache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataApache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataLuke Han
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IEdureka!
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQLYu Ishikawa
 
Blueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data PlatformBlueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data PlatformMatt Stubbs
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!Cloudera, Inc.
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009yhadoop
 
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...VoltDB
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analyticsMariaDB plc
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
 

Similaire à June 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale (20)

Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Risk Analysis in the Financial Services Industry
Risk Analysis in the Financial Services IndustryRisk Analysis in the Financial Services Industry
Risk Analysis in the Financial Services Industry
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Hadoop Webinar 28July15
Hadoop Webinar 28July15Hadoop Webinar 28July15
Hadoop Webinar 28July15
 
Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?
 
Apache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataApache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big Data
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL
 
Blueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data PlatformBlueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data Platform
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
 

Plus de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Plus de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
June 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale

  • 1. Benchmarking Hive at Yahoo Scale. Presented by Mithun Radhakrishnan | June 18, 2014 | Hadoop User Group
  • 2. About myself ▪ HCatalog Committer, Hive contributor › Metastore, Notifications, HCatalog APIs › Integration with Oozie, Data Ingestion ▪ Other odds and ends › DistCp ▪ mithun@apache.org Hadoop User Group, 201406181830, Yahoo Sunnyvale
  • 3. About this talk ▪ Introduction to “Yahoo Scale” ▪ The use-case in Yahoo ▪ The Benchmark ▪ The Setup ▪ The Observations (and, possibly, lessons) ▪ Fisticuffs
  • 4. The Y!Grid ▪ 16 Hadoop Clusters in YGrid › 32500 Nodes › 750K jobs a day ▪ Hadoop 0.23.10.x, 2.4.x ▪ Large Datasets › Daily, hourly, minute-level frequencies › Terabytes of data, 1000s of files, per dataset instance ▪ Pig 0.11 ▪ Hive 0.10 / HCatalog 0.5 › => Hive 0.12
  • 5. Data Processing Use cases ▪ Pig for Data Pipelines › Imperative paradigm › ~45% Hadoop Jobs on Production Clusters • M/R + Oozie = 41% ▪ Hive for Ad hoc queries › SQL › Relatively smaller number of jobs • *Major* Uptick ▪ Use HCatalog for Inter-op
  • 6. Yahoo Confidential & Proprietary. Hive is Currently the Fastest Growing Product on the Grid. [Chart, Mar-13 to May-14: All Grid Jobs (in millions, axis 0 to 30) and Hive Jobs (% of all jobs, axis 0.0% to 10.0%). Callout: 2.4 million Hive jobs]
  • 7. Business Intelligence Tools ▪ {Tableau, MicroStrategy, Excel, … } ▪ Challenges: › Security • ACLs, Authentication, Encryption over the wire, Full-disk Encryption › Bandwidth • Transporting results over ODBC › Query Latency • Query execution time • Cost of query “optimizations” • “Bad” queries
  • 8. The Benchmark ▪ TPC-h › Industry standard (tpc.org/tpch) › 22 queries › dbgen -s 1000 -S 3 • Parallelizable ▪ Reynold Xin’s excellent work: › https://github.com/rxin › Transliterated queries to suit Hive 0.9
  • 9. Relational Diagram (tables with column prefix, row count, and columns; SF = scale factor)
      PART (P_), SF*200,000: PARTKEY, NAME, MFGR, BRAND, TYPE, SIZE, CONTAINER, RETAILPRICE, COMMENT
      PARTSUPP (PS_), SF*800,000: PARTKEY, SUPPKEY, AVAILQTY, SUPPLYCOST, COMMENT
      SUPPLIER (S_), SF*10,000: SUPPKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, COMMENT
      LINEITEM (L_), SF*6,000,000: ORDERKEY, PARTKEY, SUPPKEY, LINENUMBER, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT
      ORDERS (O_), SF*1,500,000: ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDER-PRIORITY, CLERK, SHIP-PRIORITY, COMMENT
      CUSTOMER (C_), SF*150,000: CUSTKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, MKTSEGMENT, COMMENT
      NATION (N_), 25: NATIONKEY, NAME, REGIONKEY, COMMENT
      REGION (R_), 5: REGIONKEY, NAME, COMMENT
  • 10. The Setup › 350 Node cluster • Xeon boxen: 2 Slots with E5530s => 16 CPUs • 24GB memory, NUMA enabled • 6 SATA drives, 2TB, 7200 RPM Seagates • RHEL 6.4 • JRE 1.7 (-d64) • Hadoop 0.23.7+/2.3+, Security turned off • Tez 0.3.x • 128MB HDFS block-size › Downscale tests: 100 Node cluster • hdfs-balancer.sh
  • 11. The Prep ▪ Data generation: › Text data: dbgen on MapReduce › Transcode to RCFile and ORC: Hive on MR • insert overwrite table orc_table partition( … ) select * from text_table; › Partitioning: • Only for 1TB, 10TB cases • Perils of dynamic partitioning › ORC File: • 64MB stripes, ZLIB Compression
  • 13. [Chart: TPC-h 100GB. Time (in seconds, axis 0 to 2500) per query script: q1_pricing_summary_report, q2_minimum_cost_supplier, q3_shipping_priority, q4_order_priority, q5_local_supplier_volume, q6_forecast_revenue_change, q7_volume_shipping, q8_national_market_share, q9_product_type_profit, q10_returned_item, q11_important_stock, q12_shipping, q13_customer_distribution, q14_promotion_effect, q15_top_supplier, q16_parts_supplier_relationship, q17_small_quantity_order_revenue, q18_large_volume_customer, q19_discounted_revenue, q20_potential_part_promotion, q21_suppliers_who_kept_orders_waiting, q22_global_sales_opportunity. Series: Hive 0.10 (Text), Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez]
  • 14. 100 GB › 18x speedup over Hive 0.10 (Textfile) • 6-50x › 11.8x speedup over Hive 0.10 (RCFile) • 5-30x › Average query time: 28 seconds • Down from 530 (Hive 0.10 Text) › 85% queries completed in under a minute
  • 15. [Chart: TPC-h 1TB. Time (in seconds, axis 0 to 2500) per query, q1 through q22. Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez]
  • 16. 1 TB › 6.2x speedup over Hive 0.10 (RCFile) • Between 2.5-17x › Average query time: 172 seconds • Between 5-947 seconds • Down from 729 seconds (Hive 0.10 RCFile) › 61% queries completed in under 2 minutes › 81% queries completed in under 4 minutes
  • 17. [Chart: TPC-h 10TB. Time (in seconds, axis 0 to 12000) per query, q1 through q22. Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez]
  • 18. 10 TB › 6.2x speedup over Hive 0.10 (RCFile) • Between 1.6-10x › Average query time: 908 seconds (426 seconds excluding outliers) • Down from 2129 seconds with Hive 0.10 RCFile (1712 seconds excluding outliers) › 61% queries completed in under 5 minutes › 71% queries completed in under 10 minutes › Q6 still completes in 12 seconds!
  • 19. Explaining the speed-ups ▪ Hadoop 2.x, et al. ▪ Tez › (Arbitrary DAG)-based Execution Engine › “Playing the gaps” between M&R • Temporary data and the HDFS › Feedback loop › Smart scheduling › Container re-use › Pipelined job start-up ▪ Hive › Statistics › “Vector-ized” Execution ▪ ORC › PPD (predicate push-down)
  • 20. [Chart: Vectorization. Time (in seconds, axis 0 to 1000) per query, q1 through q22. Series: Hive 0.13 Tez ORC, Hive 0.13 Tez ORC Vec]
  • 21. ORC File Layout ▪ Data is composed of multiple streams per column ▪ Index allows for skipping rows (default to every 10,000 rows), keeping position in each stream, and min-max for each column ▪ Footer contains directory of stream locations, and the encoding for each column ▪ Integer columns are serialized using run-length encoding ▪ String columns are serialized using dictionary for column values, and the same run-length encoding ▪ Stripe footer is used to find the requested column’s data streams and adjacent stream reads are merged [Diagram: an ORC file as a sequence of 256MB stripes, each holding Index Data, Row Data (Columns 1 to 8; Column 2 shown split into Streams 2.1 to 2.4) and a Stripe Footer, followed by the File Footer and Postscript]
  • 22. ORC Usage
      CREATE TABLE addresses ( name string, street string, city string, state string, zip int )
        STORED AS orc
        LOCATION '/path/to/addresses'
        TBLPROPERTIES ("orc.compress"="ZLIB");
      ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc;
      SET hive.default.fileformat = orc;
      SET hive.exec.orc.memory.pool = 0.50;   (ORC writer is allowed 50% of JVM heap size by default)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
      Key | Default | Comments
      orc.compress | ZLIB | high-level compression (one of NONE, ZLIB, SNAPPY)
      orc.compress.size | 262,144 (256 KB) | number of bytes in each compression chunk
      orc.stripe.size | 67,108,864 (64 MB) | number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
      orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1,000). A larger stride makes it more likely that a stride cannot be skipped for a given predicate.
      orc.create.index | true | whether to create row indexes. This is for predicate push-down. If data is frequently accessed/filtered on a certain column, then sorting on that column and using index-filters makes column filters work faster
  • 23. [Chart: Effects of Compression (1TB). Time (in seconds, axis 0 to 1000) per query, q1 through q22. Series: Hive 0.13 Uncompressed ORC, Hive 0.13 ZLIB Compressed]
  • 24. [Chart: Effects of Compression (10TB). Time (in seconds, axis 0 to 3000) per query, q1 through q22. Series: Hive 0.13 Uncompressed, Hive 0.13 Compressed]
  • 25. Configuring ORC ▪ set hive.merge.mapredfiles=true ▪ set hive.merge.mapfiles=true ▪ set orc.stripe.size=67108864 (64 MB) › Half the HDFS block-size • Tangent: nStripes vs nBlocks • Tangent: DistCp ▪ set orc.compress=??? › Depends on size and distribution › Snappy compression hasn’t been explored ▪ YMMV › Experiment
  • 26. [Chart: 100 vs 350 Nodes. Time (in seconds, axis 0 to 1000) per query, q1 through q22. Series: Hive 0.13 100 Nodes, Hive 0.13 350 Nodes]
  • 28. Y!Grid sticking with Hive ▪ Familiarity › Existing ecosystem ▪ Community ▪ Scale ▪ Multitenant ▪ Coming down the pike › CBO › In-memory caching solutions atop HDFS • RAMfs a la Tachyon?
  • 29. We’re not done yet ▪ SQL compliance ▪ Scaling up the metastore performance ▪ Better BI Tool integration ▪ Faster transport › HiveServer2 result-sets
  • 30. References ▪ The YDN blog post: › http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn ▪ Code: › https://github.com/mythrocks/hivebench (TPC-h scripts, datagen, transcode utils) › https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-h gen) › https://github.com/rxin/TPC-H-Hive (TPC-h scripts for Hive) › https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-h JIRA)
  • 31. Thank You @mithunrk mithun@apache.org We are hiring! Reach out to us at bigdata@yahoo-inc.com.
  • 32. I’m glad you asked.
  • 33. Sharky comments ▪ Testing with Shark 0.7.x and Shark 0.8 › Compatible with Hive Metastore 0.9 › 100GB datasets: Admirable performance › 1TB/10TB: Tests did not run completely • Failures, especially in 10TB cases • Hangs while shuffling data • Scaled back to 100 nodes -> More tests ran through, but not completely › nReducers: Not inferred ▪ Miscellany › Security › Multi-tenancy › Compatibility
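
Slides 21 and 22 say that the ORC row index keeps a min-max per column for every stride of rows, that a larger stride makes skipping less likely, and that sorting on a frequently filtered column helps. The Python sketch below is a toy model of that idea, not ORC's actual reader: the column data, the function names build_row_index and groups_to_read, and the equality predicate are all assumptions made up for illustration.

```python
from typing import List, Tuple


def build_row_index(values: List[int], stride: int) -> List[Tuple[int, int]]:
    """Return one (min, max) pair per group of `stride` consecutive rows."""
    return [
        (min(values[i:i + stride]), max(values[i:i + stride]))
        for i in range(0, len(values), stride)
    ]


def groups_to_read(index: List[Tuple[int, int]], lo: int, hi: int) -> int:
    """Count row groups whose [min, max] range overlaps lo <= x <= hi.

    Groups that do not overlap can be skipped without reading their rows.
    """
    return sum(1 for gmin, gmax in index if gmax >= lo and gmin <= hi)


# Hypothetical column: 100,000 rows holding values 0..99 in interleaved order.
unsorted_col = [i % 100 for i in range(100_000)]
sorted_col = sorted(unsorted_col)

for name, col in (("unsorted", unsorted_col), ("sorted", sorted_col)):
    for stride in (1_000, 10_000):
        index = build_row_index(col, stride)
        hit = groups_to_read(index, lo=42, hi=42)
        print(f"{name:8s} stride={stride:6d}: "
              f"read {hit} of {len(index)} groups ({hit * stride} rows)")
```

In this toy data every unsorted row group spans the full value range, so nothing can be skipped at either stride. Once the column is sorted, only one group overlaps the predicate, and the smaller stride means that group contains fewer rows.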

Editor's notes

  1. Gopal was supposed to be presenting this with me, to talk about Tez. Point to Gopal/Jitendra’s talk on Hive/Tez for details on things I’ll have to skim over. Also, acknowledge Thomas Graves, who’s talking today about the excellent work he’s doing on driving Spark on YARN.
  2. There are several sides to query latency. Query execution time: Addressed in the physical query-execution layer. Query optimizations: The first step while optimizing the query plan seems to be to query for all partition instances. Very expensive for “Project Benzene”. Bad queries: Tableau, I’m looking at you.
  3. The Transaction Processing Performance Council (inexplicably abbreviated to TPC) suggests a set of benchmarks for query processing. Many have adopted TPC-DS to showcase performance. We chose TPC-h to complement. (Also, 22 is a much smaller number of queries to deal with than… 90?) Transliteration: Evita and Kylie Minogue
  4. Lineitem and Orders are extremely large Fact tables. Nation and Region are the smallest dimension tables.
  5. Tangent: Funny story: 1. About hard-drives: Can set up MR intermediate directories and HDFS data-node directories to be on different disks. Traffic from one doesn’t affect the other. But on the other hand, total read bandwidth might be reduced.
  6. Lineitem: Partitioned on ship-date. Orders: On order-date. Customers: By market-segment. Suppliers: On their region-key.
  7. Q5 and Q21 are anomalous. Q21: Hit a trailing reducer across all versions of Hive tested. Perhaps this can be improved with a better plan. Q5: Slow reducer that hit only Hive 0.13. Could be a bad plan. Could be a difference in data distribution when data was regenerated for the Hadoop 2 cluster.
  8. Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
  9. Vectorization: 1.2x speedup, on average.
  10. Except for a few outliers, ZLIB compression actually reduced performance for a 1TB dataset. Uncompressed was 1.3x faster than Compressed.
  11. The situation reverses at the 10 TB level. The cost of decompression is more than offset by the reduced disk-read time. The long tail in 10TB/q21 threw the scale of the graph off, so I’ve excluded it from the results.
  12. Talk about file-coalesce, small-file generation, Namenode pressure and parallelism. You don’t want to read an ORC stripe from a different node. Talk about distcp -pgrub, for ORC files. Mention that SNAPPY’s license is not Apache. Also, Yoda.
  13. At 100 nodes, it performs at 0.9x the 350 node performance.
  14. We’ve seen Hive and Tez scale down for latency, scale up for data-size, and scale out across larger clusters.
  15. Familiarity: We have an existing ecosystem with Hive, HCatalog, Pig and Oozie that delivers revenue to Yahoo today. It’s hard to rock the boat. Community: The Apache Hive community is large, active and thriving. They’ve been solving issues with query latency for ages now. The switch to using the Tez execution engine was a solution within the Apache Hive project. This wasn’t a fork of Hive. This is Hive, proper. Scale: We’ve seen Hive and Tez perform at scale. Heck, we’ve seen Pig perform on Tez. Multitenant: Yahoo’s use-case is unique, and not just because of data-scale. There are hundreds of active users and genuine multitenancy and security concerns. Design: We think the Hive community has tackled the right problems first, rather than throw RAM at the problem.
  16. Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
  17. Security: Kerberos support was patched in after the benchmarks were run. Multi-tenancy: Data needs to be explicitly pinned into memory as RDDs. In a multi-tenant system, how would pinning work? Eviction policy for data. Compatibility: Needs to work with Metastore versions 0.12 and 0.13. Shark’s gone to 0.11 just recently. Integration with the rest of the stack: Oozie and Pig. Overall, we wanted a solution that works with high dynamic range, i.e. one that works well with small datasets (100s of GBs) and also scales to multi-terabyte datasets. We have a familiar system that seems to fit that bill. It doesn’t quite rock the boat. It’s not perfect yet. There are bugs that we’re working on. And we still haven’t solved the problem of data-volume/BI. By the way, I really like the idea of BlinkDB. I saw the JIRA.
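
Note 12 warns against reading an ORC stripe from a different node, and slide 25 suggests a stripe size of half the HDFS block size. One way to see why is to count how many stripes would cross a block boundary. The Python sketch below is an idealized model that assumes fixed-size stripes packed back to back from offset 0; real stripes vary in size, and the function name straddling_stripes and the 1 GB file are hypothetical.

```python
MB = 1024 * 1024


def straddling_stripes(file_size: int, stripe_size: int, block_size: int) -> int:
    """Count stripes that cross an HDFS block boundary.

    Assumes fixed-size stripes laid out back to back from offset 0,
    which is an idealization of a real ORC file.
    """
    count = 0
    for start in range(0, file_size, stripe_size):
        last_byte = min(start + stripe_size, file_size) - 1
        # A stripe straddles when its first and last bytes fall in different blocks.
        if start // block_size != last_byte // block_size:
            count += 1
    return count


file_size = 1024 * MB   # hypothetical 1 GB ORC file
block_size = 128 * MB   # block size from the benchmark setup (slide 10)

for stripe_mb in (64, 96, 256):
    n = straddling_stripes(file_size, stripe_mb * MB, block_size)
    print(f"{stripe_mb:3d} MB stripes: {n} cross a block boundary")
```

Under these assumptions, stripes whose size divides the block size never cross a boundary, while any other size leaves some stripes split across two blocks that may live on different nodes. This is also why the block size has to survive a copy: the b in distcp -pgrub preserves it.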