SlideShare une entreprise Scribd logo
1  sur  40
Copyright © 2013 Cloudera Inc. All rights reserved.
Headline Goes Here
Speaker Name or Subhead Goes Here
Hadoop Beyond Batch: 

Real-time Workloads, SQL-on-
Hadoop, and the Virtual EDW
Marcel Kornacker | marcel@cloudera.com 
April 2014
Copyright © 2013 Cloudera Inc. All rights reserved.
Analytic Workloads on Hadoop: Where Do
We Stand?
!2
“DeWitt Clause” prohibits
using DBMS vendor name
Copyright © 2013 Cloudera Inc. All rights reserved.
Hadoop for Analytic Workloads
•Hadoop has traditional been utilized for offline batch processing:
ETL and ELT
•Next step: Hadoop for traditional business intelligence (BI)/data
warehouse (EDW) workloads:
•interactive
•concurrent users
•Topic of this talk: a Hadoop-based open-source stack for EDW
workloads:
•HDFS: a high-performance storage system
•Parquet: a state-of-the-art columnar storage format
•Impala: a modern, open-source SQL engine for Hadoop
!3
Copyright © 2013 Cloudera Inc. All rights reserved.
Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in Hadoop stack
•Hadoop stack is effective solution for certain EDW workloads
•Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
!4
Copyright © 2013 Cloudera Inc. All rights reserved.
HDFS: A Storage System for Analytic
Workloads
•Available in Hdfs today:
•high-efficiency data scans at or near hardware speed, both
from disk and memory
•On the immediate roadmap:
•co-partitioned tables for even faster distributed joins
•temp-FS: write temp table data straight to memory,
bypassing disk

!5
Copyright © 2013 Cloudera Inc. All rights reserved.
HDFS: The Details
•High efficiency data transfers
•short-circuit reads: bypass DataNode protocol when reading
from local disk

-> read at 100+MB/s per disk
•HDFS caching: access explicitly cached data w/o copy or
checksumming

-> access memory-resident data at memory bus speed

-> enable in-memory processing
!6
Copyright © 2013 Cloudera Inc. All rights reserved.
HDFS: The Details
•Coming attractions:
•affinity groups: collocate blocks from different files

-> create co-partitioned tables for improved join
performance
•temp-fs: write temp table data straight to memory,
bypassing disk

-> ideal for iterative interactive data analysis
!7
Copyright © 2013 Cloudera Inc. All rights reserved.
Parquet: Columnar Storage for Hadoop
•What it is:
•state-of-the-art, open-source columnar file format that’s
available for (most) Hadoop processing frameworks:

Impala, Hive, Pig, MapReduce, Cascading, …
•offers both high compression and high scan efficiency
•co-developed by Twitter and Cloudera; hosted on github and
soon to be an Apache incubator project
•with contributors from Criteo, Stripe, Berkeley AMPlab,
LinkedIn
•used in production at Twitter and Criteo
!8
Copyright © 2013 Cloudera Inc. All rights reserved.
Parquet: The Details
•columnar storage: column-major instead of the traditional
row-major layout; used by all high-end analytic DBMSs
•optimized storage of nested data structures: patterned
after Dremel’s ColumnIO format
•extensible set of column encodings:
•run-length and dictionary encodings in current version (1.2)
•delta and optimized string encodings in 2.0
•embedded statistics: version 2.0 stores inlined column
statistics for further optimization of scan efficiency
!9
Copyright © 2013 Cloudera Inc. All rights reserved.
Parquet: Storage Efficiency
!10
Copyright © 2013 Cloudera Inc. All rights reserved.
Parquet: Scan Efficiency
!11
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala: A Modern, Open-Source SQL Engine
•implementation of an MPP SQL query engine for the Hadoop
environment
•highest-performance SQL engine for the Hadoop ecosystem;

already outperforms some of its commercial competitors
•effective for EDW-style workloads
•maintains Hadoop flexibility by utilizing standard Hadoop
components (HDFS, Hbase, Metastore, Yarn)
•plays well with traditional BI tools:

exposes/interacts with industry-standard interfaces (odbc/
jdbc, Kerberos and LDAP, ANSI SQL)
!12
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala: A Modern, Open-Source SQL Engine
•history:
•developed by Cloudera and fully open-source; hosted on
github
•released as beta in 10/2012
•1.0 version available in 05/2013
•current version is 1.2.3, available for CDH4 and CDH5 beta
!13
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala from The User’s Perspective
•create tables as virtual views over data stored in HDFS
or Hbase;

schema metadata is stored in Metastore (shared with
Hive, Pig, etc.; basis of HCatalog)
•connect via odbc/jdbc; authenticate via Kerberos or
LDAP
•run standard SQL:
•current version: ANSI SQL-92 (limited to SELECT and bulk
insert) minus correlated subqueries, has UDFs and UDAs
!14
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala from The User’s Perspective
•2014 roadmap:
•1.3: admission control, Order By without Limit,
Decimal(<precision>, <scale>)
•1.4: analytic window functions
•2.0: support for nested types (structs, arrays, maps), UDTFs,
disk-based joins and aggregation
!15
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Architecture
•distributed service:
•daemon process (impalad) runs on every node with data
•easily deployed with Cloudera Manager
•each node can handle user requests; load balancer
configuration for multi-user environments recommended
•query execution phases:
•client request arrives via odbc/jdbc
•planner turns request into collection of plan fragments
•coordinator initiates execution on remote impala’s
!16
Copyright © 2013 Cloudera Inc. All rights reserved.
• Request arrives via odbc/jdbc
Impala Query Execution
!17
Copyright © 2013 Cloudera Inc. All rights reserved.
• Planner turns request into collection of plan fragments
• Coordinator initiates execution on remote impalad nodes
Impala Query Execution
!18
Copyright © 2013 Cloudera Inc. All rights reserved.
• Intermediate results are streamed between impala’s
• Query results are streamed back to client
Impala Query Execution
!19
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Architecture: Query Planning
•2-phase process:
•single-node plan: left-deep tree of query operators
•partitioning into plan fragments for distributed parallel
execution:

maximize scan locality/minimize data movement, parallelize
all query operators
•cost-based join order optimization
•cost-based join distribution optimization
!20
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Architecture: Query Execution
•execution engine designed for efficiency, written from scratch
in C++; no reuse of decades-old open-source code
•circumvents MapReduce completely
•in-memory execution:
•aggregation results and right-hand side inputs of joins are
cached in memory
•example: join with 1TB table, reference 2 of 200 cols, 10% of
rows 

-> need to cache 1GB across all nodes in cluster

-> not a limitation for most workloads
!21
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Architecture: Query Execution
•runtime code generation:
•uses llvm to jit-compile the runtime-intensive parts of a
query
•effect the same as custom-coding a query:
•remove branches
•propagate constants, offsets, pointers, etc.
•inline function calls
•optimized execution for modern CPUs (instruction pipelines)
!22
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Architecture: Query Execution
!23
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala vs MR for Analytic Workloads
•Impala vs. SQL-on-MR
•Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”)
•file formats: Parquet/ORCfile
•TPC-DS, 3TB data set running on 5-node cluster
!24
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala vs MR for Analytic Workloads
• Impala speedup:
• interactive: 8-69x
• report: 6-68x
• deep analytics:
10-58x
!25
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala vs non-MR for Analytic Workloads
•Impala 1.2.3/Presto 0.6/Shark
•file formats: RCfile (+ Parquet)
•TPC-DS, 15TB data set running on 21-node cluster
!26
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala vs non-MR for Analytic Workloads
!27
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala vs non-MR for Analytic Workloads
!28
• Multi-user benchmark:
• 10 users concurrently
• same dataset, same
hardware
• workload: queries from
“interactive” group
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala vs non-MR for Analytic Workloads
!29
Copyright © 2013 Cloudera Inc. All rights reserved.
Scalability in Hadoop
•Hadoop’s promise of linear scalability: add more
nodes to cluster, gain a proportional increase in
capabilities

-> adapt to any kind of workload changes simply by
adding more nodes to cluster
•scaling dimensions for EDW workloads:
•response time
•concurrency/query throughput
•data size
!30
Copyright © 2013 Cloudera Inc. All rights reserved.
Scalability in Hadoop
•Scalability results for Impala:
•tests show linear scaling along all 3 dimensions
•setup:
•2 clusters: 18 and 36 nodes
•15TB TPC-DS data set
•6 “interactive” TPC-DS queries
!31
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Scalability: Latency
!32
Copyright © 2013 Cloudera Inc. All rights reserved.
• Comparison: 10 vs 20 concurrent users
Impala Scalability: Concurrency
!33
Copyright © 2013 Cloudera Inc. All rights reserved.
Impala Scalability: Data Size
• Comparison: 15TB vs. 30TB data set
!34
Copyright © 2013 Cloudera Inc. All rights reserved.
Summary: Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in Hadoop stack
•Impala/Parquet/Hdfs is effective solution for certain EDW
workloads
•Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
!35
Copyright © 2013 Cloudera Inc. All rights reserved.
Summary: Hadoop for Analytic Workloads
•latest technological innovations add capabilities that
originated in high-end proprietary systems:
•high-performance disk scans and memory caching in HDFS
•Parquet: columnar storage for analytic workloads
•Impala: high-performance parallel SQL execution
!36
Copyright © 2013 Cloudera Inc. All rights reserved.
Summary: Hadoop for Analytic Workloads
•Impala/Parquet/Hdfs for EDW workloads:
•integrates into BI environment via standard connectivity and
security
•comparable or better performance than commercial
competitors
•currently still SQL limitations
•but those are rapidly diminishing
!37
Copyright © 2013 Cloudera Inc. All rights reserved.
Summary: Hadoop for Analytic Workloads
•Impala/Parquet/Hdfs maintains traditional Hadoop
strengths:
•flexibility: Parquet is understood across the platform, natively
processed by most popular frameworks
•demonstrated scalability and cost effectiveness
!38
The End
!39
Copyright © 2013 Cloudera Inc. All rights reserved.
Summary: Hadoop for Analytic Workloads
•what the future holds:
•further performance gains
•more complete SQL capabilities
•improved resource mgmt and ability to handle multiple
concurrent workloads in a single cluster
!40

Contenu connexe

Tendances

Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaData Science London
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARNWangda Tan
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Scott Leberknight
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldDataWorks Summit
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Cloudera, Inc.
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Hortonworks
 

Tendances (20)

Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 

En vedette

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsCloudera, Inc.
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLCloudera, Inc.
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...Kai Wähner
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten IchibaRakuten Group, Inc.
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search TrainingCloudera, Inc.
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseGwen (Chen) Shapira
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaLucidworks
 
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsWhat Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsCloudera, Inc.
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Caserta
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data WarehouseCaserta
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchCloudera, Inc.
 
Design in Tech Report 2017
Design in Tech Report 2017Design in Tech Report 2017
Design in Tech Report 2017John Maeda
 

En vedette (14)

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
 
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsWhat Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
 
Design in Tech Report 2017
Design in Tech Report 2017Design in Tech Report 2017
Design in Tech Report 2017
 

Similaire à Real-time SQL-on-Hadoop, Parquet and Impala for EDW Workloads

Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisFelicia Haggarty
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Timothy Spann
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
 

Similaire à Real-time SQL-on-Hadoop, Parquet and Impala for EDW Workloads (20)

Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 

Plus de Swiss Big Data User Group

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useSwiss Big Data User Group
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorSwiss Big Data User Group
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisSwiss Big Data User Group
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesSwiss Big Data User Group
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningSwiss Big Data User Group
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseSwiss Big Data User Group
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexitySwiss Big Data User Group
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceSwiss Big Data User Group
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketSwiss Big Data User Group
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridSwiss Big Data User Group
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseSwiss Big Data User Group
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computingSwiss Big Data User Group
 

Plus de Swiss Big Data User Group (20)

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to use
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operator
 
Data Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2CData Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2C
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data Analysis
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Educating Data Scientists of the Future
Educating Data Scientists of the FutureEducating Data Scientists of the Future
Educating Data Scientists of the Future
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data Warehouse
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexity
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
In-Store Analysis with Hadoop
In-Store Analysis with HadoopIn-Store Analysis with Hadoop
In-Store Analysis with Hadoop
 
Big Data Visualization With ParaView
Big Data Visualization With ParaViewBig Data Visualization With ParaView
Big Data Visualization With ParaView
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Oracle's BigData solutions
Oracle's BigData solutionsOracle's BigData solutions
Oracle's BigData solutions
 

Dernier

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Dernier (20)

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Real-time SQL-on-Hadoop, Parquet and Impala for EDW Workloads

  • 1. Copyright © 2013 Cloudera Inc. All rights reserved. Headline Goes Here Speaker Name or Subhead Goes Here Hadoop Beyond Batch: 
 Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Marcel Kornacker | marcel@cloudera.com April 2014
  • 2. Copyright © 2013 Cloudera Inc. All rights reserved. Analytic Workloads on Hadoop: Where Do We Stand? !2 “DeWitt Clause” prohibits using DBMS vendor name
  • 3. Copyright © 2013 Cloudera Inc. All rights reserved. Hadoop for Analytic Workloads •Hadoop has traditional been utilized for offline batch processing: ETL and ELT •Next step: Hadoop for traditional business intelligence (BI)/data warehouse (EDW) workloads: •interactive •concurrent users •Topic of this talk: a Hadoop-based open-source stack for EDW workloads: •HDFS: a high-performance storage system •Parquet: a state-of-the-art columnar storage format •Impala: a modern, open-source SQL engine for Hadoop !3
  • 4. Copyright © 2013 Cloudera Inc. All rights reserved. Hadoop for Analytic Workloads •Thesis of this talk: •techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in Hadoop stack •Hadoop stack is effective solution for certain EDW workloads •Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness !4
  • 5. Copyright © 2013 Cloudera Inc. All rights reserved. HDFS: A Storage System for Analytic Workloads •Available in Hdfs today: •high-efficiency data scans at or near hardware speed, both from disk and memory •On the immediate roadmap: •co-partitioned tables for even faster distributed joins •temp-FS: write temp table data straight to memory, bypassing disk
 !5
  • 6. Copyright © 2013 Cloudera Inc. All rights reserved. HDFS: The Details •High efficiency data transfers •short-circuit reads: bypass DataNode protocol when reading from local disk
 -> read at 100+MB/s per disk •HDFS caching: access explicitly cached data w/o copy or checksumming
 -> access memory-resident data at memory bus speed
 -> enable in-memory processing !6
  • 7. Copyright © 2013 Cloudera Inc. All rights reserved. HDFS: The Details •Coming attractions: •affinity groups: collocate blocks from different files
 -> create co-partitioned tables for improved join performance •temp-fs: write temp table data straight to memory, bypassing disk
 -> ideal for iterative interactive data analysis !7
  • 8. Copyright © 2013 Cloudera Inc. All rights reserved. Parquet: Columnar Storage for Hadoop •What it is: •state-of-the-art, open-source columnar file format that’s available for (most) Hadoop processing frameworks:
 Impala, Hive, Pig, MapReduce, Cascading, … •offers both high compression and high scan efficiency •co-developed by Twitter and Cloudera; hosted on github and soon to be an Apache incubator project •with contributors from Criteo, Stripe, Berkeley AMPlab, LinkedIn •used in production at Twitter and Criteo !8
  • 9. Copyright © 2013 Cloudera Inc. All rights reserved. Parquet: The Details •columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs •optimized storage of nested data structures: patterned after Dremel’s ColumnIO format •extensible set of column encodings: •run-length and dictionary encodings in current version (1.2) •delta and optimized string encodings in 2.0 •embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency !9
  • 10. Copyright © 2013 Cloudera Inc. All rights reserved. Parquet: Storage Efficiency !10
  • 11. Copyright © 2013 Cloudera Inc. All rights reserved. Parquet: Scan Efficiency !11
  • 12. Copyright © 2013 Cloudera Inc. All rights reserved. Impala: A Modern, Open-Source SQL Engine •implementation of an MPP SQL query engine for the Hadoop environment •highest-performance SQL engine for the Hadoop ecosystem;
 already outperforms some of its commercial competitors •effective for EDW-style workloads •maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, Hbase, Metastore, Yarn) •plays well with traditional BI tools:
 exposes/interacts with industry-standard interfaces (odbc/ jdbc, Kerberos and LDAP, ANSI SQL) !12
  • 13. Copyright © 2013 Cloudera Inc. All rights reserved. Impala: A Modern, Open-Source SQL Engine •history: •developed by Cloudera and fully open-source; hosted on github •released as beta in 10/2012 •1.0 version available in 05/2013 •current version is 1.2.3, available for CDH4 and CDH5 beta !13
  • 14. Copyright © 2013 Cloudera Inc. All rights reserved. Impala from The User’s Perspective •create tables as virtual views over data stored in HDFS or Hbase;
 schema metadata is stored in Metastore (shared with Hive, Pig, etc.; basis of HCatalog) •connect via odbc/jdbc; authenticate via Kerberos or LDAP •run standard SQL: •current version: ANSI SQL-92 (limited to SELECT and bulk insert) minus correlated subqueries, has UDFs and UDAs !14
  • 15. Copyright © 2013 Cloudera Inc. All rights reserved. Impala from The User’s Perspective •2014 roadmap: •1.3: admission control, Order By without Limit, Decimal(<precision>, <scale>) •1.4: analytic window functions •2.0: support for nested types (structs, arrays, maps), UDTFs, disk-based joins and aggregation !15
  • 16. Copyright © 2013 Cloudera Inc. All rights reserved. Impala Architecture •distributed service: •daemon process (impalad) runs on every node with data •easily deployed with Cloudera Manager •each node can handle user requests; load balancer configuration for multi-user environments recommended •query execution phases: •client request arrives via odbc/jdbc •planner turns request into collection of plan fragments •coordinator initiates execution on remote impala’s !16
  • 17. Copyright © 2013 Cloudera Inc. All rights reserved. • Request arrives via odbc/jdbc Impala Query Execution !17
  • 18. Copyright © 2013 Cloudera Inc. All rights reserved. • Planner turns request into collection of plan fragments • Coordinator initiates execution on remote impalad nodes Impala Query Execution !18
  • 19. Copyright © 2013 Cloudera Inc. All rights reserved. • Intermediate results are streamed between impala’s • Query results are streamed back to client Impala Query Execution !19
  • 20. Copyright © 2013 Cloudera Inc. All rights reserved. Impala Architecture: Query Planning •2-phase process: •single-node plan: left-deep tree of query operators •partitioning into plan fragments for distributed parallel execution:
 maximize scan locality/minimize data movement, parallelize all query operators •cost-based join order optimization •cost-based join distribution optimization !20
  • 21. Copyright © 2013 Cloudera Inc. All rights reserved. Impala Architecture: Query Execution •execution engine designed for efficiency, written from scratch in C++; no reuse of decades-old open-source code •circumvents MapReduce completely •in-memory execution: •aggregation results and right-hand side inputs of joins are cached in memory •example: join with 1TB table, reference 2 of 200 cols, 10% of rows 
 -> need to cache 1GB across all nodes in cluster
 -> not a limitation for most workloads !21
  • 22. Copyright © 2013 Cloudera Inc. All rights reserved. Impala Architecture: Query Execution •runtime code generation: •uses llvm to jit-compile the runtime-intensive parts of a query •effect the same as custom-coding a query: •remove branches •propagate constants, offsets, pointers, etc. •inline function calls •optimized execution for modern CPUs (instruction pipelines) !22
  • 23. Copyright © 2013 Cloudera Inc. All rights reserved. Impala Architecture: Query Execution !23
  • 24. Copyright © 2013 Cloudera Inc. All rights reserved. Impala vs MR for Analytic Workloads •Impala vs. SQL-on-MR •Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”) •file formats: Parquet/ORCfile •TPC-DS, 3TB data set running on 5-node cluster !24
  • 25. Copyright © 2013 Cloudera Inc. All rights reserved. Impala vs MR for Analytic Workloads • Impala speedup: • interactive: 8-69x • report: 6-68x • deep analytics: 10-58x !25
  • 26. Copyright © 2013 Cloudera Inc. All rights reserved. Impala vs non-MR for Analytic Workloads •Impala 1.2.3/Presto 0.6/Shark •file formats: RCfile (+ Parquet) •TPC-DS, 15TB data set running on 21-node cluster !26
  • 27. Copyright © 2013 Cloudera Inc. All rights reserved. Impala vs non-MR for Analytic Workloads !27
  • 28. Copyright © 2013 Cloudera Inc. All rights reserved. Impala vs non-MR for Analytic Workloads !28 • Multi-user benchmark: • 10 users concurrently • same dataset, same hardware • workload: queries from “interactive” group
  • 29. Copyright © 2013 Cloudera Inc. All rights reserved. Impala vs non-MR for Analytic Workloads !29
  • 30. Copyright © 2013 Cloudera Inc. All rights reserved. Scalability in Hadoop •Hadoop’s promise of linear scalability: add more nodes to cluster, gain a proportional increase in capabilities
 -> adapt to any kind of workload changes simply by adding more nodes to cluster •scaling dimensions for EDW workloads: •response time •concurrency/query throughput •data size !30
  • 31. Copyright © 2013 Cloudera Inc. All rights reserved. Scalability in Hadoop •Scalability results for Impala: •tests show linear scaling along all 3 dimensions •setup: •2 clusters: 18 and 36 nodes •15TB TPC-DS data set •6 “interactive” TPC-DS queries !31
  • 32. Copyright © 2013 Cloudera Inc. All rights reserved. Impala Scalability: Latency !32
  • 33. Copyright © 2013 Cloudera Inc. All rights reserved. • Comparison: 10 vs 20 concurrent users Impala Scalability: Concurrency !33
  • 34. Copyright © 2013 Cloudera Inc. All rights reserved. Impala Scalability: Data Size • Comparison: 15TB vs. 30TB data set !34
  • 35. Copyright © 2013 Cloudera Inc. All rights reserved. Summary: Hadoop for Analytic Workloads •Thesis of this talk: •techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in Hadoop stack •Impala/Parquet/Hdfs is effective solution for certain EDW workloads •Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness !35
  • 36. Copyright © 2013 Cloudera Inc. All rights reserved. Summary: Hadoop for Analytic Workloads •latest technological innovations add capabilities that originated in high-end proprietary systems: •high-performance disk scans and memory caching in HDFS •Parquet: columnar storage for analytic workloads •Impala: high-performance parallel SQL execution !36
  • 37. Copyright © 2013 Cloudera Inc. All rights reserved. Summary: Hadoop for Analytic Workloads •Impala/Parquet/Hdfs for EDW workloads: •integrates into BI environment via standard connectivity and security •comparable or better performance than commercial competitors •currently still SQL limitations •but those are rapidly diminishing !37
  • 38. Copyright © 2013 Cloudera Inc. All rights reserved. Summary: Hadoop for Analytic Workloads •Impala/Parquet/Hdfs maintains traditional Hadoop strengths: •flexibility: Parquet is understood across the platform, natively processed by most popular frameworks •demonstrated scalability and cost effectiveness !38
  • 40. Copyright © 2013 Cloudera Inc. All rights reserved. Summary: Hadoop for Analytic Workloads •what the future holds: •further performance gains •more complete SQL capabilities •improved resource mgmt and ability to handle multiple concurrent workloads in a single cluster !40