HUAWEI TECHNOLOGIES CO., LTD.
Spark SQL on HBase
Spark Meet-up Tech Talks
Yan Zhou/Bing Xiao
March 25, 2015

About me
• Lead architect of Huawei Big Data Platform
• Apache Pig committer
• ex-Hadooper @ Yahoo!
• 15+ years of experience in the DB, OLAP, and distributed computing fields

About Huawei
• A Fortune Global 500 private company
• Annual growth of 15% (revenue up to $46 billion in 2014)
• More than $6.5 billion invested in R&D
• Transition from telecom-equipment manufacturer to a leader in ICT (information and communications technology)
• Big data and open source are part of the company-wide strategy

Agenda
• Spark SQL on HBase
  o Motivations
  o Data Model
  o SQL Semantics
  o Performance
  o Value Proposition
• Demo
• Roadmap
• Q/A

Motivations
• Driven by use cases in verticals, including telco
• Telco data is unique and complex
• Flexible data organization for various types of queries: range, ad hoc, interactive, DW
• Stay tuned: sessions are planned for future events

Telco data sources:
• CRM and billing (customer and billing data): well structured; TB
• Signaling data (session signaling data): generated by multiple devices; hundreds of TB to PB; real-time, business oriented
• MR/CHR-type data: semi-structured and nested; PB; location-centered
• xDR data: 1~10 PB each month
• Network raw data: unstructured; TB/sec; grows linearly with the business

Note: a typical network has ~30M subscribers and a data flow of around 1 TB/sec.

What is HBase?
• NoSQL key-value data store on Hadoop
• Follows the Google BigTable model
• Emerging platform for scale-out relational data stores on Hadoop:
  o Splice Machine
  o Trafodion (HP)
  o Apache Phoenix (Salesforce)
  o Kylin (eBay)
• M/R and API-based data access interfaces only

Existing HBase Access Path
Diagram components: Spark SQL, HiveContext, metastore, Spark Core, HadoopRDD, TableInput/OutputFormat, HBase
Features:
• Hadoop M/R plug-in
• Inflexible and hard to use
• Limited pushdown capabilities
• High latency

New Data Access Path
Diagram components: Spark SQL, Spark Core, metadata, HBase
Featuring:
• Fully distributed processing engine for scalability and fault tolerance
• Scala/Java/Python APIs
• Pluggable data source to Spark SQL through the Spark SQL API
• Systematic and powerful handling of pushdowns (key range, filters, coprocessor)
• More SQL capabilities made possible (primary key, UPDATE, INSERT INTO … VALUES, secondary index, Bloom filter, …)

Data Models
• Logical data model: the same as Spark SQL (relational, with the Spark SQL type system)
• Physical data model:
  o Composite primary keys are supported
  o The HBase rowkey is the byte representation of the composite primary key
  o Logical non-key columns are mapped onto <column family, column qualifier> pairs
  o Metadata is persisted in a special HBase table
  o Presplit tables are supported
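
The rowkey bullet implies that the byte encoding has to sort the same way as the logical key, otherwise key-range scans would not line up with SQL ranges. The slides do not give the project's actual byte format, so the following is only a minimal sketch under assumed rules: strings are UTF-8 with a 0x00 terminator, integers are 32-bit big-endian with the sign bit shifted. The function names are hypothetical.

```python
import struct


def encode_int(value: int) -> bytes:
    """Encode a signed 32-bit integer so that byte order matches numeric order."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("value does not fit in 32 bits")
    # Adding 2**31 maps the signed range onto 0..2**32-1, stored big-endian.
    return struct.pack(">I", value + (1 << 31))


def encode_str(value: str) -> bytes:
    """Encode a string with a 0x00 terminator so that prefixes sort first."""
    raw = value.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("NUL bytes are not supported in this sketch")
    return raw + b"\x00"


def encode_rowkey(*parts) -> bytes:
    """Concatenate the encoded key columns, in primary-key order."""
    out = bytearray()
    for part in parts:
        if isinstance(part, str):
            out += encode_str(part)
        elif isinstance(part, int) and not isinstance(part, bool):
            out += encode_int(part)
        else:
            raise TypeError(f"unsupported key column type: {type(part).__name__}")
    return bytes(out)


# Sorting by encoded bytes should agree with sorting by the logical key.
keys = [("Joe", 200), ("John", 5), ("Ashley", 999), ("Joe", -1), ("Jo", 7)]
by_bytes = sorted(keys, key=lambda k: encode_rowkey(*k))
```

With an encoding of this kind, a predicate on a leading prefix of the primary key translates directly into a contiguous rowkey range.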

SQL Semantics
• Based on Spark SQL syntax, plus …
• DDL:
  o CREATE TABLE table_name (col1 TYPE1, col2 TYPE2, …, PRIMARY KEY (col7, col1, col3)) MAPPED BY (hbase_tablename, COLS=[col2=cf1.cq11, col4=cf1.cq12, col5=cf2.cq21, col6=cf2.cq22])
  o ALTER TABLE table_name ADD/DROP column …
• DML:
  o INSERT INTO … VALUES …
• Bulk loading:
  o LOAD DATA [PARALLEL] INPATH filePath [OVERWRITE] INTO TABLE tableName [FIELDS TERMINATED BY char]
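
The CREATE TABLE form above splits columns into two groups: primary-key columns, which end up in the rowkey, and non-key columns, each of which needs a column-family/qualifier mapping. As an illustration only, the hypothetical helper below renders a statement in the syntax shown on the slide and rejects a non-key column that has no mapping. The table name, type names, and `render_create_table` itself are assumptions, not part of the project.

```python
def render_create_table(table, columns, primary_key, hbase_table, mapping):
    """Render a CREATE TABLE ... MAPPED BY statement in the slide's syntax.

    columns:     list of (name, type) pairs, in declaration order
    primary_key: list of column names forming the composite key
    mapping:     dict of non-key column name -> (column family, qualifier)
    """
    names = [name for name, _ in columns]
    unknown = [k for k in primary_key if k not in names]
    if unknown:
        raise ValueError(f"primary key uses undeclared columns: {unknown}")

    # Every column outside the primary key must map to a family/qualifier.
    non_key = [name for name in names if name not in primary_key]
    unmapped = [name for name in non_key if name not in mapping]
    if unmapped:
        raise ValueError(f"non-key columns without a mapping: {unmapped}")

    col_defs = ", ".join(f"{name} {type_}" for name, type_ in columns)
    pk = ", ".join(primary_key)
    cols = ", ".join(
        f"{name}={mapping[name][0]}.{mapping[name][1]}" for name in non_key
    )
    return (
        f"CREATE TABLE {table} ({col_defs}, PRIMARY KEY ({pk})) "
        f"MAPPED BY ({hbase_table}, COLS=[{cols}])"
    )


ddl = render_create_table(
    "sales",
    [("customer", "STRING"), ("itemid", "INT"), ("amount", "INT")],
    ["customer", "itemid"],
    "sales_hbase",
    {"amount": ("cf1", "amount")},
)
```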

Query Optimization Approach
• Precise partition pruning and partition-specific multidimensional predicate pushdowns, based on partial evaluation of the query's Boolean filter expressions
• E.g., a sales table with <customer, itemid> as a 2-dimensional primary key:
  SELECT * FROM sales WHERE ((customer = 'Joe' AND itemid > 300 AND amount < 30) OR (customer = 'John' AND itemid < 100) AND amount > 200)
  The existing partitions/regions are:
  1. (, 'Ashley')
  2. ['Ashley', 'Iris')
  3. [('Joe', 10), ('Joe', 100))
  4. [('Joe', 200), ('Joe', 1000)) => itemid > 300 AND amount < 30
  5. ['John', 'York') => customer = 'John' AND itemid < 100 AND amount > 200
  6. ['York', )
  Partial evaluation against each region's key range prunes regions 1, 2, 3, and 6 and reduces the filter to the predicates shown for regions 4 and 5. The slide marks parts of these reduced predicates as "for scan range" and "for filtering".
• The algorithms are generic and also apply to other organized data sets, such as hash-partitioned Hive tables
• Suitable for interactive ad hoc queries
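
The pruning half of the sales example can be reproduced with plain interval arithmetic. The sketch below is a simplification, not the project's algorithm: it turns each OR branch into a key range by hand and tests it for overlap with each region, rather than partially evaluating the expression tree. It assumes itemid is an integer; the names `REGIONS`, `SCANS`, `overlaps`, and `prune` are hypothetical.

```python
NEG_INF, POS_INF = float("-inf"), float("inf")

# Regions as half-open [start, end) ranges over (customer, itemid) keys.
# None marks an unbounded side. A shorter tuple sorts before any longer
# tuple it is a prefix of, which mirrors rowkey prefix ordering.
REGIONS = {
    1: (None, ("Ashley",)),
    2: (("Ashley",), ("Iris",)),
    3: (("Joe", 10), ("Joe", 100)),
    4: (("Joe", 200), ("Joe", 1000)),
    5: (("John",), ("York",)),
    6: (("York",), None),
}

# Key ranges implied by the key-column conditions of each OR branch.
SCANS = {
    # customer = 'Joe' AND itemid > 300
    "joe_branch": (("Joe", 301), ("Joe", POS_INF)),
    # customer = 'John' AND itemid < 100
    "john_branch": (("John", NEG_INF), ("John", 100)),
}


def overlaps(region, scan):
    """Return True if the half-open region and scan ranges intersect."""
    r_lo, r_hi = region
    s_lo, s_hi = scan
    lo = s_lo if r_lo is None else max(r_lo, s_lo)
    hi = s_hi if r_hi is None else min(r_hi, s_hi)
    return lo < hi


def prune(regions, scans):
    """Map each surviving region to the OR branches that can match in it."""
    kept = {}
    for region_id, region in regions.items():
        names = [name for name, scan in scans.items() if overlaps(region, scan)]
        if names:
            kept[region_id] = names
    return kept


surviving = prune(REGIONS, SCANS)
```

Only regions 4 and 5 survive, each with a single branch, which matches the two reduced predicates on the slide. The non-key condition on amount plays no part in pruning and would be applied as a filter on the rows scanned.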

Query Performance
• Queries (TPC-DS, 10M records):

Query type | SQL query | Spark SQL on HBase (seconds) | Phoenix (seconds)
1-key-range | select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000); | 0.18 | 0.03
2-key-range | select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000); | 0.22 | 4.29
3-key-range | select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000) or (ss_item_sk = 28000 and ss_ticket_number <= 10000); | 0.27 | 4.44
Aggregate on the secondary key | select count(1) from store_sales group by ss_ticket_number; | 37 | 79

• Cluster:
  o 1 master + 6 slaves with 48 GB/node
  o Xeon 2.4 GHz, 16 cores
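
To read the timings as relative speed, the ratio of Phoenix time to Spark SQL on HBase time can be computed directly from the figures in the table. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# (Spark SQL on HBase seconds, Phoenix seconds), as reported on the slide.
timings = {
    "1-key-range": (0.18, 0.03),
    "2-key-range": (0.22, 4.29),
    "3-key-range": (0.27, 4.44),
    "aggregate on secondary key": (37.0, 79.0),
}

# A ratio above 1 means Spark SQL on HBase was faster; below 1, Phoenix was.
ratios = {
    name: phoenix / spark_sql for name, (spark_sql, phoenix) in timings.items()
}

for name, ratio in ratios.items():
    print(f"{name}: Phoenix took {ratio:.2f}x the Spark SQL on HBase time")
```

On these figures Phoenix is about 6 times faster on the single key range, while Spark SQL on HBase is roughly 19.5 and 16.4 times faster on the two- and three-range queries and about 2.1 times faster on the aggregate.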

Query Performance
• Query performance (TPC-DS, 10M records):
[Bar chart, in seconds. Spark SQL on HBase: 0.18 (1-key-range), 0.22 (2-key-range), 0.27 (3-key-range), 37 (aggregate on secondary key). Phoenix: 0.03, 4.29, 4.44, 79.]

Bulk Load Optimization
• Performance optimization for bulk loading of tabular data:
  o Late materialization of KeyValue cells => reduction of shuffle data volume
  o Removal of sorting by reducers => lightweight reducers => more scalable
  o Best effort to colocate reducers with the region servers
• Optional parallel incremental loading after M/R in the bulk loader
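
Colocating reducers with region servers presupposes that each row is routed to the reducer responsible for the region that will hold it. The slides do not describe how the loader does this, so the sketch below shows only one plausible reading: a range partitioner over the table's split keys, implemented with binary search. The split keys and the `region_index` name are hypothetical.

```python
from bisect import bisect_right


def region_index(split_keys, rowkey):
    """Return the index of the region whose [start, end) range holds rowkey.

    split_keys must be sorted; N split keys define N + 1 regions, where
    region 0 is unbounded below and region N is unbounded above.
    """
    return bisect_right(split_keys, rowkey)


# Three split keys give four regions: (, g), [g, n), [n, t), [t, ).
splits = [b"g", b"n", b"t"]
rows = [b"a", b"g", b"m", b"n", b"zz"]
assignment = [region_index(splits, key) for key in rows]
```

Because a region's start key is inclusive, `bisect_right` is the appropriate variant: a rowkey equal to a split key belongs to the region that begins at that key.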

Bulk Load Performance
• Loading performance (TPC-DS, 10M records):
[Bar chart, in seconds. Spark SQL on HBase: 557 (load, no presplit), 185 (load, 6 presplit regions). Phoenix: 1093 (load, no presplit), 762 (load, 6 presplit regions).]
• Cluster:
  o 1 master + 6 slaves with 48 GB/node
  o Xeon 2.4 GHz, 16 cores

Value Proposition
• Combined capabilities of Spark, Spark SQL, and HBase
  o Spark DataFrame supported
• More traditional DBMS capabilities made possible on HBase
• Basis for building a high-performance, highly concurrent distributed big data SQL system
• Optimized bulk loader for tabular data sets
• Performance excellence

Project Info
• Source repo: https://github.com/Huawei-Spark/hbase/
• Emails: yan.zhou.sc@huawei.com, bing.xiao@huawei.com
• Team members: Bo Meng, Xinyun Huang, Wang Fei, Stanley Burnitt, Shijun (Ken) Ma, Jacky Li, Stephen Boesch, plus our Big Data teams in India and Hangzhou, China
• Join us …
  o Comments, tryouts, and contributions are more than welcome
  o Open to joint development in next-phase project(s)
  o We are hiring: Big Data engineers/Spark fans

Demo
• Environment
  o Hardware: 9-node (1 master + 8 slaves) blades
  o Software: Linux Enterprise Server 11.1
  o Data set: TPC-DS, 10M records
• Indexed-range query vs. full-scan query, using DataFrame
• Join between an HBase table and an in-memory side table, using DataFrame
• Partial evaluation
  o Schema = (id: Int, age: Int)
  o The row to be partially evaluated on is: ([1,5), null)
  o Predicate 1: (id < 1) OR (age > 30) => (age > 30)
  o Predicate 2: (id < 6) OR (age > 30) => True
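
The two predicates above can be reproduced with a small three-valued evaluator: a comparison on a column with a known range folds to True or False when the range decides it, and is otherwise kept as a residual. This is a minimal sketch of the idea rather than the project's implementation. The tuple-based expression format and the `partial_eval` name are assumptions, and ranges are half-open [lo, hi) as in the slide.

```python
def partial_eval(expr, ranges):
    """Partially evaluate expr against per-column ranges.

    expr is ("lt" | "gt", column, constant) or ("and" | "or", left, right).
    ranges maps a column to a half-open (lo, hi) range, or to None if unknown.
    Returns True, False, or the residual expression still to be checked.
    """
    op = expr[0]
    if op in ("lt", "gt"):
        _, column, constant = expr
        bounds = ranges.get(column)
        if bounds is None:
            return expr  # nothing known about this column: keep the test
        lo, hi = bounds
        if op == "lt":
            if hi <= constant:
                return True   # every value is below hi, hence below constant
            if lo >= constant:
                return False  # every value is at least lo, hence not below
        else:
            if lo > constant:
                return True
            if hi <= constant:
                return False  # conservative: ignores integer-only tightening
        return expr

    left = partial_eval(expr[1], ranges)
    right = partial_eval(expr[2], ranges)
    if op == "or":
        if left is True or right is True:
            return True
        if left is False:
            return right
        if right is False:
            return left
        return ("or", left, right)
    if op == "and":
        if left is False or right is False:
            return False
        if left is True:
            return right
        if right is True:
            return left
        return ("and", left, right)
    raise ValueError(f"unknown operator: {op}")


# The row from the slide: id in [1, 5), age unknown.
row = {"id": (1, 5), "age": None}
predicate_1 = ("or", ("lt", "id", 1), ("gt", "age", 30))
predicate_2 = ("or", ("lt", "id", 6), ("gt", "age", 30))
result_1 = partial_eval(predicate_1, row)
result_2 = partial_eval(predicate_2, row)
```

A result of False lets a partition be skipped, True means no filter is needed for it, and a residual is what remains to be pushed down for that partition.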

Future Plan
• Targeting Spark 1.4
• Coprocessor/custom filter
• Filters/partitions from CAST values in predicates
• Latency reduction: SPARK-3306 for external resource management
• Optimizations of sorting/aggregation/join on primary keys
• Support for salting, timestamps, dynamic columns, nested data types, …
• Views and materialized views

Huawei's Long-Term Commitment to Spark & Ecosystem
• Earliest corporate sponsor of AMPLab and its projects, including Spark
• One of the leading contributors to Spark: 10 and 11 contributors in the Spark 1.2 and 1.3 releases
• Highlighted contributions of new features: Power Iteration Clustering (the first use of GraphX routines within MLlib), ORC file support, FP-Growth
• Bringing Spark and apps on top of Spark to leading telcos globally, as Spark is a cornerstone of Huawei's big data vertical solution

Huawei Planned Spark Roadmap
Spark SQL & Core
1H 2015
• Co-Processor, optimization
• Spark SQL on HBase
• OrcFiles Support
• Vectorized Processing
ML & Streaming
2H 2015
• Nested Data
Spark on Yarn
• Power Iteration Clustering
• PAM K-Medoids
Streaming Analysis Algorithms
New requirements from Spark implementation
• Materialized View
• Spark R
GraphX in Telco Data Modeling
• SQL99’2003 Compliance features

Acknowledgements
• Reynold Xin and Michael Armbrust at Databricks reviewed the Spark SQL on HBase design document and code, provided feedback, and helped improve the design
• Xiangrui Meng at Databricks provided guidance and reviewed and modified the code for the MLlib Power Iteration Clustering algorithm (Spark 1.3 release) and the FP-Growth algorithm (Spark 1.3 release)
• The Huawei Big Data teams in India and Hangzhou, China provided performance testing/tuning and participated in code development

Phoenix Architecture and Data Access Path
Diagram components: Phoenix as HBase client; three Phoenix coprocessors

HBase vs. Cassandra vs. RDBMS

Property | HBase | Cassandra | RDBMS
Special nodes | Master | Seed | Coordinator
Synchronization mechanism | Zookeeper | Gossip protocol | …
CAP properties | CP | AP | CA
Data access | Shell, REST, Java/Thrift API | CQL, Shell, Thrift | SQL/JDBC/ODBC
Data size | PBs | PBs | TBs
Coprocessor/in-DB processing | Yes | No | Yes
Origins | Google BigTable | Amazon Dynamo + Google BigTable | IBM System R
Native to Hadoop (inclusion in Hadoop distributions, Hadoop/HDFS-specific optimizations, …) | Yes | No | No
Index organization | Single row index | Single row index | On any columns
Dominant vendor backing/lock-in | No | Datastax | Oracle, MS, IBM
Popular use scenarios | Range queries; consistency; fast reads: Facebook Messenger | Geographically distributed clusters; large deployments: Twitter | Transactions, DW/DM

Spark SQL on HBase: Architecture and Data Access Paths
Diagram components: Spark SQL, Spark Master, HBase Master, two Spark Slaves, two HBase Region Servers, Zookeeper Quorum

Contenu connexe

Tendances

Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
Kristian Alexander
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 

Tendances (20)

Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Spark etl
Spark etlSpark etl
Spark etl
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code Examples
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & Catalyst
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Introduction to Spark SQL training workshop
Introduction to Spark SQL training workshopIntroduction to Spark SQL training workshop
Introduction to Spark SQL training workshop
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 

En vedette

Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
vithakur
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
Alfredo Abate
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 

En vedette (20)

Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Hands on lab Elasticsearch
Hands on lab ElasticsearchHands on lab Elasticsearch
Hands on lab Elasticsearch
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 
The SparkSQL things you maybe confuse
The SparkSQL things you maybe confuseThe SparkSQL things you maybe confuse
The SparkSQL things you maybe confuse
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to Production
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandApache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to Understand
 

Similaire à Spark meetup v2.0.5

Chandan's_Resume
Chandan's_ResumeChandan's_Resume
Chandan's_Resume
Chandan Das
 
Rajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developer
Rajeev Kumar
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Kognitio
 

Similaire à Spark meetup v2.0.5 (20)

Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
 
Chandan's_Resume
Chandan's_ResumeChandan's_Resume
Chandan's_Resume
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by Example
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Trafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoopTrafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoop
 
Arindam Sengupta _ Resume
Arindam Sengupta _ ResumeArindam Sengupta _ Resume
Arindam Sengupta _ Resume
 
Rajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developer
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop Solution
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
 

Dernier

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Dernier (20)

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 

Spark meetup v2.0.5

  • 1. HUAWEI TECHNOLOGIES CO., LTD. Spark SQL on HBASE Spark Meet-up Tech Talks Yan Zhou/Bing Xiao March 25, 2015
  • 2. HUAWEI TECHNOLOGIES Co., Ltd.  Lead architect of Huawei Big Data Platform  Apache Pig committer  ex-Hadooper @ Yahoo!  15+ years of experience in DB, OLAP, distributed computing fields. About me Page ‹2›
  • 3. HUAWEI TECHNOLOGIES Co., Ltd.  A Fortune Global 500, Private Company  Annual growth at 15% (revenue up to $46 Billion in 2014)  More than $6.5 Billion investment on R&D  Transition from telecom-equipment manufacturer to a leader of ICT (information and communications technology)  Big data and open source are part of company-wide strategies About Huawei Page ‹3›
  • 4. HUAWEI TECHNOLOGIES Co., Ltd.  Spark SQL on HBase o Motivations o Data Model o SQL Semantics o Performance o Value Proposition  Demo  Roadmap  Q/A Agenda Page ‹4›
  • 5. HUAWEI TECHNOLOGIES Co., Ltd.  Driven by use cases in verticals including telco  Telco Data is unique & complex  Flexible data organization for various types of queries: range, ad-hoc, interactive, DW  Stay in tune: planned sessions in future events Customer & Billing Data • Well Structured • TB Session Signaling Data: • Multi-Device generated • Hundreds TB-PB • Real-Time Biz Oriented MR/CHR Data: • Semi-structured & Nested • PB • Location-centered Data xDR Data: • 1~10PB each month Network Raw Data • Unstructured • TB/Sec • Linear Growth with Biz CRM Billing Signaling Data MR/CHR Type Data xDR Data Network Raw Data Note:In a typical network of ~30M subscribers and the data flow around 1TB/Sec. Page ‹5› Motivations
  • 6. What is HBase? NoSQL key-value data store on Hadoop; follows the Google BigTable model; emerging platform for scale-out relational data stores on Hadoop: Splice Machine, Trafodion (HP), Apache Phoenix (Salesforce), Kylin (eBay); M/R and API-based data access interfaces only.
  • 7. Existing HBase Access Path (diagram labels: Spark SQL, HiveContext, metastore, Spark Core, HadoopRDD, TableInput/OutputFormat, HBase). Features: Hadoop M/R plug-in; inflexible and hard to use; limited pushdown capabilities; high latency.
  • 8. New Data Access Path (diagram labels: Spark SQL, Spark Core, HBase, metadata). Featuring: fully distributed processing engine for scalability and fault tolerance; Scala/Java/Python APIs; pluggable data source to Spark SQL through the Spark SQL API; systematic and powerful handling of pushdowns (key range, filters, coprocessor); more SQL capabilities made possible (primary key, UPDATE, INSERT INTO … VALUES, secondary index, Bloom filter, …).
  • 9. Data Models: The logical data model is the same as Spark SQL's: relational, with the same type system. Physical data model: composite primary keys are supported; the HBase rowkey is the byte representation of the composite primary key; logical non-key columns are mapped onto <column family, column qualifier>; metadata is persisted in a special HBase table; presplit tables are supported.
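A minimal sketch of how a composite primary key could be turned into a single byte-ordered rowkey, as the physical data model on slide 9 describes. The slides do not specify the actual encoding; the scheme below (NUL-terminated UTF-8 strings, sign-flipped big-endian 32-bit integers) and the names `encode_int32`, `encode_string` and `encode_rowkey` are assumptions, chosen only so that byte order matches logical key order.

```python
import struct


def encode_int32(value: int) -> bytes:
    """Encode a signed 32-bit integer so unsigned byte order equals numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in 32 bits")
    # Adding 2**31 flips the sign bit, mapping negatives below positives.
    return struct.pack(">I", value + 2**31)


def encode_string(value: str) -> bytes:
    """Encode a string as UTF-8 plus a NUL terminator (assumes no NUL inside)."""
    raw = value.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("NUL bytes are not supported in this sketch")
    return raw + b"\x00"


def encode_rowkey(key: tuple) -> bytes:
    """Concatenate the encoded key columns in primary-key order."""
    parts = []
    for field in key:
        if isinstance(field, bool):
            raise TypeError("bool columns are not covered by this sketch")
        if isinstance(field, int):
            parts.append(encode_int32(field))
        elif isinstance(field, str):
            parts.append(encode_string(field))
        else:
            raise TypeError(f"unsupported key column type: {type(field).__name__}")
    return b"".join(parts)


# Hypothetical <customer, itemid> keys.
keys = [("John", 5), ("Joe", 300), ("Joe", -1), ("Jo", 7), ("Joe", 10)]
by_bytes = sorted(keys, key=encode_rowkey)
```

Sorting by the encoded bytes gives the same order as sorting by the logical key, which is the property that lets key-range predicates be translated into HBase scan ranges.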
  • 10. SQL Semantics: Based on Spark SQL syntax, plus the following. DDL: CREATE TABLE table_name (col1 TYPE1, col2 TYPE2, …, PRIMARY KEY (col7, col1, col3)) MAPPED BY (hbase_tablename, COLS=[col2=cf1.cq11, col4=cf1.cq12, col5=cf2.cq21, col6=cf2.cq22]); ALTER TABLE table_name ADD/DROP column …. DML: INSERT INTO … VALUES …. Bulk loading: LOAD DATA [PARALLEL] INPATH filePath [OVERWRITE] INTO TABLE tableName [FIELDS TERMINATED BY char].
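To make the CREATE TABLE … MAPPED BY template on slide 10 concrete, here is a small sketch that renders such a statement for a hypothetical `sales` table. The helper `create_table_ddl`, the table and column names, the `STRING`/`INTEGER` type names and the `cf1.amt` mapping are all illustrative assumptions; only the statement shape comes from the slide.

```python
def create_table_ddl(table, columns, primary_key, hbase_table, mapping):
    """Render a CREATE TABLE ... MAPPED BY statement following the slide's template.

    columns:     list of (column_name, type_name) in declaration order
    primary_key: list of key column names, in rowkey order
    mapping:     dict of non-key column -> (column_family, column_qualifier)
    """
    names = [name for name, _ in columns]
    missing = [c for c in primary_key if c not in names]
    if missing:
        raise ValueError(f"primary key columns not declared: {missing}")
    # Every non-key column needs a <column family, column qualifier> target.
    unmapped = [n for n in names if n not in primary_key and n not in mapping]
    if unmapped:
        raise ValueError(f"non-key columns without a mapping: {unmapped}")

    col_defs = ", ".join(f"{name} {type_name}" for name, type_name in columns)
    key_list = ", ".join(primary_key)
    col_map = ", ".join(f"{name}={cf}.{cq}" for name, (cf, cq) in mapping.items())
    return (
        f"CREATE TABLE {table} ({col_defs}, PRIMARY KEY ({key_list})) "
        f"MAPPED BY ({hbase_table}, COLS=[{col_map}])"
    )


ddl = create_table_ddl(
    table="sales",
    columns=[("customer", "STRING"), ("itemid", "INTEGER"), ("amount", "INTEGER")],
    primary_key=["customer", "itemid"],
    hbase_table="sales_hbase",
    mapping={"amount": ("cf1", "amt")},
)
```

The key columns appear only in the PRIMARY KEY clause because they live in the rowkey; only non-key columns carry a column family and qualifier.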
  • 11. Query Optimization Approach: Precise partition pruning and partition-specific multidimensional predicate pushdowns, based on partial evaluation of the filter's boolean expression for each query. E.g., a sales table with <customer, itemid> as a 2-dimensional primary key: SELECT * FROM sales WHERE ((customer='Joe' AND itemid > 300 AND amount < 30) OR (customer='John' AND itemid < 100) AND amount > 200). The existing partitions/regions are: 1. (, 'Ashley'); 2. ['Ashley', 'Iris'); 3. [('Joe', 10), ('Joe', 100)); 4. [('Joe', 200), ('Joe', 1000)) => itemid > 300 AND amount < 30; 5. ['John', 'York') => customer='John' AND itemid < 100 AND amount > 200; 6. ['York', ). Regions 1, 2, 3 and 6 cannot satisfy the predicate and are pruned. In the per-region predicates, the key conditions determine the scan range and the remaining conditions are used for filtering. The algorithms are generic and also apply to other organized data sets, such as hash-partitioned Hive tables. Suitable for interactive ad hoc queries.
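The pruning step in the sales example on slide 11 can be sketched as follows. This is a simplification, not the project's algorithm: it assumes the key predicates have already been reduced to a union of composite-key ranges (with integer `itemid`, so `itemid > 300` becomes a lower bound of 301) and shows only how regions that overlap none of those ranges are dropped. `REGIONS`, `QUERY_RANGES`, `overlaps` and `prune` are hypothetical names.

```python
import math

# Regions as half-open [start, end) ranges over (customer, itemid) keys.
# None marks an unbounded side; a shorter tuple acts as a key prefix.
REGIONS = {
    1: (None, ("Ashley",)),
    2: (("Ashley",), ("Iris",)),
    3: (("Joe", 10), ("Joe", 100)),
    4: (("Joe", 200), ("Joe", 1000)),
    5: (("John",), ("York",)),
    6: (("York",), None),
}

# Key ranges implied by the two OR branches of the WHERE clause:
#   customer='Joe'  AND itemid > 300  ->  [('Joe', 301), ('Joe', +inf))
#   customer='John' AND itemid < 100  ->  [('John',), ('John', 100))
QUERY_RANGES = [
    (("Joe", 301), ("Joe", math.inf)),
    (("John",), ("John", 100)),
]


def overlaps(region, key_range):
    """Return True if two half-open key ranges share at least one key."""
    r_lo, r_hi = region
    q_lo, q_hi = key_range
    query_starts_before_region_ends = r_hi is None or q_lo is None or q_lo < r_hi
    region_starts_before_query_ends = r_lo is None or q_hi is None or r_lo < q_hi
    return query_starts_before_region_ends and region_starts_before_query_ends


def prune(regions, key_ranges):
    """Keep only the region ids that overlap at least one query key range."""
    return [
        region_id
        for region_id, bounds in regions.items()
        if any(overlaps(bounds, key_range) for key_range in key_ranges)
    ]


surviving = prune(REGIONS, QUERY_RANGES)
```

With these inputs only regions 4 and 5 survive, matching the two regions for which the slide gives a reduced predicate. The non-key condition on `amount` plays no part in pruning; it would be applied as a filter inside the surviving regions.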
  • 12. Query Performance: Queries (TPC-DS, 10M records); times in seconds.
      o 1-key-range (Spark SQL on HBase: 0.18, Phoenix: 0.03): select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000);
      o 2-key-range (Spark SQL on HBase: 0.22, Phoenix: 4.29): select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000);
      o 3-key-range (Spark SQL on HBase: 0.27, Phoenix: 4.44): select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000) or (ss_item_sk = 28000 and ss_ticket_number <= 10000);
      o Aggregate on the secondary key (Spark SQL on HBase: 37, Phoenix: 79): select count(1) from store_sales group by ss_ticket_number;
      o Cluster: 1 master + 6 slaves with 48 GB/node; Xeon 2.4 GHz, 16 cores
  • 13. Query Performance (bar chart; TPC-DS, 10M records; seconds, Spark SQL on HBase vs. Phoenix): 1-key-range: 0.18 vs. 0.03; 2-key-range: 0.22 vs. 4.29; 3-key-range: 0.27 vs. 4.44; aggregate on secondary key: 37 vs. 79.
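As a quick reading aid for the query timings on slides 12 and 13, the sketch below computes the Phoenix-to-Spark-SQL-on-HBase time ratio for each query. The timings are the ones reported on the slides; the names `timings` and `ratios` are illustrative.

```python
# (Spark SQL on HBase seconds, Phoenix seconds), as reported on the slides.
timings = {
    "1-key-range": (0.18, 0.03),
    "2-key-range": (0.22, 4.29),
    "3-key-range": (0.27, 4.44),
    "aggregate on secondary key": (37.0, 79.0),
}

# Ratio > 1 means Spark SQL on HBase was faster; ratio < 1 means Phoenix was faster.
ratios = {
    query: phoenix_seconds / spark_seconds
    for query, (spark_seconds, phoenix_seconds) in timings.items()
}

for query, ratio in ratios.items():
    print(f"{query}: Phoenix time / Spark SQL on HBase time = {ratio:.2f}")
```

The ratio is below 1 only for the single-key-range query, where Phoenix is faster; for the multi-range and aggregate queries the reported Spark SQL on HBase times are lower.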
  • 14. Bulk Load Optimization: Performance optimization for bulk loading of tabular data: late materialization of KeyValue cells, which reduces shuffle data volume; removal of sorting by reducers, which gives lightweight, more scalable reducers; best-effort colocation of reducers with the region servers. Optional parallel incremental loading after M/R in the bulk loader.
  • 15. Bulk Load Performance (bar chart; TPC-DS, 10M records; seconds, Spark SQL on HBase vs. Phoenix): Load (no presplit): 557 vs. 1093; Load (6 presplit regions): 185 vs. 762. Cluster: 1 master + 6 slaves with 48 GB/node; Xeon 2.4 GHz, 16 cores.
  • 16. Value Proposition: Combined capabilities of Spark, Spark SQL and HBase (Spark DataFrame supported); more traditional DBMS capabilities made possible on HBase; a basis for building a high-performance, highly concurrent distributed big data SQL system; optimized bulk loader for tabular data sets; performance excellence.
  • 17. Project Info: Source repo: https://github.com/Huawei-Spark/hbase/ Emails: yan.zhou.sc@huawei.com, bing.xiao@huawei.com. Team members: Bo Meng, Xinyun Huang, Wang Fei, Stanley Burnitt, Shijun (Ken) Ma, Jacky Li, Stephen Boesch, plus our Big Data teams in India and Hangzhou, China. Join us: comments, tryouts, and contributions are more than welcome; open to joint development in next-phase project(s); we are hiring: Big Data Engineers/Spark fans.
  • 18. Demo: Environment: hardware: 9-node (1 master + 8 slaves) blades; software: Linux Enterprise Server 11.1; data set: TPC-DS 10M records. Indexed-range query vs. full-scan query using the DataFrame API. Join between an HBase table and an in-memory side table using the DataFrame API. Partial evaluation: schema = (id: Int, age: Int); the row to be partially evaluated on is ([1,5), null); Predicate 1: (id < 1) OR (age > 30) => (age > 30); Predicate 2: (id < 6) OR (age > 30) => True.
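The partial-evaluation example on slide 18 can be reproduced with a small sketch. It is an illustration under stated assumptions, not the project's implementation: predicates are nested tuples, each column of the "row" is either a half-open integer range or `None` (unknown), and the function returns `True`, `False`, or the residual predicate that still has to be checked. The tuple format and the name `partial_eval` are hypothetical.

```python
def partial_eval(expr, row):
    """Partially evaluate a predicate against per-column integer ranges.

    expr: ("lt" | "gt", column, value) or ("and" | "or", left, right)
    row:  dict of column -> (low, high) half-open range, or None if unknown
    Returns True, False, or a residual expression.
    """
    op = expr[0]
    if op in ("lt", "gt"):
        _, column, value = expr
        bounds = row.get(column)
        if bounds is None:
            return expr  # nothing known about this column
        low, high = bounds  # column takes integer values in [low, high)
        if op == "lt":
            if high - 1 < value:
                return True   # every possible value is below the constant
            if low >= value:
                return False  # no possible value is below the constant
        else:
            if low > value:
                return True
            if high - 1 <= value:
                return False
        return expr  # the range straddles the constant

    left = partial_eval(expr[1], row)
    right = partial_eval(expr[2], row)
    absorbing = op == "or"  # True absorbs OR, False absorbs AND
    if left is absorbing or right is absorbing:
        return absorbing
    if left is (not absorbing):
        return right  # neutral element on the left: result is the right side
    if right is (not absorbing):
        return left
    return (op, left, right)


# The slide's example: id is known to lie in [1, 5), age is unknown.
row = {"id": (1, 5), "age": None}
predicate_1 = ("or", ("lt", "id", 1), ("gt", "age", 30))
predicate_2 = ("or", ("lt", "id", 6), ("gt", "age", 30))
result_1 = partial_eval(predicate_1, row)
result_2 = partial_eval(predicate_2, row)
```

Here `result_1` is the residual `("gt", "age", 30)` and `result_2` is `True`, matching the two reductions on the slide. A result of `False` for a region's key range would mean the region can be pruned.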
  • 19. Future Plan: Targeting Spark 1.4; coprocessor/custom filter; filters/partitions from CAST values in predicates; latency reduction: SPARK-3306 for external resource management; optimizations of sorting/aggregation/join on primary keys; support of salting, timestamps, dynamic columns, nested data types, …; views and materialized views.
  • 20. Huawei's Long Term Commitment to Spark & Ecosystem: Earliest corporate sponsor of AMPLab and its projects, including Spark; one of the leading contributors to Spark: 10 and 11 contributors in the Spark 1.2 and 1.3 releases; highlighted contributions of new features: Power Iteration Clustering (the first use of GraphX routines within MLlib), ORC file support, FP-Growth; bringing Spark and apps on top of Spark to leading telcos globally, as Spark is the cornerstone of Huawei's big data vertical solution.
  • 21. Huawei Planned Spark Roadmap (diagram labels, in slide order): Spark SQL & Core; 1H 2015; Co-Processor, optimization; Spark SQL on HBase; ORC files support; Vectorized Processing; ML & Streaming; 2H 2015; Nested Data Spark on Yarn; Power Iteration Clustering; PAM K-Medoids Streaming Analysis Algorithms; New requirements from Spark implementation; Materialized View; Spark R GraphX in Telco Data Modeling; SQL99’2003 compliance features.
  • 22. Acknowledgements: Reynold Xin and Michael Armbrust at Databricks reviewed the Spark SQL on HBase design document and code, provided feedback, and helped improve the design. Xiangrui Meng at Databricks provided guidance and reviewed and modified the code of the MLlib Power Iteration Clustering and FP-Growth algorithms (both in the Spark 1.3 release). The Huawei Big Data teams in India and Hangzhou, China carried out performance testing/tuning and participated in code development.
  • 24. Phoenix Architecture and Data Access Path (diagram labels): Phoenix as HBase Client; Phoenix Coprocessor (three instances).
  • 25. HBase vs. Cassandra vs. RDBMS (values listed as HBase | Cassandra | RDBMS):
      o Special nodes: Master | Seed | Coordinator
      o Synchronization mechanism: Zookeeper | Gossip protocol | …
      o CAP properties: CP | AP | CA
      o Data access: Shell, REST, Java/Thrift API | CQL, Shell, Thrift | SQL/JDBC/ODBC
      o Data size: PBs | PBs | TBs
      o Coprocessor/in-DB processing: Yes | No | Yes
      o Origins: Google BigTable | Amazon Dynamo + Google BigTable | IBM System R
      o Native to Hadoop (inclusion in Hadoop distributions, Hadoop/HDFS-specific optimizations, …): Yes | No | No
      o Index organization: Single row index | Single row index | On any columns
      o Dominant vendor backing/lock-in: No | Datastax | Oracle, MS, IBM
      o Popular use scenarios: Range queries; consistency; fast reads: Facebook Messenger | Geographically distributed cluster; large deployments: Twitter | Transactions, DW/DM
  • 26. Spark SQL on HBase: Architecture and Data Access Paths (diagram labels): Spark SQL; Spark Master; HBase Master; Spark Slave (two instances); HBase Region Server (two instances); Zookeeper Quorum.

Editor's notes

  1. owned by employees
  2. e.g. signalling data: around 30 TB each day; at least 3 months needed
  3. Kiji, Hydrabase (Facebook)