SlideShare une entreprise Scribd logo
1  sur  32
Télécharger pour lire hors ligne
Daoyuan Wang (Intel)
Yuanjian Li (Baidu)
OAP: Optimized Analytics
Package for Spark Platform
Notice and Disclaimers:
• Intel, the Intel logo are trademarks of IntelCorporation in the U.S. and/or other countries. *Othernames and brandsmay be
claimed as the property of others.
See Trademarkson intel.com for fulllist of Intel trademarks.
• Optimization Notice:
Intel's compilers may or may not optimize to the same degree for non-Intelmicroprocessorsfor optimizations that are not
unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Inteldoes not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors
not manufactured by Intel.
Microprocessor-dependentoptimizations in this product are intended for use with Intelmicroprocessors. Certain
optimizations not specific to Intelmicroarchitecture are reserved for Intelmicroprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
• Intel technologies may require enabled hardware, specific software, or servicesactivation. Checkwith your system
manufacturer or retailer.
• No computer systemcan be absolutely secure. Inteldoes not assumeany liability for lost or stolen data or systems or any
damages resulting from such losses.
• You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning
Intel products described herein. You agree to grant Intela non-exclusive, royalty-free license to any patent claim thereafter
drafted which includes subject matter disclosed herein.
• No license (express or implied, by estoppelor otherwise) to any intellectualpropertyrights is granted by this document.
• The products described may contain design defectsor errorsknownas errata which maycausethe product to deviate from
publish.
About me
Daoyuan Wang
• developer@Intel
• Focuses on Spark
optimization
• An active Spark
contributor since 2014
Yuanjian Li
• Baidu INF distributed
computation
• Apache Spark
contributor
• Baidu Spark team
leader
Agenda
• Background for OAP
• Key features
• Benchmark
• OAP and Spark in Baidu
• Future plans
Agenda
• Background for OAP
• Key features
• Benchmark
• OAP and Spark in Baidu
• Future plans
Data Analytics in Big Data Definition
• People wants OLAP
against large dataset
as fast as possible.
• People wants extract
information from new
coming data as soon
as possible.
Data Analytics Acceleration is
Required by Spark Users
http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/2016_Spark_Survey/2016_Spark_Infographic.pdf
Emerging hardware technology
Intel® Optane™ Technology
Data Center Solutions
Accelerate applications for
fast caching and storage,
reduce transaction costs for
latency-sensitive workloads
and increase scale per server.
Intel® Optane™ technology
allows data centers to deploy
bigger and more affordable
datasets to gain new insights
from large memory pools.
Our proposal – OAP
Spark* Job Server
Spark SQL / StructuredStreaming/ Core
Cassandra* HBase*Redis*Alluxio*
HDFS* S3* … Storage Layer
Hive* Table Parquet * JSON * ORC *
Redis *
Connector
Cassandra *
Connector
OAP (Codename “Spinach”)
• IndexedDataSource / CacheAware
• RDMA, QAT, ISA-L,FPGA …
• User Customized Indices
• Columnar formats & supportParquet, ORC
• Runtime ComputingV.S.Data Store
• Columnar Fine-grainedCache
• Spark Executor in-process Cache
• 3D Xpoint (APP Direct Mode)
• Auto tuningbasedonperiodicaljobhistory
• K8S Integration/ AES-NI Encryption
Why OAP
Low cost
• Makes full use of
existing hardware
• Open source
Good
Performance
• Index just like
traditional database
• Up to 5x boost in
real-world
Easy to Use
• Easy to deploy
• Easy to maintain
• Easy to learn
Agenda
• Background for OAP
• Key features
• Benchmark
• OAP and Spark in Baidu
• Future plans
A Simple Example
1. Run with OAP
$SPARK_HOME/sbin/start-thriftserver --package oap.jar;
2. Create a OAP table
beeline> CREATE TABLE src(a: Int, b: String) USING spn;
3. Create a single column B+ Tree index
beeline> CREATE SINDEX idx_1 ON src (a) USING BTREE;
4. Insert data
beeline> INSERT INTO TABLE src SELECT key, value FROM xxx;
5. Refresh index
beeline> REFRESH SINDEX on src;
6. Execution would automatically utilize index
beeline> SELECT MAX(value), MIN(value) FROM src WHERE a > 100 and a <
1000;
OAP Files and Fibers
Column (Fiber) #1
Column (Fiber) #2
Column (Fiber) #N
RowGroup #1
…RowGroup #2
RowGroup #N
Index meta
statistics
Index data
structure
(Index Fiber)
One Index file
for every data
file
Index meta
statistics
Index data
structure
(Index Fiber)
OAP meta file
OAP data
files
OAP
index files
OAP
index files
14
OAP Internals - index
Spark predicate
push down
FilteredScan
Read OAP Meta
Available
index?
read statistics
before use index
Get Local RowID
from index
Full table scan
Access data file for
RowIDs directly
Y
N
OAP cached access
Index selection
Supports Btree Index
and BitMap Index, find
best match among all
created indices
Supports statistics such
as MinMax, PartbyValue,
Sample, BloomFilter
Only reads data fibers
we need and puts those
fibers into cache (in-
memory fiber)
OAP compatible layer
RowGroup #k
RowGroup #1
RowGroup #2
Parquet compatible layer
Read row #m from parquet file
Find Row group #k
Read row group and
get specific rows
Parquet data file
Cache
OAP Data locality
Spark	as	a	Service
Meta Data
FiberCacheManager
Executor
Index
Storage(HDFS	/	S3	/	OSS)
SpinachContext	(Driver)
FiberSensor
HeartBeat
Agenda
• Background for OAP
• Key features
• Benchmark
• OAP and Spark in Baidu
• Future plans
Performance
72.083
7.095
2.304
0
10
20
30
40
50
60
70
80
Parquet Vectorized Read OAP Indexed Read OAP Indexed Read with
Fiber Cache
QueryTime(seconds)
OAP Index And Cache Performance
Cluster:
1 Master + 2 Slaves
Hardware:
CPU – 2x E5-2699 v4
RAM – 256 GB
Storage – S3610 1.6TB
Data:
300GB (Compressed Parquet)
2 Billion Records
Agenda
• Background for OAP
• Key features
• Benchmark
• OAP and Spark in Baidu
• Future plans
Spark In Baidu
• Spark import to Baidu
• Version: 0.8
80
1000
3000
6500
50 300
1500
5800
0
1000
2000
3000
4000
5000
6000
7000
Nodes Jobs/day
2014 2015 2016 2017
• Build standalone
cluster
• Integratewith in-
houseFSPub-
SubDW
• Version: 1.4
• Build Cluster over
YARN
• Integratewith in-
houseResource
Scheduler System
• Version: 1.6
• SQLGraph Service
over Spark
• OAP
• Version: 2.1
Baidu Big SQLBaiduBigSQL
Web UI Restful API
BBS HTTPServer
BBS Worker BBS Worker BBS Worker
BBS Master
Cache & Index Layer(OAP)
Spark Over Yarn
Roll Up Table Layer
API Layer:
• Meta Control API
• Job API:
LoadExportQueryInde
x Control
Control Layer:
• Meta Control
• Job Scheduler
• Spark Driver
• Query Classification
Boosting Layer:
• Roll Up Table
Management
• Roll Up Query
Change
• Index CreateUpdate
• CacheHit
Baidu Big SQL
Query Physical Queue(FAIR)
Import Physical Queues
BBS Worker
Big Query
Pool Small Query Pool
Index Create
Pool
BBS Master
Import Physical Queues
Load Physical Queues
Spark Over YARN
Data Sources
Logs DW
Load Job
alter table create indexclassify query
Resource Management & Isolation
Query Job
Introductory Story
Introductory Story
Get the top 10
charge sum and
correspond
advertiser which
triggered by the
query word‘flower’
• Create index on ‘userid’ column
• Various index types to choosefor
different fields types
• ×5 speed boosting than native
spark sql, ×80 than MR Job
• 3 day baidu charging log, 4TB
data,70000+files, query timein
10~15s
Roll Up Table Layer
date userid searchid baiduid cmatch
…
…
shows clicks charge
1 1 1 10 2 10 1 5
1 1 2 11 3 10 1 5
1 1 3 12 2 10 1 5
1 1 4 13 1 10 1 5
1 1 5 14 1 10 1 5
1 2 6 14 2 10 1 5
1 2 7 15 3 10 1 5
1 2 8 16 4 10 1 5
1 2 9 17 5 10 1 5
700+ Columns
99% query only use <10 columns
Select date,userid,shows,clicks,charge from…
date userid shows clicks charge
1 1 50 5 25
1 2 40 4 20
Multi Roll Up Table
(user-transparent)
date cmatch shows clicks charge
1 1 20 2 10
1 2 30 3 15
1 3 20 2 10
1 4 10 1 5
1 5 10 1 5
OAP In BigSQL
… Name Department Age …
… … … … …
… John INF 35 …
… Michelle AI-Lab 29 …
… Amy INF 42 …
… Kim AI-Lab 27 …
… Mary AI-Lab 47 …
… … … … …
DataFile
IndexFile
Sorted Age Row Index
in Data File
27 3
29 1
35 0
42 2
45 4
Department Bit Array
INF 10100
AI-Lab 01011
Index Build
NormalTableScan
UseIndex
Skippable Reader
Select xxx from xxx where age > 29 and department in (INF, AI-Lab)
OAP In BigSQL
… Name Department Age …
… … … … …
… John INF 35 …
… Michelle AI-Lab 29 …
… Amy INF 42 …
… Kim AI-Lab 27 …
… Mary AI-Lab 47 …
… … … … …
DataFile
InMemoryCache
Load Cache
Department Row Index
in Data File
INF 2
AI-Lab 3
Age Row Index
in Data File
35 0
29 1
BBS’s Contribute to Spark
• Spark-4502
Spark SQL reads unneccesary nested fields from Parquet
• Spark-18700
getCached in HiveMetastoreCatalog not thread safe cause driver OOM
• Spark-20408
Get glob path in parallel to reduce resolve relation time
• …
Agenda
• Background for OAP
• Key features
• Benchmark
• OAP and Spark in Baidu
• Future plans
Future plans
• Compatible with more data formats
• Explicit cache and cache management
• Optimize SQL operators (join, aggregate) with index
• Integrate with structured streaming
• Utilize Latest hardware technology, such as Intel QAT
or 3D XPoint.
• Welcome to contribute!
https://github.com/Intel-bigdata/OAP
Thank You.
daoyuan.wang@intel.com
liyuanjian@baidu.com

Contenu connexe

Tendances

Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarDatabricks
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangDatabricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuDatabricks
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPDatabricks
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Databricks
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Spark Summit
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzDatabricks
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin Databricks
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Spark Summit
 

Tendances (20)

Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...
 
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Ami...
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 

Similaire à OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yuanjian Li

Hyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache SparkHyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache SparkDatabricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Hirofumi Iwasaki
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareDatabricks
 
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Leveraging Apache Spark to Develop AI-Enabled Products and Services at BoschLeveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Leveraging Apache Spark to Develop AI-Enabled Products and Services at BoschDatabricks
 
MySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics ImprovementsMySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics ImprovementsMorgan Tocker
 
MySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA optionsMySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA optionsTed Wennmark
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
Tony Reid Resume
Tony Reid ResumeTony Reid Resume
Tony Reid Resumestoryhome
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel IT Center
 
Intel® Xeon® processor E7-8800/4800 v3 Application Showcase
Intel® Xeon® processor E7-8800/4800 v3 Application ShowcaseIntel® Xeon® processor E7-8800/4800 v3 Application Showcase
Intel® Xeon® processor E7-8800/4800 v3 Application ShowcaseIntel IT Center
 
Getting Started with Splunk Breakout Session
Getting Started with Splunk Breakout SessionGetting Started with Splunk Breakout Session
Getting Started with Splunk Breakout SessionSplunk
 
香港六合彩
香港六合彩香港六合彩
香港六合彩taoyan
 
M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...
M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...
M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...MariaDB plc
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)Cheer Chain Enterprise Co., Ltd.
 

Similaire à OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yuanjian Li (20)

Hyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache SparkHyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache Spark
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why Care
 
JSSUG: SQL Sever Index Tuning
JSSUG: SQL Sever Index TuningJSSUG: SQL Sever Index Tuning
JSSUG: SQL Sever Index Tuning
 
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Leveraging Apache Spark to Develop AI-Enabled Products and Services at BoschLeveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
 
MySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics ImprovementsMySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics Improvements
 
MySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA optionsMySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA options
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Tony Reid Resume
Tony Reid ResumeTony Reid Resume
Tony Reid Resume
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
 
Intel® Xeon® processor E7-8800/4800 v3 Application Showcase
Intel® Xeon® processor E7-8800/4800 v3 Application ShowcaseIntel® Xeon® processor E7-8800/4800 v3 Application Showcase
Intel® Xeon® processor E7-8800/4800 v3 Application Showcase
 
Getting Started with Splunk Breakout Session
Getting Started with Splunk Breakout SessionGetting Started with Splunk Breakout Session
Getting Started with Splunk Breakout Session
 
香港六合彩
香港六合彩香港六合彩
香港六合彩
 
M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...
M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...
M|18 Intel and MariaDB: Strategic Collaboration to Enhance MariaDB Functional...
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
Intel Parallel Studio XE 2016 網路開發工具包新版本功能介紹(現已上市,歡迎詢價)
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 

Dernier (20)

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yuanjian Li

  • 1. Daoyuan Wang (Intel) Yuanjian Li (Baidu) OAP: Optimized Analytics Package for Spark Platform
  • 2. Notice and Disclaimers: • Intel, the Intel logo are trademarks of IntelCorporation in the U.S. and/or other countries. *Othernames and brandsmay be claimed as the property of others. See Trademarkson intel.com for fulllist of Intel trademarks. • Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intelmicroprocessorsfor optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Inteldoes not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependentoptimizations in this product are intended for use with Intelmicroprocessors. Certain optimizations not specific to Intelmicroarchitecture are reserved for Intelmicroprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. • Intel technologies may require enabled hardware, specific software, or servicesactivation. Checkwith your system manufacturer or retailer. • No computer systemcan be absolutely secure. Inteldoes not assumeany liability for lost or stolen data or systems or any damages resulting from such losses. • You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intela non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. • No license (express or implied, by estoppelor otherwise) to any intellectualpropertyrights is granted by this document. • The products described may contain design defectsor errorsknownas errata which maycausethe product to deviate from publish.
  • 3. About me Daoyuan Wang • developer@Intel • Focuses on Spark optimization • An active Spark contributor since 2014 Yuanjian Li • Baidu INF distributed computation • Apache Spark contributor • Baidu Spark team leader
  • 4. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  • 5. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  • 6. Data Analytics in Big Data Definition • People wants OLAP against large dataset as fast as possible. • People wants extract information from new coming data as soon as possible.
  • 7. Data Analytics Acceleration is Required by Spark Users http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/2016_Spark_Survey/2016_Spark_Infographic.pdf
  • 8. Emerging hardware technology Intel® Optane™ Technology Data Center Solutions Accelerate applications for fast caching and storage, reduce transaction costs for latency-sensitive workloads and increase scale per server. Intel® Optane™ technology allows data centers to deploy bigger and more affordable datasets to gain new insights from large memory pools.
  • 9. Our proposal – OAP Spark* Job Server Spark SQL / StructuredStreaming/ Core Cassandra* HBase*Redis*Alluxio* HDFS* S3* … Storage Layer Hive* Table Parquet * JSON * ORC * Redis * Connector Cassandra * Connector OAP (Codename “Spinach”) • IndexedDataSource / CacheAware • RDMA, QAT, ISA-L,FPGA … • User Customized Indices • Columnar formats & supportParquet, ORC • Runtime ComputingV.S.Data Store • Columnar Fine-grainedCache • Spark Executor in-process Cache • 3D Xpoint (APP Direct Mode) • Auto tuningbasedonperiodicaljobhistory • K8S Integration/ AES-NI Encryption
  • 10. Why OAP Low cost • Makes full use of existing hardware • Open source Good Performance • Index just like traditional database • Up to 5x boost in real-world Easy to Use • Easy to deploy • Easy to maintain • Easy to learn
  • 11. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  • 12. A Simple Example 1. Run with OAP $SPARK_HOME/sbin/start-thriftserver --package oap.jar; 2. Create a OAP table beeline> CREATE TABLE src(a: Int, b: String) USING spn; 3. Create a single column B+ Tree index beeline> CREATE SINDEX idx_1 ON src (a) USING BTREE; 4. Insert data beeline> INSERT INTO TABLE src SELECT key, value FROM xxx; 5. Refresh index beeline> REFRESH SINDEX on src; 6. Execution would automatically utilize index beeline> SELECT MAX(value), MIN(value) FROM src WHERE a > 100 and a < 1000;
  • 13. OAP Files and Fibers Column (Fiber) #1 Column (Fiber) #2 Column (Fiber) #N RowGroup #1 …RowGroup #2 RowGroup #N Index meta statistics Index data structure (Index Fiber) One Index file for every data file Index meta statistics Index data structure (Index Fiber) OAP meta file OAP data files OAP index files OAP index files
  • 14. 14 OAP Internals - index Spark predicate push down FilteredScan Read OAP Meta Available index? read statistics before use index Get Local RowID from index Full table scan Access data file for RowIDs directly Y N OAP cached access Index selection Supports Btree Index and BitMap Index, find best match among all created indices Supports statistics such as MinMax, PartbyValue, Sample, BloomFilter Only reads data fibers we need and puts those fibers into cache (in- memory fiber)
  • 15. OAP compatible layer RowGroup #k RowGroup #1 RowGroup #2 Parquet compatible layer Read row #m from parquet file Find Row group #k Read row group and get specific rows Parquet data file Cache
  • 16. OAP Data locality Spark as a Service Meta Data FiberCacheManager Executor Index Storage(HDFS / S3 / OSS) SpinachContext (Driver) FiberSensor HeartBeat
  • 17. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  • 18. Performance 72.083 7.095 2.304 0 10 20 30 40 50 60 70 80 Parquet Vectorized Read OAP Indexed Read OAP Indexed Read with Fiber Cache QueryTime(seconds) OAP Index And Cache Performance Cluster: 1 Master + 2 Slaves Hardware: CPU – 2x E5-2699 v4 RAM – 256 GB Storage – S3610 1.6TB Data: 300GB (Compressed Parquet) 2 Billion Records
  • 19. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  • 20. Spark In Baidu • Spark import to Baidu • Version: 0.8 80 1000 3000 6500 50 300 1500 5800 0 1000 2000 3000 4000 5000 6000 7000 Nodes Jobs/day 2014 2015 2016 2017 • Build standalone cluster • Integratewith in- houseFSPub- SubDW • Version: 1.4 • Build Cluster over YARN • Integratewith in- houseResource Scheduler System • Version: 1.6 • SQLGraph Service over Spark • OAP • Version: 2.1
  • 21. Baidu Big SQLBaiduBigSQL Web UI Restful API BBS HTTPServer BBS Worker BBS Worker BBS Worker BBS Master Cache & Index Layer(OAP) Spark Over Yarn Roll Up Table Layer API Layer: • Meta Control API • Job API: LoadExportQueryInde x Control Control Layer: • Meta Control • Job Scheduler • Spark Driver • Query Classification Boosting Layer: • Roll Up Table Management • Roll Up Query Change • Index CreateUpdate • CacheHit
  • 22. Baidu Big SQL Query Physical Queue(FAIR) Import Physical Queues BBS Worker Big Query Pool Small Query Pool Index Create Pool BBS Master Import Physical Queues Load Physical Queues Spark Over YARN Data Sources Logs DW Load Job alter table create indexclassify query Resource Management & Isolation Query Job
  • 24. Introductory Story Get the top 10 charge sum and correspond advertiser which triggered by the query word‘flower’ • Create index on ‘userid’ column • Various index types to choosefor different fields types • ×5 speed boosting than native spark sql, ×80 than MR Job • 3 day baidu charging log, 4TB data,70000+files, query timein 10~15s
  • 25. Roll Up Table Layer date userid searchid baiduid cmatch … … shows clicks charge 1 1 1 10 2 10 1 5 1 1 2 11 3 10 1 5 1 1 3 12 2 10 1 5 1 1 4 13 1 10 1 5 1 1 5 14 1 10 1 5 1 2 6 14 2 10 1 5 1 2 7 15 3 10 1 5 1 2 8 16 4 10 1 5 1 2 9 17 5 10 1 5 700+ Columns 99% query only use <10 columns Select date,userid,shows,clicks,charge from… date userid shows clicks charge 1 1 50 5 25 1 2 40 4 20 Multi Roll Up Table (user-transparent) date cmatch shows clicks charge 1 1 20 2 10 1 2 30 3 15 1 3 20 2 10 1 4 10 1 5 1 5 10 1 5
  • 26. OAP In BigSQL … Name Department Age … … … … … … … John INF 35 … … Michelle AI-Lab 29 … … Amy INF 42 … … Kim AI-Lab 27 … … Mary AI-Lab 47 … … … … … … DataFile IndexFile Sorted Age Row Index in Data File 27 3 29 1 35 0 42 2 45 4 Department Bit Array INF 10100 AI-Lab 01011 Index Build NormalTableScan UseIndex Skippable Reader Select xxx from xxx where age > 29 and department in (INF, AI-Lab)
  • 27. OAP In BigSQL … Name Department Age … … … … … … … John INF 35 … … Michelle AI-Lab 29 … … Amy INF 42 … … Kim AI-Lab 27 … … Mary AI-Lab 47 … … … … … … DataFile InMemoryCache Load Cache Department Row Index in Data File INF 2 AI-Lab 3 Age Row Index in Data File 35 0 29 1
  • 28. BBS’s Contribute to Spark • Spark-4502 Spark SQL reads unneccesary nested fields from Parquet • Spark-18700 getCached in HiveMetastoreCatalog not thread safe cause driver OOM • Spark-20408 Get glob path in parallel to reduce resolve relation time • …
  • 29. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  • 30. Future plans • Compatible with more data formats • Explicit cache and cache management • Optimize SQL operators (join, aggregate) with index • Integrate with structured streaming • Utilize Latest hardware technology, such as Intel QAT or 3D XPoint. • Welcome to contribute! https://github.com/Intel-bigdata/OAP
  • 31.