SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
EFFECTIVE SPARK WITH ALLUXIO
Strata + Hadoop World San Jose 2017
Calvin Jia, Alluxio Inc.
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
2
BIG DATA ECOSYSTEM TODAYBIG DATA ECOSYSTEM WITH ALLUXIO
…
…
FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System
Unifying Data at Memory Speed
BIG DATA ECOSYSTEM ISSUES
GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface
3
BIG DATA ECOSYSTEM YESTERDAY
4
• Fastest growing open-
source project in the
big data ecosystem
• 400+ contributors from
100+ organizations
• Running in large
production clusters
• Community members
are welcome!
FASTEST GROWING BIG DATA PROJECTS
Popular Open Source Projects’ Growth
Months
NumberofContributors
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
5
ACCELERATE I/O TO/FROM REMOTE STORAGE
The performance was amazing. With Spark
SQL alone, it took 100-150 seconds to finish a
query; using Alluxio, where data may hit local
or remote Alluxio nodes, it took 10-15 seconds.
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over
50TB of RAM space
• By using Alluxio, batch queries usually lasting
over 15 minutes were transformed into an
interactive query taking less than 30 seconds
Baidu’s PMs and analysts run
interactive queries to gain insights
into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
ALLUXIO
Baidu File System
6
GUARANTEE PERFORMANCE TO MEET SLAS
RESULTS
• Introducing Alluxio provides the system with
smooth performance in an environment with
highly variable load
• Without Alluxio, the performance of critical
jobs could take more than 4x the expected
time
• With Alluxio’s guaranteed performance,
business logic could be reliably be executed
leading to improved user experience
• Real-time machine learning
• 10+ TB of storage
• Mix of Memory + HDD
A leading online retailer uses
Alluxio to compute conversion
attribution
7
ALLUXIO
SHARE DATA ACROSS JOBS @ MEMORY SPEED
Thanks to Alluxio, we now have the raw data
immediately available at every iteration and
we can skip the costs of loading in terms of
time waiting, network traffic, and RDBMS
activity.
- Barclays
RESULTS
• Barclays workflow iteration time decreased
from hours to seconds
• Alluxio enabled workflows that were
impossible before
• By keeping data only in memory, the I/O cost
of loading and storing in Alluxio is now on the
order of seconds
Barclays uses query and machine
learning to train models for risk
management
• 6 node deployment
• 1TB of storage
• Memory only
ALLUXIO
Relational Database
8
TRANSPARENTLY MANAGE DATA ACROSS
STORAGE SYSTEMS
We’ve been running Alluxio in production for
over 9 months, Alluxio’s unified namespace
enable different applications and
frameworks to easily interact with data from
different storage systems.
- Qunar
RESULTS
• Data sharing among Spark Streaming, Spark
batch and Flink jobs provide efficient data
sharing
• Improved the performance of their system
with 15x – 300x speedups
• Tiered storage feature manages storage
resources including memory, SSD and disk
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
ALLUXIO
Qunar uses real-time machine
learning for their website ads.
9
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
10
CONSOLIDATING MEMORY
11
Storage Engine &
Execution Engine
Same Process
• Two copies of data in memory – double the memory used
• Inter-process Sharing Slowed Down by Network / Disk I/O
Spark Compute
Spark
Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Spark Compute
Spark
Storage
block 1
block 3
CONSOLIDATING MEMORY
12
Storage Engine &
Execution Engine
Different process
• Half the memory used
• Inter-process Sharing Happens at Memory Speed
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block  1
block  3
block  2
block  4
Alluxio
block 1
block 3 block 4
Spark Compute
Spark Storage
DATA RESILIENCE DURING CRASH
13
Spark Compute
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
DATA RESILIENCE DURING CRASH
14
CRASH
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
• Process Crash Requires Network and/or Disk I/O to Re-read
Data
Storage Engine &
Execution Engine
Same Process
DATA RESILIENCE DURING CRASH
15
CRASH
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read
Data
DATA RESILIENCE DURING CRASH
16
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block  1
block  3
block  2
block  4
Alluxio
block 1
block 3 block 4
Storage Engine &
Execution Engine
Different process
DATA RESILIENCE DURING CRASH
17
• Process Crash – Data is Re-read at Memory Speed
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block  1
block  3
block  2
block  4
Alluxio
block 1
block 3 block 4
CRASH Storage Engine &
Execution Engine
Different process
ACCESSING ALLUXIO DATA FROM SPARK
18
Writing Data Write to an Alluxio file
Reading Data Read from an Alluxio file
CODE EXAMPLE FOR SPARK RDDS
19
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from
Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
CODE EXAMPLE FOR SPARK DATAFRAMES
20
Writing to Alluxio df.write.parquet(alluxioPath)
Reading from Alluxio df = sc.read.parquet(alluxioPath)
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
21
ENVIRONMENT
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
• Alluxio
• Spark Storage Level: MEMORY_ONLY
• Spark Storage Level: MEMORY_ONLY_SER
• Spark Storage Level: DISK_ONLY
22
0
50
100
150
200
250
0 10 20 30 40 50
Time[seconds]
RDD Size [GB]
READING CACHEDRDD
Alluxio (textFile) Alluxio (objectFile) DISK_ONLY
MEMORY_ONLY_SER MEMORY_ONLY
23
24
0 50 100 150 200 250
Alluxio
(textFile)
Alluxio
(objectFile)
No Alluxio
Time [seconds]
NEW CONTEXT: READ 50 GB RDD (SSD)
25
0 100 200 300 400 500 600 700 800
Alluxio
(textFile)
Alluxio
(objectFile)
No Alluxio
Time [seconds]
NEW CONTEXT: READ 50 GB RDD (S3)
7x speedup
16x speedup
26
0
50
100
150
200
250
0 10 20 30 40 50
Time[seconds]
DataFrame Size [GB]
READING CACHED DATAFRAME
(PARQUET)
Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY
27
0 50 100 150 200 250
Alluxio
No Alluxio
Time [seconds]
NEW CONTEXT: READ 50 GB DATAFRAME
(SSD)
28
0 250 500 750 1000 1250 1500 1750
Alluxio
No Alluxio
Time [seconds]
NEW CONTEXT: READ 50 GB DATAFRAME
(S3)
10x average speedup, 17x peak speedup
CONCLUSION
• Easy to use with Spark
• Predictable and improved performance
• Easily connect to various storages
29
Contact: calvin@alluxio.com
Twitter: @jiacalvin
Websites: www.alluxio.com and www.alluxio.org
Thank you!
30
31
Unification
New workflows across any
data in any storage system
Orders of magnitude
improvement in run time
Choice in compute and
storage – grow each
independently, buy only
what is needed
Performance Flexibility
BENEFITS

Contenu connexe

Tendances

Tendances (20)

Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri Simsa
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
The Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersThe Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand Clusters
 
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
 
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
 
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
 
Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road AheadAlluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
 
Spark Pipelines in the Cloud with Alluxio
Spark Pipelines in the Cloud with AlluxioSpark Pipelines in the Cloud with Alluxio
Spark Pipelines in the Cloud with Alluxio
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
 
Open Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageOpen Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed Storage
 
Running Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with AlluxioRunning Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with Alluxio
 
Alluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for DaskAlluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for Dask
 

En vedette

En vedette (19)

Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3
 
Immobilization techniques for dna biosensor lecture 6
Immobilization techniques for dna biosensor lecture 6Immobilization techniques for dna biosensor lecture 6
Immobilization techniques for dna biosensor lecture 6
 
Geopolitica y narcotrafico web
Geopolitica y narcotrafico webGeopolitica y narcotrafico web
Geopolitica y narcotrafico web
 
cell-based-biosensor-lecture-7
cell-based-biosensor-lecture-7cell-based-biosensor-lecture-7
cell-based-biosensor-lecture-7
 
Gantt 1º semestre 2017
Gantt 1º semestre 2017Gantt 1º semestre 2017
Gantt 1º semestre 2017
 
MUSEO TECNOLOGICO
MUSEO TECNOLOGICOMUSEO TECNOLOGICO
MUSEO TECNOLOGICO
 
Sistema kanban justo a tiempo
Sistema kanban justo a tiempoSistema kanban justo a tiempo
Sistema kanban justo a tiempo
 
La crisis de la restauración (1902 1931)
La crisis de la restauración (1902 1931)La crisis de la restauración (1902 1931)
La crisis de la restauración (1902 1931)
 
Prisma 6
Prisma 6Prisma 6
Prisma 6
 
introduction-to-biochips-lecture-8
 introduction-to-biochips-lecture-8 introduction-to-biochips-lecture-8
introduction-to-biochips-lecture-8
 
Funciones de demanda y oferta en equilibrio
Funciones de demanda y oferta en equilibrioFunciones de demanda y oferta en equilibrio
Funciones de demanda y oferta en equilibrio
 
GSK Oncology 2017
GSK Oncology 2017GSK Oncology 2017
GSK Oncology 2017
 
Yace 3.0
Yace 3.0Yace 3.0
Yace 3.0
 
El aborto
El aborto El aborto
El aborto
 
DNA-biosensors lecture 5
DNA-biosensors lecture 5DNA-biosensors lecture 5
DNA-biosensors lecture 5
 
Contaminación en los ríos de cali
Contaminación en los ríos de caliContaminación en los ríos de cali
Contaminación en los ríos de cali
 
Caja de herramientas de seguridad digital
Caja de herramientas de seguridad digitalCaja de herramientas de seguridad digital
Caja de herramientas de seguridad digital
 
Peegar: a new prototyping starter kit for everyone
Peegar: a new prototyping starter kit for everyonePeegar: a new prototyping starter kit for everyone
Peegar: a new prototyping starter kit for everyone
 
Thompson libro del primer grado
Thompson libro del primer gradoThompson libro del primer grado
Thompson libro del primer grado
 

Similaire à Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017

Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Data Con LA
 

Similaire à Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017 (20)

Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri Simsa
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Speeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China UnicomSpeeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China Unicom
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
 
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin Fan
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle Meetup
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
 
Webinar: Cloud Archiving – Amazon Glacier, Microsoft Azure or Something Else?
Webinar: Cloud Archiving – Amazon Glacier, Microsoft Azure or Something Else?Webinar: Cloud Archiving – Amazon Glacier, Microsoft Azure or Something Else?
Webinar: Cloud Archiving – Amazon Glacier, Microsoft Azure or Something Else?
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
10 Reasons Snowflake Is Great for Analytics
10 Reasons Snowflake Is Great for Analytics10 Reasons Snowflake Is Great for Analytics
10 Reasons Snowflake Is Great for Analytics
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, Alluxio
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 

Plus de Alluxio, Inc.

Plus de Alluxio, Inc. (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 

Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017

  • 1. EFFECTIVE SPARK WITH ALLUXIO Strata + Hadoop World San Jose 2017 Calvin Jia, Alluxio Inc.
  • 2. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 2
  • 3. BIG DATA ECOSYSTEM TODAYBIG DATA ECOSYSTEM WITH ALLUXIO … … FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System Unifying Data at Memory Speed BIG DATA ECOSYSTEM ISSUES GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface 3 BIG DATA ECOSYSTEM YESTERDAY
  • 4. 4 • Fastest growing open- source project in the big data ecosystem • 400+ contributors from 100+ organizations • Running in large production clusters • Community members are welcome! FASTEST GROWING BIG DATA PROJECTS Popular Open Source Projects’ Growth Months NumberofContributors
  • 5. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 5
  • 6. ACCELERATE I/O TO/FROM REMOTE STORAGE The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Baidu RESULTS • Data queries are now 30x faster with Alluxio • Alluxio cluster runs stably, providing over 50TB of RAM space • By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds Baidu’s PMs and analysts run interactive queries to gain insights into their products and business • 200+ nodes deployment • 2+ petabytes of storage • Mix of memory + HDD ALLUXIO Baidu File System 6
  • 7. GUARANTEE PERFORMANCE TO MEET SLAS RESULTS • Introducing Alluxio provides the system with smooth performance in an environment with highly variable load • Without Alluxio, the performance of critical jobs could take more than 4x the expected time • With Alluxio’s guaranteed performance, business logic could be reliably be executed leading to improved user experience • Real-time machine learning • 10+ TB of storage • Mix of Memory + HDD A leading online retailer uses Alluxio to compute conversion attribution 7 ALLUXIO
  • 8. SHARE DATA ACROSS JOBS @ MEMORY SPEED Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Barclays RESULTS • Barclays workflow iteration time decreased from hours to seconds • Alluxio enabled workflows that were impossible before • By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds Barclays uses query and machine learning to train models for risk management • 6 node deployment • 1TB of storage • Memory only ALLUXIO Relational Database 8
  • 9. TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS We’ve been running Alluxio in production for over 9 months, Alluxio’s unified namespace enable different applications and frameworks to easily interact with data from different storage systems. - Qunar RESULTS • Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing • Improved the performance of their system with 15x – 300x speedups • Tiered storage feature manages storage resources including memory, SSD and disk • 200+ nodes deployment • 6 billion logs (4.5 TB) daily • Mix of Memory + HDD ALLUXIO Qunar uses real-time machine learning for their website ads. 9
  • 10. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 10
  • 11. CONSOLIDATING MEMORY 11 Storage Engine & Execution Engine Same Process • Two copies of data in memory – double the memory used • Inter-process Sharing Slowed Down by Network / Disk I/O Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Spark Compute Spark Storage block 1 block 3
  • 12. CONSOLIDATING MEMORY 12 Storage Engine & Execution Engine Different process • Half the memory used • Inter-process Sharing Happens at Memory Speed Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block  1 block  3 block  2 block  4 Alluxio block 1 block 3 block 4 Spark Compute Spark Storage
  • 13. DATA RESILIENCE DURING CRASH 13 Spark Compute Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process
  • 14. DATA RESILIENCE DURING CRASH 14 CRASH Spark Storage block 1 block 3 HDFS / Amazon S3 block 1 block 3 block 2 block 4 • Process Crash Requires Network and/or Disk I/O to Re-read Data Storage Engine & Execution Engine Same Process
  • 15. DATA RESILIENCE DURING CRASH 15 CRASH HDFS / Amazon S3 block 1 block 3 block 2 block 4 Storage Engine & Execution Engine Same Process • Process Crash Requires Network and/or Disk I/O to Re-read Data
  • 16. DATA RESILIENCE DURING CRASH 16 Spark Compute Spark Storage HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block  1 block  3 block  2 block  4 Alluxio block 1 block 3 block 4 Storage Engine & Execution Engine Different process
  • 17. DATA RESILIENCE DURING CRASH 17 • Process Crash – Data is Re-read at Memory Speed HDFS / Amazon S3 block 1 block 3 block 2 block 4 HDFS disk block  1 block  3 block  2 block  4 Alluxio block 1 block 3 block 4 CRASH Storage Engine & Execution Engine Different process
  • 18. ACCESSING ALLUXIO DATA FROM SPARK 18 Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file
  • 19. CODE EXAMPLE FOR SPARK RDDS 19 Writing RDD to Alluxio rdd.saveAsTextFile(alluxioPath) rdd.saveAsObjectFile(alluxioPath) Reading RDD from Alluxio rdd = sc.textFile(alluxioPath) rdd = sc.objectFile(alluxioPath)
  • 20. CODE EXAMPLE FOR SPARK DATAFRAMES 20 Writing to Alluxio df.write.parquet(alluxioPath) Reading from Alluxio df = sc.read.parquet(alluxioPath)
  • 21. OUTLINE • Alluxio Overview • Alluxio + Spark Use Cases • Using Alluxio with Spark • Performance Evaluation 21
  • 22. ENVIRONMENT Spark 2.0.0 + Alluxio 1.2.0 Single worker: Amazon r3.2xlarge Comparisons: • Alluxio • Spark Storage Level: MEMORY_ONLY • Spark Storage Level: MEMORY_ONLY_SER • Spark Storage Level: DISK_ONLY 22
  • 23. 0 50 100 150 200 250 0 10 20 30 40 50 Time[seconds] RDD Size [GB] READING CACHEDRDD Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY 23
  • 24. 24 0 50 100 150 200 250 Alluxio (textFile) Alluxio (objectFile) No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB RDD (SSD)
  • 25. 25 0 100 200 300 400 500 600 700 800 Alluxio (textFile) Alluxio (objectFile) No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB RDD (S3) 7x speedup 16x speedup
  • 26. 26 0 50 100 150 200 250 0 10 20 30 40 50 Time[seconds] DataFrame Size [GB] READING CACHED DATAFRAME (PARQUET) Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY
  • 27. 27 0 50 100 150 200 250 Alluxio No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB DATAFRAME (SSD)
  • 28. 28 0 250 500 750 1000 1250 1500 1750 Alluxio No Alluxio Time [seconds] NEW CONTEXT: READ 50 GB DATAFRAME (S3) 10x average speedup, 17x peak speedup
  • 29. CONCLUSION • Easy to use with Spark • Predictable and improved performance • Easily connect to various storages 29
  • 30. Contact: calvin@alluxio.com Twitter: @jiacalvin Websites: www.alluxio.com and www.alluxio.org Thank you! 30
  • 31. 31 Unification New workflows across any data in any storage system Orders of magnitude improvement in run time Choice in compute and storage – grow each independently, buy only what is needed Performance Flexibility BENEFITS