Copyright © 2019, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Agenda
• Background and motivation
• Big data analytics on the cloud: the challenges & optimizations
• Accelerate big data analytics with Alluxio
• Summary
Challenges of scaling Hadoop* storage
Bounded storage and compute resources on Hadoop nodes bring challenges:
• Data/capacity
• Upgrade cost
• Space, power, utilization
• Multiple storage silos
• Inadequate performance
Typical challenges:
• Costs
• Provisioning and configuration
• Performance & efficiency
• Data capacity silos
4 big trends driving the need for a new architecture
• Separation of compute & storage
• Hybrid and multi-cloud environments
• Self-service data across the enterprise
• Rise of the object store
Discontinuity in big data infrastructure leads to different solutions
• Single large cluster: get a bigger cluster for many teams to share.
• Multiple small clusters: give each team its own dedicated cluster, each with a copy of PBs of data.
• On-demand analytic clusters: give teams the ability to spin up and spin down clusters that can share data sets.
Storage Disaggregation architecture
• Replace HDFS with Shared data lake
• Enables independent scale of compute and storage
• But does this architecture work?
[Figure: a compute layer (batch, streaming, interactive, machine learning, graph analytics) on top of a shared data lake storage layer]
Functionality challenges: troubleshooting configurations
[Chart: 1TB query success % (54 TPC-DS queries) for hive-parquet, spark-parquet, presto-parquet (untuned), presto-parquet (tuned)]
[Chart: 1TB & 10TB query success % (54 TPC-DS queries) for spark-parquet, spark-orc, presto-parquet, presto-parquet (tuned)]
[Chart: count of issues by type (scale 0 to 16): Ceph issue, compatibility issue, deployment issue, improper default configuration, middleware issue, runtime issue, S3a driver issue]
• Extensive tuning and troubleshooting were required to reach a 100% success ratio for the selected TPC-DS queries
• Improper default configuration
• Wrong middleware configuration
• Improper Hadoop/Spark configuration for different data sizes and formats
Deployment architecture challenges: multiple choices
The deployment architecture depends on the detailed HW configuration, application, and cost requirements.
[Figure: Architectures 1 to 5]
1: Dedicated load balancer
2: Round-robin DNS and dedicated gateway
3: Round-robin DNS, gateway co-located with the storage node
4: Fully disaggregated architecture, multiple storage solutions on the disaggregated storage node
5: Alluxio cache layer deployed on the compute node
Cloud adaptor challenges: S3A tuning as an example
S3A performance is dramatically worse than HDFS.
[Figure: disk and network I/O traces for remote HDFS and for S3A on Ceph, split into stage 0 and stage > 0; annotation: took 820 secs with a bandwidth of 120MB/s]
• Disk I/O and network I/O data show that read bandwidth on Ceph is extremely low, about 100MB/s vs. 3GB/s on HDFS.
• In our experience, Ceph is capable of driving disk bandwidth to more than 2GB/s.
• The S3A adaptor is the bottleneck.
<property>
<name>fs.s3a.readahead.range</name>
<value>1024K</value>
</property>
<property>
<name>fs.s3a.experimental.input.fadvise</name>
<value>random</value>
</property>
Increasing the readahead range decreases the number of connections S3A opens.
S3A performance improved 11.5x with this tuning.
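The two properties above can be kept in one place and rendered either as a Hadoop configuration fragment or as Spark options. The following is a minimal sketch, not the authors' tooling: the helper names `to_hadoop_properties` and `to_spark_conf` are made up for illustration, and the only behaviour assumed is that Spark forwards `spark.hadoop.*` options to the Hadoop configuration.

```python
import xml.etree.ElementTree as ET

# The two S3A settings shown on the slide.
S3A_TUNINGS = {
    "fs.s3a.readahead.range": "1024K",
    "fs.s3a.experimental.input.fadvise": "random",
}


def to_hadoop_properties(settings):
    """Build a Hadoop-style <configuration> element from a dict of settings."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


def to_spark_conf(settings):
    """Prefix each key with 'spark.hadoop.' so Spark passes it on to Hadoop."""
    return {f"spark.hadoop.{key}": value for key, value in settings.items()}


config_xml = ET.tostring(to_hadoop_properties(S3A_TUNINGS), encoding="unicode")
spark_conf = to_spark_conf(S3A_TUNINGS)

print(config_xml)
for key, value in spark_conf.items():
    print(f"--conf {key}={value}")
```

Keeping the values in a single dictionary makes it easier to try several readahead ranges and compare runs; the right value depends on the workload and file format.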
Performance gaps: usage cases
[Figure: a compute resource pool (ingest cluster, ETL/transformation cluster, batch query cluster, interactive query cluster, machine learning cluster) on top of disaggregated storage]
Simulating typical usage cases:
Simple read/write
§ Terasort: a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system.
TPC-DS derived tests: batch analytics
§ Consistently executing analytical processes over large sets of data.
§ UC11: 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets
§ I/O intensive queries: 9 I/O-intensive queries selected from TPC-DS
Kmeans
§ K-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters.
Performance gaps with storage disaggregation
• Storage disaggregation leads to performance regression
• Up to 10% for remote HDFS; Terasort performance is higher because usable memory increased
• Up to 60% for S3 object storage (optimized results; tuning gave up to an 11.5x performance boost over default parameters)
• One important cause of the performance gap: s3a does not support transactional writes
• Most big data software (Spark, Hive) relies on HDFS's atomic rename feature to support atomic writes
• At job submission, a commit protocol specifies how results are written at the end of the job: tasks first write output to temporary locations, and the data is moved (renamed) to its final location only upon task or job completion
• S3a implements this with: COPY+DELETE+HEAD+POST
• Although there are ongoing efforts to optimize the s3a adaptor, there is no near-term solution for the performance gap
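The cost difference between an atomic rename and an emulated one can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the real S3A or HDFS code: both store classes are hypothetical in-memory stand-ins, only the COPY and DELETE steps are counted (HEAD and POST are left out), and the commit is reduced to a single rename of the temporary directory.

```python
class HdfsLikeStore:
    """Toy hierarchical namespace: renaming a directory is one metadata operation."""

    def __init__(self):
        self.dirs = {}        # directory path -> {file name: bytes}
        self.operations = 0

    def write(self, directory, name, data):
        self.dirs.setdefault(directory, {})[name] = data
        self.operations += 1

    def rename(self, src, dst):
        self.dirs[dst] = self.dirs.pop(src)   # single step, nothing is copied
        self.operations += 1


class ObjectStore:
    """Toy flat key space: a "rename" is emulated by copying and deleting each object."""

    def __init__(self):
        self.objects = {}     # full key -> bytes
        self.operations = 0

    def write(self, directory, name, data):
        self.objects[f"{directory}/{name}"] = data
        self.operations += 1

    def rename(self, src, dst):
        for key in [k for k in self.objects if k.startswith(src + "/")]:
            new_key = dst + key[len(src):]
            self.objects[new_key] = self.objects[key]   # COPY
            self.operations += 1
            del self.objects[key]                       # DELETE
            self.operations += 1


def run_job(store, num_tasks):
    """Write task output to a temporary location, then commit by renaming."""
    for task in range(num_tasks):
        store.write("out/_temporary", f"part-{task:05d}", b"rows")
    before = store.operations
    store.rename("out/_temporary", "out/final")
    return store.operations - before   # operations spent on the commit itself


hdfs_commit_ops = run_job(HdfsLikeStore(), num_tasks=100)
s3_commit_ops = run_job(ObjectStore(), num_tasks=100)
print(hdfs_commit_ops, s3_commit_ops)   # prints "1 200"
```

In this model the commit cost on the object store grows with the number of output files, and in a real system each COPY also moves the object's bytes, while the rename on the hierarchical store stays a single step.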
Performance comparison of disaggregated analytics storage (normalized to spark(yarn) + local HDFS (HDD); higher is better)
• Batch query (54 queries): local HDFS (HDD) 1.0, remote HDFS (HDD) 0.9, S3 (HDD) 0.7
• I/O intensive (7 queries): local HDFS (HDD) 1.0, remote HDFS (HDD) 0.9, S3 (HDD) 0.6
• Terasort 1T: local HDFS (HDD) 1.0, remote HDFS (HDD) 1.1, S3 (HDD) 0.4
• KMeans 374g: local HDFS (HDD) 1.0, remote HDFS (HDD) 0.9, S3 (HDD) 0.5
Need to close the performance gap!
Alluxio-based in-memory data accelerator (IMDA)
[Figure, left: replace HDFS with disaggregated S3 object storage; batch, streaming, interactive, machine learning, and graph analytics workloads run on a shared data lake with s3a object storage]
[Figure, right: Alluxio-based in-memory acceleration layer; the same workloads run in a provisioned compute pool with an in-memory data accelerator in front of the shared data lake with s3a object storage]
Data Orchestration for the Cloud
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Independent scaling of compute & storage
Persistent Memory and RDMA
Persistent Memory:
• PMEM represents a new class of memory and storage technology architected specifically for data center usage
• Combination of high capacity, affordability, and persistence
RDMA: Remote Direct Memory Access
• Accessing (i.e. reading from or writing to) memory on a remote machine without interrupting the processing of the CPU(s) on that system.
• Zero-copy: applications transfer data without involving the network software stack; data is sent and received directly to and from the buffers without being copied between the network layers.
• Kernel bypass: applications transfer data directly from userspace, with no context switches.
• No CPU involvement: applications can access remote memory without consuming any CPU on the remote machine.
Picture source: https://software.intel.com/en-us/blogs/2018/10/30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture
Persistent memory operation modes
[Figure: Cascade Lake CPU with integrated memory controllers (IMC), populated with DDR4 DRAM* and DCPMM* DIMMs. * DIMM population shown as an example only.]
• DIMM capacity: 128, 256, 512GB
• Speed: 2666 MT/sec
• Capacity per CPU: 3TB (not including DRAM)
• DDR4 electrical & physical
• Close to DRAM latency
• Cache line size access
Flexible, usage-specific partitions of the non-volatile memory pool: AppDirect, Storage, Memory (DRAM, or DRAM as cache)
Modes:
1. Memory mode
2. App Direct mode
Storage over App Direct
Benefits:
● Large memory at lower cost
● Low latency persistent memory
● Fast direct-attach storage
● Persistent data for rapid recovery
Leveraging an in-memory data accelerator to accelerate intermediate data access
[Figure: software stack. Applications: batch, streaming, interactive, machine learning, graph analytics. Data processing & analysis: MR*, Storm*, Flink*, Giraph*, Spark Core (SQL*, Streaming*, MLlib*, GraphX*, DataFrame, ML Pipelines, SparkR*), Parquet*, Avro*. Resource management & coordination: ZooKeeper*, YARN*, k8s*. Acceleration layer: Alluxio*. Disaggregated storage: HBase*, Ceph*, HDFS*, OSS*. High-speed networking.]
• Leverage new HW technologies & products that deliver significant performance improvements
• Persistent memory, RDMA
• Use an Alluxio-based in-memory data accelerator layer to accelerate ephemeral data access
• Caching hot data in Alluxio shortens the I/O stack
• Unifies the underlying filesystems
• Fully leveraging these technologies to address the bottlenecks requires storage and network co-design
• Optimized libraries to bypass the filesystem and avoid user space/kernel space context switches
In-memory data accelerator (IMDA) architecture
Enable Alluxio with state-of-the-art HW technology
§ Alluxio as a lightweight, user-space I/O based distributed data store
– For ephemeral data access like cache, shuffle, spill
– Tiered storage: DRAM, persistent memory, and SSD
§ Persistent memory to enlarge compute storage with high performance and low cost
§ RDMA to avoid context switches (kernel bypass)
– Persistent memory mmap address as RDMA buffer to avoid memory copies
– Persistent memory as off-heap memory to improve GC
§ Long term: customized shuffler for shuffle data, spilling data to Alluxio IMDA
[Figure: batch, streaming, interactive, machine learning, and graph analytics workloads in a provisioned compute pool; an in-memory data accelerator for ephemeral data built on DRAM, persistent memory, and NVMe SSD; an RDMA-enabled network; a shared data lake with s3a object storage]
Alluxio IMDA system configuration
5x compute node
Hardware:
• Intel® Xeon™ processor Gold 6140 @ 2.3GHz, 384GB memory
• 1x 82599 10Gb NIC
• 5x P4500 SSD (2 for spark-shuffle)
Software:
• Hadoop 2.8.1
• Spark 2.2.0
• Hive 2.2.1
• RHEL7.3
Alluxio acceleration layer
• 200GB Mem for mem mode
Software:
• Alluxio 1.7.0
5x storage node
Hardware:
• Intel(R) Xeon(R) CPU Gold 6140 @ 2.30GHz, 192GB memory
• 2x 82599 10Gb NIC
• 7x 1TB HDD for Ceph bluestore or HDFS namenode and datanode
Software:
• Hadoop 2.8.1
• Ceph 12.2.7
• RHEL7.3
[Figure: five compute nodes, each running Hadoop, Hive, Spark, and Alluxio (one also hosting DNS), connected over 1x10Gb NIC to five storage nodes running Ceph (OSD and RGW, with MON on one node) or remote HDFS (DN, with NN on one node)]
Alluxio IMDA Performance
Using Alluxio IMDA as cache:
• For Terasort, 3.4x speedup over S3 object storage and 1.36x speedup over local HDFS.
• For the TPC-DS tests, up to 1.56x speedup for I/O-intensive queries, slightly lower than local HDFS.
• For the KMeans test, 1.62x speedup over S3 object storage, 14% lower than local HDFS.
• KMeans is a CPU-intensive workload
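The speedups quoted above follow from the normalized values in this slide's chart (local HDFS = 1.00). A small sketch of the arithmetic, with the chart values copied into a dictionary whose key names are chosen here for illustration:

```python
# Normalized performance from the chart (local HDFS = 1.00, higher is better).
results = {
    "Batch query":  {"s3": 0.70, "alluxio_s3": 0.96},
    "IO intensive": {"s3": 0.62, "alluxio_s3": 0.97},
    "Terasort 1T":  {"s3": 0.40, "alluxio_s3": 1.36},
    "KMeans 374g":  {"s3": 0.53, "alluxio_s3": 0.86},
}


def speedup_over_s3(row):
    """How many times faster the Alluxio + S3 run is than plain S3."""
    return row["alluxio_s3"] / row["s3"]


def gap_to_local_hdfs(row):
    """Fraction by which the Alluxio run trails local HDFS (negative means faster)."""
    return 1.0 - row["alluxio_s3"]


for workload, row in results.items():
    print(f"{workload:13s} speedup over S3: {speedup_over_s3(row):.2f}x, "
          f"gap to local HDFS: {gap_to_local_hdfs(row):+.0%}")
```

For example, Terasort gives 1.36 / 0.40 = 3.4x over S3, and KMeans trails local HDFS by 1.00 - 0.86 = 14%.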
Alluxio acceleration of disaggregated analytics storage (normalized to spark(yarn) + local HDFS (HDD); higher is better)
• Batch query (54 queries): local HDFS (HDD) 1.00, S3 (HDD) 0.70, Alluxio (MEM) + S3 (HDD) 0.96
• I/O intensive (7 queries): local HDFS (HDD) 1.00, S3 (HDD) 0.62, Alluxio (MEM) + S3 (HDD) 0.97
• Terasort 1T: local HDFS (HDD) 1.00, S3 (HDD) 0.40, Alluxio (MEM) + S3 (HDD) 1.36
• KMeans 374g: local HDFS (HDD) 1.00, S3 (HDD) 0.53, Alluxio (MEM) + S3 (HDD) 0.86
Using Alluxio IMDA as a cache improves performance on I/O-intensive workloads
Alluxio DCPMM tier architecture
Alluxio PMEM tier
• A new PMEM tier is introduced to provide higher performance at lower cost
• Large capacity -> cache more data
• Higher performance than NVMe SSD
• Leverages the PMDK library to bypass filesystem overhead and context switches
• Delivers a dedicated SLA to mission-critical applications
[Figure: applications use the Alluxio client to reach the Alluxio master and Alluxio workers; the worker storage tiers are DRAM, DCPMM, and SSD, above HDD under storage]
New PMEM tier for Alluxio
Two modes to support different usage scenarios
• SoAD mode
• No code changes.
• Bypasses the page cache
• PMDK AD mode
• Bypasses the page cache, with no context switches
• Better cache load performance.
[Figure: three client/worker I/O paths to DCPMM. POSIX filesystem: context switches and page cache. Storage over App Direct (SoAD) mode: POSIX access through a DAX filesystem, with context switches. PMDK-based AD mode: the worker goes through JNI and PMDK to do userspace load/store on a memory-mapped file.]
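The load/store access pattern behind the memory-mapped path can be sketched with Python's standard `mmap` module. This is only an illustration under loud assumptions: it maps an ordinary temporary file, so the page cache is still involved here; on real DCPMM the file would live on a DAX-mounted filesystem, and PMDK itself is not used in this sketch. The file name is hypothetical.

```python
import mmap
import os
import tempfile

CACHE_SIZE = 4096  # bytes reserved for the toy cache block

# Stand-in for a file on a DAX-mounted persistent memory filesystem (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "alluxio_block.bin")
with open(path, "wb") as f:
    f.truncate(CACHE_SIZE)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), CACHE_SIZE) as region:
        payload = b"hot data"
        region[0:len(payload)] = payload           # store: a plain memory write
        region.flush()                             # ask the OS to write the mapping back
        cached = bytes(region[0:len(payload)])     # load: a plain memory read

# Reopen through the normal file API to confirm the data reached the file.
with open(path, "rb") as f:
    on_disk = f.read(len(payload))

print(cached, on_disk)
```

Once the region is mapped, reads and writes are slice operations on memory rather than `read()`/`write()` system calls, which is the property the AD mode bullets above rely on.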
Use cases Alluxio enables
• Burst big data workloads in hybrid cloud environments (same instance / container)
• Accelerate big data frameworks on the public cloud (same instance / container)
• Dramatically speed up big data on object stores on premise (same container / machine)
[Figure: in each case Alluxio runs alongside Presto, Hive, or Spark]
Alluxio: key innovations
• Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data
• Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
• Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
• Hot: RAM
• Warm: SSD
• Cold: HDD
• Read & write buffering, transparent to the app
• Policies for pinning, promotion/demotion, TTL
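Promotion and demotion across tiers can be sketched with a small in-memory model. This is a toy illustration, not Alluxio's implementation: the `TieredCache` class, the least-recently-used demotion rule, and the capacities counted in blocks are all assumptions made here, and pinning and TTL are left out.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache: reads promote a block to the top tier, and
    overflow demotes the least recently used block one tier down."""

    def __init__(self, capacities):
        self.names = list(capacities)            # e.g. ["RAM", "SSD", "HDD"]
        self.capacities = dict(capacities)       # tier name -> max number of blocks
        self.tiers = {name: OrderedDict() for name in self.names}

    def _insert(self, level, key, value):
        if level == len(self.names):
            return                               # fell off the last tier: evicted
        name = self.names[level]
        tier = self.tiers[name]
        tier[key] = value                        # newest entry sits at the end
        if len(tier) > self.capacities[name]:
            old_key, old_value = tier.popitem(last=False)   # least recently used
            self._insert(level + 1, old_key, old_value)     # demotion

    def _remove(self, key):
        for tier in self.tiers.values():
            if key in tier:
                return tier.pop(key)
        return None

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        value = self._remove(key)
        if value is not None:
            self._insert(0, key, value)          # promotion to the top tier
        return value

    def tier_of(self, key):
        for name in self.names:
            if key in self.tiers[name]:
                return name
        return None


cache = TieredCache({"RAM": 2, "SSD": 2, "HDD": 2})
for block in ["a", "b", "c", "d", "e"]:
    cache.put(block, f"data-{block}")

tier_before = cache.tier_of("a")   # oldest block has been demoted twice
cache.get("a")                     # reading it promotes it again
tier_after = cache.tier_of("a")
print(tier_before, tier_after)     # prints "HDD RAM"
```

The application only calls `put` and `get`; which tier a block sits in is decided by the cache, which is the sense in which tiering stays transparent to the app.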
Data Accessibility via popular APIs and API Translation
Convert from client-side interface to native storage interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses mounting with transparent naming
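Mounting with transparent naming can be pictured as a mount table that maps paths in one namespace onto different under stores. The sketch below is a hypothetical model, not the Alluxio API: the `UnifiedNamespace` class, the bucket name, and the namenode address are invented for illustration.

```python
class UnifiedNamespace:
    """Toy mount table: maps paths in one namespace to under-store URIs."""

    def __init__(self):
        self.mounts = {}   # mount point -> under-store URI prefix

    def mount(self, mount_point, under_store_uri):
        self.mounts[mount_point.rstrip("/")] = under_store_uri.rstrip("/")

    def resolve(self, path):
        # The longest matching mount point wins, so nested mounts resolve correctly.
        for mount_point in sorted(self.mounts, key=len, reverse=True):
            if path == mount_point or path.startswith(mount_point + "/"):
                return self.mounts[mount_point] + path[len(mount_point):]
        raise KeyError(f"no under store mounted for {path}")


ns = UnifiedNamespace()
ns.mount("/data/warehouse", "s3a://example-bucket/warehouse")    # hypothetical bucket
ns.mount("/data/logs", "hdfs://example-namenode:8020/logs")      # hypothetical cluster

resolved = ns.resolve("/data/warehouse/sales/part-0.parquet")
print(resolved)
```

Applications keep using the same `/data/...` paths even if a dataset is later moved to a different under store; only the mount entry changes.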
Alluxio Reference Architecture
[Figure: applications use Alluxio clients to talk to the Alluxio master (with a standby master, coordinated through Zookeeper / RAFT) and to Alluxio workers backed by RAM / SSD / HDD; under store 1 and under store 2 are reached over the WAN]
Enterprises moving towards independent compute & storage
Learn more
Incredible open source momentum with a growing community
• 1000+ contributors & growing
• 4000+ Git stars
• Apache 2.0 licensed
• Hundreds of thousands of downloads
Join the conversation on Slack: alluxio.org/slack
Questions?
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | @alluxio
Call to action
• Stay tuned for further updates
• More details
• Speeding Big Data Analytics on the Cloud with an In-Memory Data Accelerator
• https://www.alluxio.io/blog/speeding-big-data-analytics-on-the-cloud-with-in-memory-data-accelerator/
Legal Information: Benchmark and Performance Disclaimers
Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.
Configurations: see performance benchmark test configurations.
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage

Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 


Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage

  • 1.
  • 2. Agenda
      • Background and motivation
      • Big data analytics on the cloud: challenges and optimizations
      • Accelerating big data analytics with Alluxio
      • Summary
  • 3. Challenges of scaling Hadoop* storage
      • Bounded storage and compute resources on Hadoop nodes bring challenges.
      • Typical challenges: data/capacity; upgrade cost; space, power, utilization; multiple storage silos; inadequate performance
      • Diagram labels: costs; provisioning and configuration; performance and efficiency; data capacity; silos
  • 4. Four big trends driving the need for a new architecture
      • Separation of compute and storage
      • Hybrid and multi-cloud environments
      • Self-service data across the enterprise
      • Rise of the object store
  • 5. Discontinuity in big data infrastructure leads to different solutions
      • Single large cluster: get a bigger cluster for many teams to share.
      • Multiple small clusters: give each team its own dedicated cluster, each with a copy of PBs of data.
      • On-demand analytic clusters: give teams the ability to spin up and spin down clusters that can share data sets.
  • 6.
  • 7. Storage disaggregation architecture
      • Replace HDFS with a shared data lake
      • Enables independent scaling of compute and storage
      • But does this architecture work?
      • Diagram: compute workloads (batch, streaming, interactive, machine learning, graph analytics) on top of a shared data lake (storage)
  • 8. Functionality challenges: troubleshooting configurations
      • Chart: 1 TB query success % (54 TPC-DS queries) for hive-parquet, spark-parquet, presto-parquet (untuned), presto-parquet (tuned)
      • Chart: 1 TB and 10 TB query success % (54 TPC-DS queries) for spark-parquet, spark-orc, presto-parquet, presto-parquet
      • Chart: count of issues by type: Ceph issue, compatibility issue, deployment issue, improper default configuration, middleware issue, runtime issue, S3A driver issue
      • Extensive tuning and troubleshooting were required to reach a 100% success ratio for the selected TPC-DS queries:
          - Improper default configuration
          - Wrong middleware configuration
          - Improper Hadoop/Spark configuration for different data sizes and formats
  • 9. Deployment architecture challenges: multiple choices
      • The deployment architecture depends on the detailed hardware configuration, the application, and cost requirements.
      • Architecture 1: dedicated load balancer
      • Architecture 2: round-robin DNS and dedicated gateway
      • Architecture 3: round-robin DNS, gateway co-located with the storage node
      • Architecture 4: fully disaggregated architecture, with multiple storage solutions on the disaggregated storage node
      • Architecture 5: Alluxio cache layer deployed on the compute node
  • 10. S3A Ceph cloud adaptor challenges: S3A tuning as an example
      • S3A performance is dramatically worse than HDFS.
      • Chart: remote HDFS vs. S3A on Ceph, stage 0 and stages > 0. Annotation: took 820 s with a bandwidth of 120 MB/s.
      • Disk I/O and network I/O data show that read bandwidth on Ceph is extremely low: about 100 MB/s vs. 3 GB/s on HDFS.
      • In our experience, Ceph is capable of driving disk bandwidth above 2 GB/s.
      • The S3A adaptor is the bottleneck.
      • Tuning applied:
          <property>
            <name>fs.s3a.readahead.range</name>
            <value>1024K</value>
          </property>
          <property>
            <name>fs.s3a.experimental.input.fadvise</name>
            <value>random</value>
          </property>
      • Raising the readahead range decreases the number of connections S3A opens.
      • Result: S3A improved 11.5x.
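The effect of `fs.s3a.readahead.range` described on slide 10 can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not S3A's actual implementation: the function `count_range_requests`, the access pattern, and the rule "a read is served from the open range if it fits entirely inside it" are all hypothetical simplifications. The 64 KiB baseline is assumed here to match the Hadoop default; the 1024 KiB value is the one from the slide.

```python
def count_range_requests(reads, readahead):
    """Count ranged GET requests needed for a list of (offset, length) reads.

    Simplified model (an assumption, not the real S3A logic): each GET
    fetches max(length, readahead) bytes starting at the read offset, and a
    later read reuses the open range only if it falls entirely inside it.
    """
    requests = 0
    range_start = range_end = -1
    for offset, length in reads:
        inside = range_start <= offset and offset + length <= range_end
        if not inside:
            requests += 1
            range_start = offset
            range_end = offset + max(length, readahead)
    return requests


KIB = 1024
# Hypothetical columnar-style pattern: 64 KiB reads, starting 256 KiB apart.
reads = [(i * 256 * KIB, 64 * KIB) for i in range(16)]

small = count_range_requests(reads, 64 * KIB)    # baseline readahead
large = count_range_requests(reads, 1024 * KIB)  # value from slide 10
print(small, large)
```

In this model the larger readahead lets several nearby reads share one request, which is the mechanism the slide credits for the reduced connection count. Real gains depend on the file format, the access pattern, and the object store.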
  • 11. Performance gaps: use cases
      • Diagram: a compute resource pool (ingest cluster, ETL transformation cluster, batch query cluster, interactive query cluster, machine learning cluster) on top of disaggregated storage
      • Simulating typical use cases:
          - Simple read/write
              § Terasort: a popular benchmark that measures the time needed to sort one terabyte of randomly distributed data on a given computer system.
          - Batch analytics (tests derived from TPC-DS*)
              § Consistently executes analytical processes over large data sets.
              § UC11: 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.
              § I/O-intensive queries: 9 I/O-intensive queries selected from TPC-DS.
          - K-means
              § K-means is one of the most commonly used clustering algorithms; it groups the data points into a predefined number of clusters.
  • 12. Performance gaps with storage disaggregation
      • Storage disaggregation leads to performance regression:
          - Up to 10% for remote HDFS; Terasort performance is higher because usable memory increased.
          - Up to 60% for S3 object storage (optimized results; tuning gave up to an 11.5x performance boost over the default parameters).
      • One important cause of the performance gap: S3A does not support transactional writes.
          - Most big data software (Spark, Hive) relies on HDFS's atomic rename feature to support atomic writes.
          - At job submission, a commit protocol specifies how results are written at the end of the job: tasks first stage their output in temporary locations, and the data is moved (renamed) to its final location only when the task or job completes.
          - S3A implements this with COPY + DELETE + HEAD + POST.
      • Although there are ongoing efforts to optimize the S3A adaptor, there is no near-term solution for the performance gap.
      • Chart: performance comparison of disaggregated analytics storage, Spark on YARN (higher is better)

          Workload                    Local HDFS (HDD)   Remote HDFS (HDD)   S3 (HDD)
          Batch query (54 queries)    1.0                0.9                 0.7
          I/O intensive (7 queries)   1.0                0.9                 0.6
          Terasort 1 TB               1.0                1.1                 0.4
          K-means 374 GB              1.0                0.9                 0.5

      • Need to close the performance gap!
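The commit-by-rename pattern on slide 12, and why it is expensive on an object store, can be sketched on a local file system. This is an illustration under assumptions rather than the Hadoop committer: the helper names (`write_task_output`, `commit_by_rename`, `commit_by_copy_delete`) and the `_temporary_` directory naming are hypothetical, and the object-store path is emulated with local file copies and counts only COPY and DELETE.

```python
import os
import shutil
import tempfile


def write_task_output(root, attempt, parts=4):
    """Stage task output in a temporary directory, as a commit protocol would."""
    tmp_dir = os.path.join(root, "_temporary_" + attempt)
    os.makedirs(tmp_dir)
    for i in range(parts):
        with open(os.path.join(tmp_dir, f"part-{i:05d}"), "w") as f:
            f.write("row\n" * 10)
    return tmp_dir


def commit_by_rename(tmp_dir, final_dir):
    """HDFS-style commit: a single metadata operation, independent of data size."""
    os.rename(tmp_dir, final_dir)
    return 1  # operations issued


def commit_by_copy_delete(tmp_dir, final_dir):
    """Emulated object-store 'rename': copy every object, then delete the source."""
    os.makedirs(final_dir)
    ops = 0
    for name in sorted(os.listdir(tmp_dir)):
        shutil.copyfile(os.path.join(tmp_dir, name), os.path.join(final_dir, name))
        os.remove(os.path.join(tmp_dir, name))
        ops += 2  # one COPY plus one DELETE per object
    os.rmdir(tmp_dir)
    return ops


root = tempfile.mkdtemp()
out_rename = os.path.join(root, "out_rename")
out_copy = os.path.join(root, "out_copy")
rename_ops = commit_by_rename(write_task_output(root, "a"), out_rename)
copy_ops = commit_by_copy_delete(write_task_output(root, "b"), out_copy)
print(rename_ops, copy_ops)
```

The rename path costs one operation regardless of output size, while the emulated object-store path grows with the number of objects (and, on a real store, with their bytes) and leaves a window in which the output is only partially visible.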
  • 13. Alluxio-based in-memory data accelerator (IMDA)
      • Replace HDFS with disaggregated S3 object storage: batch, streaming, interactive, machine learning, and graph analytics workloads run on a shared data lake with S3A object storage.
      • Alluxio-based in-memory acceleration layer: the same workloads run in a provisioned compute pool with an in-memory data accelerator, on top of the shared data lake with S3A object storage.
  • 14. Data orchestration for the cloud
      • Interfaces: Java File API, HDFS interface, S3 interface, REST API, POSIX interface
      • Drivers: HDFS driver, Swift driver, S3 driver, NFS driver
      • Independent scaling of compute and storage
  • 15. Persistent memory and RDMA
      • Persistent memory (PMEM):
          - A new class of memory and storage technology architected specifically for data center usage
          - Combines high capacity, affordability, and persistence
      • RDMA (Remote Direct Memory Access):
          - Accesses (reads from or writes to) memory on a remote machine without interrupting the CPUs on that system.
          - Zero-copy: applications transfer data without involving the network software stack; data is sent and received directly in application buffers without being copied between network layers.
          - Kernel bypass: applications transfer data directly from user space, with no context switches.
          - No CPU involvement: applications can access remote memory without consuming any CPU on the remote machine.
      • Picture source: https://software.intel.com/en-us/blogs/2018/10/30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture
  • 16. Persistent Memory Operations Mode
      • DCPMM characteristics:
          • 128, 256, 512 GB DIMM capacity
          • 2666 MT/sec speed
          • 3 TB (not including DRAM) capacity per CPU
          • DDR4 electrical & physical
          • Close to DRAM latency
          • Cache line size access
      • Operating modes: (1) Memory mode, (2) App Direct mode, and Storage over App Direct
      • Benefits listed:
          • Large memory at lower cost
          • Low latency persistent memory
          • Fast direct-attach storage
          • Persistent data for rapid recovery
      • Diagram labels: Cascade Lake, IMC, DDR4 DRAM*, DCPMM*, Non-Volatile Memory Pool; flexible, usage-specific partitions (AppDirect, Storage, Memory); DRAM, or DRAM as cache
      • * DIMM population shown as an example only.
  • 17. Leveraging an in-memory data accelerator to accelerate intermediate data access
      • Leverage new HW technologies & products that deliver significant performance improvements
          • Persistent memory, RDMA
      • Use an Alluxio-based in-memory data accelerator layer to accelerate ephemeral data access
          • Caching hot data in Alluxio shortens the I/O stack
          • Unifies the underlying filesystems
      • Fully leveraging these technologies and hardware to address the bottlenecks requires storage and network co-design
          • Optimized libraries to bypass the filesystem and avoid user space/kernel space context switches …
      • Diagram labels (software stack):
          • Applications: Batch, Streaming, Interactive, Machine Learning, Graph Analytics
          • Data Processing & Analysis: MR*, Storm*, Parquet*, Avro*, Spark Core, SQL*, Streaming*, MLlib*, GraphX*, DataFrame, ML Pipelines, SparkR*, Flink*, Giraph*
          • Resource Mgmt & Co-ordination: ZooKeeper*, YARN*
          • Acceleration Layer: Alluxio*
          • Disaggregated Storage: HBase*, Ceph*, HDFS*, OSS*
          • k8s*, High Speed Networking
  • 18. In-memory data accelerator (IMDA) architecture
      • Enable Alluxio with state-of-the-art HW technology
      • Alluxio as a lightweight, user-space I/O based distributed data store
          • For ephemeral data access such as cache, shuffle, and spill
          • Tiered storage: DRAM, persistent memory, and SSD
      • Persistent memory to enlarge compute storage with high performance and low cost
      • RDMA for kernel bypass and to avoid context switches
          • Persistent memory mmap address used as the RDMA buffer to avoid memory copies
          • Persistent memory used as off-heap memory to improve GC
      • Long term: customized shuffler for shuffle data; spill data to Alluxio IMDA
      • Diagram labels: Provisioned Compute Pool (Batch, Streaming, Interactive, Machine Learning, Graph Analytics); in-memory data accelerator for ephemeral data (DRAM, Persistent Memory, NVMe SSD); RDMA-enabled network; Shared Data Lake with s3a object storage
  • 19. Alluxio IMDA system configuration
      • 5x Compute Node
          • Hardware: Intel® Xeon™ processor Gold 6140 @ 2.3GHz, 384GB memory; 1x 82599 10Gb NIC; 5x P4500 SSD (2 for spark-shuffle)
          • Software: Hadoop 2.8.1, Spark 2.2.0, Hive 2.2.1, RHEL 7.3
      • 5x Storage Node
          • Hardware: Intel(R) Xeon(R) CPU Gold 6140 @ 2.30GHz, 192GB memory; 2x 82599 10Gb NIC; 7x 1TB HDD for Ceph bluestore or HDFS namenode and datanode
          • Software: Hadoop 2.8.1, Ceph 12.2.7, RHEL 7.3
      • Alluxio Acceleration Layer
          • 200GB memory for mem mode
          • Software: Alluxio 1.7.0
      • Diagram labels: compute nodes (Hadoop, Hive, Spark, Alluxio, DNS); storage nodes (Ceph MON, RGW, OSD; remote HDFS NN, DN); 1x 10Gb NIC
  • 20. Alluxio IMDA Performance
      • Using Alluxio IMDA as a cache:
          • For Terasort, 3.4x speedup over S3 object storage and 1.36x speedup over local HDFS
          • For the TPC-DS test, up to 1.56x speedup over S3 object storage for IO-intensive queries, slightly lower than local HDFS
          • For the KMeans test, 1.62x speedup over S3 object storage, 14% lower than local HDFS
          • KMeans is a CPU-intensive workload
      • Chart: Alluxio acceleration of disaggregated analytics storage (relative performance, local HDFS = 1.00; higher is better)
          • Spark (YARN) + local HDFS (HDD): Batch Query (54 queries) 1.00, IO Intensive (7 queries) 1.00, Terasort 1T 1.00, KMeans 374g 1.00
          • Spark (YARN) + S3 (HDD): Batch Query 0.70, IO Intensive 0.62, Terasort 1T 0.40, KMeans 374g 0.53
          • Spark (YARN) + Alluxio (MEM) + S3 (HDD): Batch Query 0.96, IO Intensive 0.97, Terasort 1T 1.36, KMeans 374g 0.86
      • Using Alluxio IMDA as a cache improves performance for IO-intensive workloads
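The speedups quoted on slide 20 can be recomputed from the chart values. The sketch below assumes the numbers on the slide are listed series by series (local HDFS, then S3, then Alluxio + S3), each in the workload order shown under the chart; that reading is an assumption, although it reproduces the figures in the bullets.

```python
# Relative performance read from the slide-20 chart (local HDFS = 1.00).
# The series/workload assignment is an assumption about the chart layout.
RESULTS = {
    "Batch Query (54 queries)": {"s3": 0.70, "alluxio_s3": 0.96},
    "IO Intensive (7 queries)": {"s3": 0.62, "alluxio_s3": 0.97},
    "Terasort 1T": {"s3": 0.40, "alluxio_s3": 1.36},
    "KMeans 374g": {"s3": 0.53, "alluxio_s3": 0.86},
}


def speedup_over_s3(workload: str) -> float:
    """Ratio of Alluxio + S3 performance to plain S3 performance."""
    row = RESULTS[workload]
    return row["alluxio_s3"] / row["s3"]


def change_vs_local_hdfs(workload: str) -> float:
    """Fractional difference from local HDFS (positive means faster)."""
    return RESULTS[workload]["alluxio_s3"] - 1.0


if __name__ == "__main__":
    for name in RESULTS:
        print(
            f"{name}: {speedup_over_s3(name):.2f}x over S3, "
            f"{change_vs_local_hdfs(name):+.0%} vs local HDFS"
        )
```

Under this reading, Terasort works out to 1.36 / 0.40 = 3.4x over S3, and KMeans to 0.86 / 0.53 ≈ 1.62x over S3 while remaining 14% below local HDFS.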
  • 21. Alluxio DCPMM tier architecture
      • Alluxio PMEM tier
          • A new PMEM tier layer introduced to provide higher performance at lower cost
          • Large capacity -> cache more data
          • Higher performance compared with NVMe SSD
          • Leverages the PMDK library to bypass filesystem overhead and context switches
          • Delivers a dedicated SLA to mission-critical applications
      • Diagram labels: Applications, Alluxio Client, Alluxio Master, Alluxio Worker (tiers: DRAM, DCPMM, SSD, HDD), Under Storage
  • 22. New PMEM tier for Alluxio
      • Two modes to support different usage scenarios
      • Storage over App Direct (SoAD) mode
          • No code changes
          • Bypasses the page cache
      • PMDK-based App Direct (AD) mode
          • Bypasses the page cache, with no context switches
          • Better cache load performance
      • Diagram labels: client, worker, POSIX filesystem, Pagecache, Context Switches, DAX filesystem, Memory mapped file, Userspace Load/Store, JNI, PMDK, DCPMM
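The difference between a POSIX read path and a memory-mapped load/store path on slide 22 can be illustrated with Python's standard `mmap` module. This is only an analogy on an ordinary temporary file: it involves no DCPMM, DAX filesystem, PMDK, or JNI, and the function names are made up for the example. It shows the access pattern (explicit read calls versus indexing into mapped memory), not the performance characteristics of persistent memory.

```python
import mmap
import os
import tempfile


def read_via_syscall(path: str, offset: int, length: int) -> bytes:
    """POSIX-style access: each seek/read goes through the kernel."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Mapped access: once mapped, data is read by plain memory indexing."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def write_via_mmap(path: str, offset: int, data: bytes) -> None:
    """Mapped store: assign into the mapping, then flush it to the file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:
            mapped[offset:offset + len(data)] = data
            mapped.flush()


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"block-0000|block-0001|block-0002")
        path = tmp.name
    try:
        write_via_mmap(path, 11, b"BLOCK")
        print(read_via_syscall(path, 11, 10))
        print(read_via_mmap(path, 11, 10))
    finally:
        os.remove(path)
```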
  • 23. Use Cases Alluxio Enables
      • Burst big data workloads in hybrid cloud environments (same instance / container)
      • Accelerate big data frameworks on the public cloud (same instance / container)
      • Dramatically speed up big data on object stores on premise (same container / machine)
      • Diagram labels: Alluxio co-located with Presto, Hive, or Spark
  • 24. Alluxio – Key innovations
      • Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute
      • Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, and ML workloads on your data located anywhere
      • Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data
  • 25. Data Locality with Intelligent Multi-tiering
      • Local performance from remote data using multi-tier storage
      • Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
      • Read & write buffering, transparent to the application
      • Policies for pinning, promotion/demotion, and TTL
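The policies named on slide 25 (pinning, promotion/demotion, TTL) can be sketched with a toy multi-tier cache. This is not Alluxio's implementation or API; the class name, the block-count capacities, and the least-recently-used demotion rule are assumptions chosen to keep the example small.

```python
import time
from collections import OrderedDict


class TieredCache:
    """Toy block cache: tier 0 is the fastest (e.g. RAM), the last the slowest."""

    def __init__(self, capacities, ttl_seconds=None, clock=time.monotonic):
        self.capacities = list(capacities)            # max blocks per tier
        self.tiers = [OrderedDict() for _ in self.capacities]
        self.ttl = ttl_seconds
        self.clock = clock
        self.pinned = set()

    def pin(self, key):
        """Pinned blocks are never chosen for demotion."""
        self.pinned.add(key)

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)
        expiry = None if self.ttl is None else self.clock() + self.ttl
        self._insert(0, key, (value, expiry))

    def get(self, key):
        for level, tier in enumerate(self.tiers):
            if key not in tier:
                continue
            value, expiry = tier[key]
            if expiry is not None and self.clock() >= expiry:
                del tier[key]                         # TTL expired
                return None
            if level > 0:
                del tier[key]
                self._insert(0, key, (value, expiry))  # promote hot block
            else:
                tier.move_to_end(key)                 # refresh recency
            return value
        return None

    def tier_of(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None

    def _insert(self, level, key, entry):
        tier = self.tiers[level]
        tier[key] = entry
        while len(tier) > self.capacities[level]:
            # Oldest unpinned block other than the one just inserted.
            victim = next(
                (k for k in tier if k not in self.pinned and k != key), None
            )
            if victim is None:
                break                                 # nothing evictable
            evicted = tier.pop(victim)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim, evicted)  # demote
            # Otherwise the block leaves the cache entirely.


if __name__ == "__main__":
    cache = TieredCache(capacities=[2, 2, 2])         # RAM, SSD, HDD
    for block in ("a", "b", "c"):
        cache.put(block, block.upper())
    print("tier of a after three puts:", cache.tier_of("a"))
    cache.get("a")
    print("tier of a after a read:", cache.tier_of("a"))
```

A read of a block in a lower tier moves it back to tier 0 and pushes the least recently used unpinned block down one tier, which is the "hot, warm, cold" behaviour the slide describes.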
  • 26. Data Accessibility via popular APIs and API Translation
      • Converts from the client-side interface to the native storage interface
      • Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
      • Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
  • 27. Data Elasticity via Unified Namespace
      • Enables effective data management across different Under Stores
      • Uses mounting with transparent naming
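Mounting with transparent naming, as described on slide 27, can be sketched as a mount table that maps paths in one namespace onto several under stores by longest-prefix match. This is an illustrative model, not Alluxio's API: the `MountTable` class and the host and bucket names in the example are hypothetical.

```python
class MountTable:
    """Toy unified namespace: logical path prefixes mapped to under-store URIs."""

    def __init__(self):
        self._mounts = {}  # logical prefix -> under-store base URI

    def mount(self, logical_path: str, under_store_uri: str) -> None:
        prefix = logical_path.rstrip("/") or "/"
        self._mounts[prefix] = under_store_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a logical path into the URI of the backing under store."""
        best = None
        for prefix in self._mounts:
            matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if matches and (best is None or len(prefix) > len(best)):
                best = prefix  # keep the longest matching mount point
        if best is None:
            raise KeyError(f"no mount point covers {path!r}")
        remainder = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{remainder}" if remainder else base


if __name__ == "__main__":
    table = MountTable()
    # Hypothetical under stores, named only for this example.
    table.mount("/", "hdfs://namenode:8020/warehouse")
    table.mount("/s3", "s3a://analytics-bucket/data")

    for logical in ("/tables/orders", "/s3/events/2019.parquet"):
        print(logical, "->", table.resolve(logical))
```

Applications address every file through the logical path, so moving a dataset between under stores only changes the mount table, not the application.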
  • 28. Alluxio Reference Architecture
      • Diagram labels: Application + Alluxio Client (multiple); Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Workers (RAM / SSD / HDD); WAN; Under Store 1, Under Store 2
  • 29. Enterprises moving towards independent compute & storage
  • 30. Incredible Open Source Momentum with growing community
      • 1000+ contributors & growing
      • 4000+ GitHub stars
      • Apache 2.0 licensed
      • Hundreds of thousands of downloads
      • Join the conversation on Slack: alluxio.org/slack
  • 31. Questions? Join the Alluxio Community www.alluxio.org | www.alluxio.com | @alluxio
  • 32. Call to action
      • Stay tuned for further updates
      • More details: Speeding Big Data Analytics on the Cloud with an In-Memory Data Accelerator
          • https://www.alluxio.io/blog/speeding-big-data-analytics-on-the-cloud-with-in-memory-data-accelerator/
  • 33. Legal Information: Benchmark and Performance Disclaimers
      Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure. Configurations: see performance benchmark test configurations.