Unified Big Data Analytics: Any Stack, Any Cloud
1. Unified Big Data Analytics: Any Stack, Any Cloud
2019/01/22 @ CloudHealth by VMware
Follow us | @alluxio
Download Alluxio | www.alluxio.org
Questions? | http://alluxio.org/slack
2. About Me
• Bin Fan
• PhD CS@CMU
• Founding Engineer@Alluxio
Email: binfan@alluxio.com
Github: apc999
Twitter: @binfan
3. Company Overview
• Founded Feb. 2015 – Haoyuan Li
• PhD research at UC Berkeley AMPLab
• Initially Tachyon Nexus
• Venture Backed
• Andreessen Horowitz, Seven Seas etc.
• Open Source Business Model
• Tachyon Open Sourced in Dec. 2012
• Open source v1.0 released Feb. 2016
• Commercial product released Oct. 2016
• Office in San Mateo, CA
• Team: Google, Palantir, VMware, AMD, Cisco…
6. Data Transformation
• Pressure in all industries to be “data driven”
• Majority of companies still figuring out the transformation
• Increased collection of numerous, low-value data
• Challenge of overcoming data silos to convert data into business value
• Limited success of data warehouses, marts, and lakes: cost of copying/moving data is substantial
• Single data plane for business value
7. Migration to Cloud
• Decoupling of compute and storage
• Enterprises move from turnkey solutions to self-managed data platforms on IaaS
• Lack of agility at the data storage level
• Requires a storage abstraction
8. Rise of Artificial Intelligence
• New workload targeting the same data used for Hadoop OLAP
• Potentially as valuable as, or more valuable than, existing OLAP workloads
• Challenge of adapting existing architectures to this new workload
10. The Data Access Layer
• Abstraction layer between applications and storage systems
• Present a stable storage interface to applications, including semantics, security, and performance
• Eliminate the weaknesses of data silos rather than the silos themselves
• Enable transparent migration of underlying storage systems
• Enable application API to storage API translation in a single layer
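The idea of a stable interface with swappable storage underneath can be sketched as a small mount table. This is an illustration only, not Alluxio code: `InMemoryStore`, `DataAccessLayer`, `mount`, and `migrate` are hypothetical names, and the "storage systems" are plain dictionaries.

```python
class InMemoryStore:
    """Hypothetical stand-in for a storage system such as HDFS or S3."""

    def __init__(self, name):
        self.name = name
        self.objects = {}  # key -> bytes

    def get(self, key):
        return self.objects[key]

    def put(self, key, data):
        self.objects[key] = data


class DataAccessLayer:
    """Stable application-facing API; storage systems are mounted under path prefixes."""

    def __init__(self):
        self.mounts = {}  # path prefix -> storage system

    def mount(self, prefix, store):
        self.mounts[prefix] = store

    def _resolve(self, path):
        # Longest matching prefix wins, so nested mounts behave predictably.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self.mounts[prefix], path[len(prefix):]
        raise FileNotFoundError(path)

    def read(self, path):
        store, key = self._resolve(path)
        return store.get(key)

    def write(self, path, data):
        store, key = self._resolve(path)
        store.put(key, data)

    def migrate(self, prefix, new_store):
        # Copy every object to the new system, then swap the mount.
        old_store = self.mounts[prefix]
        for key, data in old_store.objects.items():
            new_store.put(key, data)
        self.mounts[prefix] = new_store


layer = DataAccessLayer()
layer.mount("/logs/", InMemoryStore("on-prem"))
layer.write("/logs/day1", b"hello")
layer.migrate("/logs/", InMemoryStore("cloud"))
data_after_migration = layer.read("/logs/day1")  # same path, new storage system
```

The application keeps using the path `/logs/day1` before and after the migration; only the mount table changes.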
11. Data Access Layer
[Diagram: Data Access Layer, labeled with Security, Standard APIs, High Performance, Compatibility, Decoupling, Transparent Migration]
12. Alluxio
• Our implementation of the data access layer: a virtual distributed file system
• Open source project with over 900 contributors from hundreds of organizations worldwide
• Deployed in many top internet and financial companies
15. Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance in memory
• No lock-in
[Diagram: Alluxio, a Virtual Distributed File System (VDFS). Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface. Drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver]
17. Read Data not Cached in Alluxio + Caching
[Diagram: Application, Alluxio Client, Alluxio Worker (RAM / SSD / HDD), Under Store; numbered steps 1 to 4]
18. Read Cached Data in Alluxio
[Diagram: Application, Alluxio Client, Alluxio Worker (RAM / SSD / HDD); numbered steps 1 to 3]
19. Write data only to Alluxio
[Diagram: Application, Alluxio Client, Alluxio Worker (RAM / SSD / HDD); numbered steps 1 to 3]
20. Write to Alluxio and Under Store Synchronously
[Diagram: Application, Alluxio Client, Alluxio Worker (RAM / SSD / HDD), Under Store; numbered steps 1 to 3]
21. A Common File System Abstraction
• Common interface across apps
• HDFS-compatible interface: change hdfs://foo/bar to alluxio://foo/bar
• Other interfaces: Native Alluxio Java FS, POSIX and S3.
• Cloud storage becomes “hidden” to apps
• Less vendor lock-in!
[Diagram: Compute Zone (standalone or managed with Mesos or YARN) running TensorFlow, Presto, and MR over the HDFS API and POSIX API; Storage in Different Availability Zone (either on-prem or cloud)]
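The scheme change on this slide can be sketched as a small path-rewriting helper. This is a minimal illustration using only the Python standard library; `to_alluxio_uri` is a hypothetical name, and the helper keeps the authority and path unchanged exactly as in the slide's example. Which authority a real deployment would use is not stated here.

```python
from urllib.parse import urlsplit, urlunsplit


def to_alluxio_uri(uri):
    """Rewrite an hdfs:// URI to alluxio://, leaving other schemes untouched."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        return uri
    # Only the scheme changes; authority and path are kept as-is.
    return urlunsplit(parts._replace(scheme="alluxio"))


rewritten = to_alluxio_uri("hdfs://foo/bar")
```

Because only the URI changes, an application that already reads through an HDFS-compatible client needs no other code change in this sketch.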
22. Data Path: Improved I/O Performance
• A New Tier Above Cloud Storage for Compute
• Distributed buffer cache
• Restore locality to compute
• Read:
• Cache-hit read: served by Alluxio workers (local worker preferred)
• Cache-miss read: served by cloud storage, then cached on an Alluxio worker
• Write:
• Burst buffer, then async propagate to S3 (Alluxio 2.0)
• Challenges:
• Locality: expose location information to applications; serve local apps through ramdisk (rather than network)
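The cache-hit and cache-miss read paths above can be sketched as a read-through cache. This is a simplified illustration, not Alluxio's implementation: `UnderStore` and `CachingWorker` are hypothetical classes, storage is a dictionary, and eviction, tiering, and locality are left out.

```python
class UnderStore:
    """Hypothetical stand-in for cloud storage; counts how often it is read."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.objects[path]


class CachingWorker:
    """Serves reads from its cache, falling back to the under store on a miss."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.cache = {}  # path -> bytes

    def read(self, path):
        if path in self.cache:
            # Cache hit: served by the worker, no under-store access.
            return self.cache[path], "hit"
        # Cache miss: fetch from the under store, then cache for later reads.
        data = self.under_store.read(path)
        self.cache[path] = data
        return data, "miss"


store = UnderStore({"/table/part-0": b"rows"})
worker = CachingWorker(store)
first = worker.read("/table/part-0")
second = worker.read("/table/part-0")
```

In this sketch the first read goes to the under store and the second is served from the worker, so the under store is touched only once.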
23. Data Path: Async Persist to S3 (Alluxio 2.0)
[Diagram: Application, Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), Under Store]
• Async Writes
• Step 1: App writes to Alluxio
• Step 2: Alluxio writes to UFS
• Benefits
• Apps write at Alluxio speed
• Data gets persisted
• Challenges
• File rename/delete before persist: 2PC
• Fault tolerance: journal async requests
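The two steps and the journaling idea can be sketched as follows. This is a toy model under stated assumptions: `AsyncPersistSketch` is a hypothetical class, the journal is an in-memory list standing in for a durable log, the persist step is called explicitly rather than run in the background, and the 2PC handling of rename/delete is not modeled.

```python
from collections import deque


class AsyncPersistSketch:
    def __init__(self):
        self.alluxio = {}       # path -> bytes held in Alluxio (stand-in)
        self.under_store = {}   # path -> bytes persisted to the UFS (stand-in)
        self.journal = []       # stand-in for a durable log of persist requests
        self.pending = deque()  # in-memory work queue, lost on a crash

    def write(self, path, data):
        # Step 1: the app's write completes once the data is in Alluxio.
        self.alluxio[path] = data
        self.journal.append(path)  # record the request before acknowledging
        self.pending.append(path)

    def persist_pending(self):
        # Step 2: copy queued files to the UFS; returns how many were persisted.
        count = 0
        while self.pending:
            path = self.pending.popleft()
            self.under_store[path] = self.alluxio[path]
            self.journal.remove(path)  # request finished, drop it from the journal
            count += 1
        return count

    def recover(self):
        # After a restart, rebuild the queue from the journaled requests.
        self.pending = deque(self.journal)


fs = AsyncPersistSketch()
fs.write("/out/part-0", b"result")
persisted_before = dict(fs.under_store)  # nothing in the UFS yet
fs.pending.clear()                       # simulate losing the in-memory queue
fs.recover()
persisted_count = fs.persist_pending()
```

The write returns before anything reaches the UFS, and the journaled request is what lets the persist step resume after the simulated crash.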
24. Metadata Path: Familiar Semantics
• Listing / renaming on object store can be expensive
• Common operations for batch or SQL analytics
• Overwriting Put is eventually consistent
• Alluxio loads and manages metadata in master
• Apps can continue assuming HDFS-like semantics and performance implications
• Challenges
• Data modification bypassing Alluxio: when and how to re-sync
• Slow lists in object store: batch operations
• Too many objects: off-heap metadata (Alluxio 2.0)
25. Metadata Path: Efficient Renames
• Renaming files on S3 can be expensive
• Common operations for MR in commit phase
• Write results to tmp paths
• Rename tmp files to final paths (another copy, slow)
• Rename with Alluxio async writes
• t0: writes to tmp paths in Alluxio: near-compute, fast writes
• t1: rename tmp paths to final path in Alluxio: cheap renames
• t2: persist files in final paths in Alluxio to S3: 2PC to avoid partial data
• Speculative execution allowed
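The difference between the two commit paths can be sketched by counting how many bytes the object store has to copy. This is an illustration only: `ObjectStore`, `commit_directly`, and `commit_via_alluxio` are hypothetical names, the object store models rename as copy plus delete (S3 has no native rename), and the 2PC step at t2 is reduced to a single put.

```python
class ObjectStore:
    """Stand-in for S3: rename is a full copy followed by a delete."""

    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = data         # server-side copy of the whole object
        self.bytes_copied += len(data)
        del self.objects[src]


def commit_directly(store, tmp, final, data):
    # Job writes to a tmp path on the object store, then renames it.
    store.put(tmp, data)
    store.rename(tmp, final)


def commit_via_alluxio(store, tmp, final, data):
    namespace = {}                         # stand-in for files held in Alluxio
    namespace[tmp] = data                  # t0: near-compute write to tmp path
    namespace[final] = namespace.pop(tmp)  # t1: rename is a metadata change
    store.put(final, namespace[final])     # t2: persist once, under the final path


direct, via_alluxio = ObjectStore(), ObjectStore()
commit_directly(direct, "/tmp/part-0", "/out/part-0", b"x" * 1000)
commit_via_alluxio(via_alluxio, "/tmp/part-0", "/out/part-0", b"x" * 1000)
```

Both stores end up with only the final object, but the direct path copies the data a second time during the rename while the Alluxio-style path does not.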
27. Sandbox: TPCDS + Alluxio + S3
• Setup
• TPCDS reads from S3, vs
• TPCDS reads from Alluxio backed by S3
• To play with Sandbox, make a request:
https://www.alluxio.org/sandbox/request
28. Case Study:
- Digital marketing SaaS platform
- Hive metastore: Input files from S3
- ~100 TB data on S3
- Pain Point:
- Slow to list files
- Limited EC2<->S3 bandwidth
- No compute side caching
- No data locality
https://www.slideshare.net/ThaiBui7/hybrid-collaborative-tiered-storage-with-alluxio
35. Case Study:
- Leading Online Gaming Service Company (NASDAQ: NTES)
- Partner with Blizzard to operate “WoW” and “Hearthstone”
- Upcoming: “Diablo Immortal”
- Building Ad-hoc SQL Query Engine
- Large data volume: ~30 TB raw data daily
- A separate satellite compute cluster
- Pain Point:
- Response time requirement: < 15s
- Large startup latency when submitting SQL jobs as YARN apps
37. Result: Smoother Response During Peak Time
[Chart: response time (ms), Presto w/ Alluxio vs. Presto w/o Alluxio]
38. Conclusion
- Alluxio: A New Data Access Layer
- Between compute and storage
- Transparent to big data analytics (HDFS-compatible, POSIX)
- Improve data and metadata performance on cloud storage
- Architecture and Data Flow
- Master, Worker, Under Storage
- Cache-{hit, miss} reads, Sync/Async writes
- Use Cases on Alluxio