Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Open Source Data Orchestration for AI, Big Data, and Cloud

151 vues

Publié le

Beijing Meetup
Haoyuan Li

Publié dans : Technologie
  • Soyez le premier à commenter

  • Soyez le premier à aimer ceci

Open Source Data Orchestration for AI, Big Data, and Cloud

  1. 1. Open Source Data Orchestration for AI, Big Data, and Cloud Haoyuan (H.Y.) Li | Founder & Chairman & CTO | haoyuan@alluxio.com 2019-06-22 @ Beijing
  2. 2. Open Source Started From UC Berkeley AMPLab 1000+ contributors & growing 4000+ Git Stars Apache 2.0 Licensed GitHub’s Top 100 Most Valuable Repositories Out of 96 Million Join the conversation on Slack slackin.alluxio.io
  3. 3. Companies Using Alluxio - Read More (Including 8 of the Top 10 Internet Companies in China)
  4. 4. 4 big trends driving the need for a new architecture Separation of Compute & Storage Hybrid – Multi cloud environments Self-service data across the enterprise Rise of the object store
  5. 5. Data Ecosystem - Beta Data Ecosystem 1.0 COMPUTE STORAGE STORAGE COMPUTE
  6. 6. Co-located Data stack journey and innovation paths Co-located compute & HDFS on the same cluster Disaggregated compute & HDFS on the same cluster MR / Hive HDFS Hive HDFS Disaggregated Burst compute in the cloud, public or private, using on- premise (e.g. HDFS) data. Support Presto, Spark across DCs without app changes Enable & accelerate machine learning & analytics on object stores Transition to Public Cloud / Object store Enable Hybrid Cloud Support more frameworks § Typically compute-bound clusters over 100% capacity § Compute & I/O need to be scaled together even when not needed § Compute & I/O can be scaled independently but I/O still needed on HDFS which is expensive
  7. 7. Data Orchestration for the Cloud Java File API HDFS Interface S3 Interface REST APIPOSIX Interface HDFS Driver Swift Driver S3 Driver NFS Driver Independent scaling of compute & storage
  8. 8. § S3 performance is variable and consistent query SLAs are hard to achieve § S3 metadata operations are expensive making workloads run longer § S3 egress costs add up making the solution expensive § S3 is eventually consistent making it hard to predict query results Challenges with running Big Data & AI workloads on S3 & Alluxio Solution Compute caching for S3 Accelerate big data frameworks on the public cloud Same instance / container Alluxio Spark AlluxioAlluxio Spark Alluxio SparkSpark
  9. 9. Hive Alluxio Hive Alluxio Hive Alluxio § Accessing data over WAN too slow § Copying data to compute cloud time consuming and complex § Using another storage system like S3 means expensive application changes § Using S3 via HDFS connector leads to extremely low performance Challenges with Hybrid Cloud & Alluxio Solution HDFS for Hybrid Cloud Hive Alluxio Burst big data workloads in hybrid cloud environments Same instance / container 3 Solution Benefits § Same performance as local § Same end-user experience § 100% of I/O is offloaded
  10. 10. Challenges with supporting more frameworks & Alluxio Solution § Running new frameworks on existing an HDFS cluster can dramatically affect performance of existing workloads § In a disaggregate environment, copying data to multiple compute clouds time consuming and error prone § Migrating applications for new storage systems is complex & time consuming § Storing and managing multiple copies of the data becomes expensive Support more frameworks Any object store or HDFS Same data center / region Presto Enable big data on object stores across single or multiple clouds or Spark Alluxio Alluxio 4
  11. 11. Alluxio Presto Alluxio Presto Challenges running AI & Big Data on Object Stores & Alluxio Solution § Object stores performance for big data workloads can be very poor § No native support for popular frameworks § Expensive metadata operations reduce performance even more § No support for hybrid environments directly Transition to Object store Dramatically speed-up big data on object stores on premise Same container / machine or or 5 Solution Benefits § Same performance as HDFS § Uses HDFS APIs § Same end-user experience § Storage at fraction of the cost of HDFS Alluxio Presto Alluxio Presto
  12. 12. Advanced Use Cases Spark Alluxio Any Cloud / Multi Cloud Same data center / region Presto Enable big data on object stores across single or multiple clouds Standalone Spark Alluxio Orchestrate data frameworks on the public cloud Any public / private cloud or or PrestoHive
  13. 13. Data Elasticity with a unified namespace Abstract data silos & storage systems to independently scale data on-demand with compute Run Spark, Hive, Presto, ML workloads on your data located anywhere Accelerate big data workloads with transparent tiered local data Data Accessibility for popular APIs & API translation Data Locality with Intelligent Multi-tiering Alluxio – Key innovations
  14. 14. Data Locality with Intelligent Multi-tiering Local performance from remote data using multi-tier storage Hot Warm Cold RAM SSD HDD Read & Write Buffering Transparent to App Policies for pinning, promotion/demotion,TTL
  15. 15. Data Accessibility via popular APIs and API Translation Convert from Client-side Interface to native Storage Interface Java File API HDFS Interface S3 Interface REST APIPOSIX Interface HDFS Driver Swift DriverS3 Driver NFS Driver
  16. 16. Data Elasticity via Unified Namespace Enables effective data management across different Under Store - Uses Mounting withTransparent Naming
  17. 17. Unified Namespace: Global Data Accessibility Transparent access to understorage makes all enterprise data available locally SUPPORTS • HDFS • NFS • OpenStack • Ceph • Amazon S3 • Azure • Google Cloud IT OPS FRIENDLY • Storage mounted into Alluxio by central IT • Security in Alluxio mirrors source data • Authentication through LDAP/AD • Wireline encryption HDFS #1 Object Store NFS HDFS #2
  18. 18. APIs to Interact with data in Alluxio Spark Presto POSIX Java Application have great flexibility to read / write data with many options > rdd = sc.textFile(“alluxio://localhost:19998/myInput”) CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/') $ cat /mnt/alluxio/myInput FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
  19. 19. Alluxio MasterZookeeper / RAFT Standby Master WAN Alluxio Client Alluxio Client Alluxio Worker RAM / SSD / HDD Alluxio Worker RAM / SSD / HDD Alluxio Reference Architecture … … Application Application Under Store 1 Under Store 2
  20. 20. DATA ORCHESTRATION SPARK HDFS SPARK HDFS Public Cloud Public Cloud § Compute scales elastically independent of storage § Faster time to insights with seamless data orchestration § Accelerated workloads with memory-first data approach Two Sigma Fastest growing big hedge fund managing $50 billion for investors Use case | Cloud bursting on-premise data https://www.alluxio.io/resources/case-studies/two-sigma-case- study-cloud-bursting-with-spark-for-on-premise-hadoop/
  21. 21. Use case | Data orchestration for agility DATA ORCHESTRATION SPARK HDFS SPARK Kubernetes OBJECT HBASE ETLSPARK HDFS OBJECT HBASE § Single namespace to access & address all data § Data local to compute accelerates workloads China Unicom Leading ChineseTelco serving 320 million subscribers https://www.alluxio.io/blog/china-unicom-uses-alluxio-and- spark-to-build-new-computing-platform-to-serve-mobile-users/
  22. 22. Use Case | On-premise Caching for Presto HDFS § Large query variance during peak hours before § Alluxio brings data local to Presto to reduce the latency during peak hours NetEase Games Leading Online Game Company in China https://www.alluxio.io/blog/presto-on-alluxio-how-netease- games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/ Presto HDFS Presto Alluxio
  23. 23. Architecture: Colocate Alluxio with Presto • Black/Red line – Large Query variance without Alluxio • Green line - Stable query time with Alluxio
  24. 24. Questions? Welcome to join the Alluxio Open Source Community! www.alluxio.io | @alluxio | slackin.alluxio.io