Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Running Spark & Alluxio in Kubernetes

148 vues

Publié le

This was presented by Bin Fan and Adit Madan in the June community office hour.

For future events, check out: https://www.alluxio.io/events/

Publié dans : Logiciels
  • Soyez le premier à commenter

  • Soyez le premier à aimer ceci

Running Spark & Alluxio in Kubernetes

  1. 1. Running Spark & Alluxio in Kubernetes Bin Fan (binfan@alluxio.com) Adit Madan (adit@alluxio.com) 1
  2. 2. Agenda ● Alluxio Overview ● Kubernetes Overview ● Alluxio & Kubernetes ● Demo: ○ Spark + Alluxio on Kubernetes 2
  3. 3. Alluxio Overview 3
  4. 4. The Alluxio Story 4 Originated as Tachyon project, at UC Berkley AMPLab by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li.2013 2015 Open Source project established & company to commercialize Alluxio founded Goal: Orchestrate Data at Memory Speed for the Cloud for data driven apps such as Big Data Analytics, ML and AI. 2018 2018 2019
  5. 5. Fast-growing Open Source Community 5 4000+ Github Stars1000+ Contributors Join the community on Slack alluxio.io/slack Apache 2.0 Licensed Contribute to source code github.com/alluxio/alluxio
  6. 6. Data Ecosystem - Beta Data Ecosystem 1.0 COMPUTE STORAGE STORAGE COMPUTE 6
  7. 7. Co-located Data stack journey and innovation paths Co-located compute & HDFS on the same cluster Disaggregated compute & HDFS on the same cluster MR / Hive HDFS Hive HDFS Disaggregated Burst compute in the cloud, public or private, using on- premise (e.g. HDFS) data. Support Presto, Spark across DCs without app changes Enable & accelerate machine learning & analytics on object stores Transition to Public Cloud / Object store Enable Hybrid Cloud Support more frameworks ▪ Typically compute-bound clusters over 100% capacity ▪ Compute & I/O need to be scaled together even when not needed ▪ Compute & I/O can be scaled independently but I/O still needed on HDFS which is expensive 7
  8. 8. Data Orchestration for the Cloud Java File API HDFS Interface S3 Interface REST APIPOSIX Interface HDFS Driver GCS Driver S3 Driver Azure Driver Independent scaling of compute & storage 8
  9. 9. Flexible APIs to Interact with data in Alluxio Spark Presto POSIX Java > rdd = sc.textFile(“alluxio://localhost:19998/myInput”) CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/') $ cat /mnt/alluxio/myInput FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput")); 9
  10. 10. Alluxio MasterZookeeper / RAFT Standby Master Alluxio Worker Alluxio Worker Alluxio Reference Architecture … … Application Application Under Store 1 Under Store 2
  11. 11. Kubernetes Overview 11
  12. 12. Container Orchestration Platform ● Abstract Physical Infrastructure ● Service Discovery ● Self-healing ● Secret Management ● Storage Management 12
  13. 13. Key Kubernetes Concepts ● Containers ○ OS and application binaries ● Pods ○ Schedulable unit of one or more containers ● Controllers ○ Controls the desired state such as copies of a Pod ● Persistent Volumes ○ Storage provisioned by admin with lifecycle independent of a Pod 13
  14. 14. Alluxio on Kubernetes 14
  15. 15. Elastic Data for Elastic Compute with Tight Locality Host Alluxio Spark AlluxioAlluxio Spark Alluxio SparkSpark or ● Data locality ○ Big data analytics or ML ○ Cache data close to compute ● Enable high-speed data sharing across jobs ○ A closer staging storage layer for compute jobs ● Unification of persistent storage ○ Same data abstraction across different storage 15
  16. 16. Alluxio on Kubernetes Alluxio Master Alluxio Worker Alluxio Service /journal EBS Volume Persistent Volume Claim 16
  17. 17. Spark on Alluxio on Kubernetes Alluxio Master Spark Driver Alluxio Worker Spark Executor Alluxio Service /journal EBS Volume Persistent Volume Claim 17
  18. 18. Deploying Alluxio ● Provision a Persistent Volume for Journal $ kubectl create -f alluxio-journal-volume.yaml ● Specify Configuration $ kubectl create configmap alluxio-config --from-env- file=conf/alluxio.properties ● Deploy Master $ kubectl create -f alluxio-master.yaml ● Deploy Workers $ kubectl create -f alluxio-worker.yaml 18
  19. 19. Demo: Running Spark & Alluxio on Kubernetes 19
  20. 20. Summary ● An Overview of Data Orchestration ● Alluxio enables elastic data for elastic compute in Kubernetes ○ Data Locality on Demand, Data Abstraction & Unification, Data Sharing ● A guide to run Alluxio in Kubernetes environment ● A demo of running Spark on Alluxio in Kubernetes 20
  21. 21. Appendix 21
  22. 22. Kubernetes w/ Alluxio Components ● Containers: Master, Job Master, Worker, Job Worker ● Pods: Master, Worker ● Service: Master ● Daemon Sets: Workers ● Persistent Volumes: Master Journal, Worker Tiered Storage (optional) ● HostPath Volume: Worker Domain Socket Directory 22

×