SlideShare a Scribd company logo
1 of 16
Download to read offline
SuperVessel: Enabling Spark as a
Service with OpenStack and Docker
Guan Cheng (G.C.) Chen
IBM Research - China
weibo.com/parallellabs
SuperVessel Cloud
4/18/15 IBM	
  Research	
  -­‐	
  China 2  
•  Public	
  cloud	
  built	
  on	
  the	
  POWER7/POWER8	
  servers	
  with	
  OpenStack	
  
•  It	
  provides	
  free	
  access	
  for	
  students,	
  researchers,	
  developers	
  across	
  the	
  world,	
  and	
  
helps	
  grow	
  OpenPOWER	
  ecosystem	
  (used	
  in	
  30+	
  universiNes	
  now)	
  
•  It	
  provides	
  advanced	
  technology	
  services	
  such	
  as	
  Spark	
  as	
  a	
  Service,	
  Docker	
  Services,	
  
CogniNve	
  CompuNng	
  Service,	
  IoT	
  Service,	
  Accelerator	
  as	
  a	
  Service	
  (FPGA	
  and	
  GPU)	
  
Spark as a Service
4/18/15 IBM	
  Research	
  -­‐	
  China 3  
Step	
  1:	
  Login Step	
  2:	
  Create Step	
  3:	
  Ready!
3 steps to launch a Spark cluster, easy!
Try	
  it	
  at:	
  www.ptopenlab.com
Why OpenStack?
4/18/15 IBM	
  Research	
  -­‐	
  China 4  
•  Most	
  popular	
  IaaS	
  soXware	
  
•  Supports	
  Docker*	
  
•  Heat	
  can	
  orchestra	
  docker	
  containers	
  easily	
  –	
  good	
  for	
  provision	
  a	
  Spark	
  
cluster	
  
Picture	
  source:	
  h-p://www.qyjohn.net/?p=3801
Why Docker?
•  Less	
  resource	
  consumpNon	
  than	
  
KVM	
  
–  We	
  can	
  provision	
  more	
  Spark	
  
clusters!	
  
•  Boot	
  faster	
  than	
  KVM	
  
–  Users	
  like	
  fast	
  provision!	
  
•  Incrementally	
  build,	
  revert	
  and	
  
reuse	
  your	
  container	
  
–  We	
  love	
  Git	
  and	
  AUFS!	
  
•  However,	
  Docker	
  is	
  not	
  officially	
  
supported	
  in	
  OpenStack	
  yet	
  
–  Nova	
  Docker	
  is	
  an	
  external	
  
component	
  of	
  OpenStack	
  
•  Port	
  Docker	
  to	
  POWER	
  
Architecture	
  (ppc64	
  and	
  ppc64le)	
  
–  Ubuntu	
  15.04	
  includes	
  Docker	
  for	
  
POWER8	
  ppc64le
4/18/15 IBM	
  Research	
  -­‐	
  China 5  
Picture	
  source:	
  h-p://www.zdnet.com/ar@cle/what-­‐is-­‐docker-­‐and-­‐why-­‐is-­‐it-­‐so-­‐
darn-­‐popular/
Why Spark?
•  Fast	
  
•  Unified	
  
•  Ecosystem	
  
	
  
PorNng	
  to	
  POWER	
  
–  Bugfix	
  submieed	
  to	
  the	
  
community	
  
–  Spark	
  1.3	
  works	
  
smoothly!	
  J	
  
4/18/15 IBM	
  Research	
  -­‐	
  China 6  
Why not Sahara?
Sahara is a component for Hadoop/Spark as a Service in OpenStack
•  We	
  started	
  from	
  OpenStack	
  Icehouse	
  …	
  
•  DockerizaNon	
  
– Beeer	
  service	
  deployment	
  and	
  isolaNon	
  for	
  Big	
  
Data	
  dashboard	
  server	
  
•  CustomizaNon	
  
– WaiNng	
  for	
  Sahara’s	
  improvements	
  is	
  somehow	
  
*SLOW*	
  
•  Docker,	
  user	
  analyNcs,	
  Spark	
  1.4,	
  Spark	
  IDE,	
  scheduling,	
  
data	
  visualizaNon	
  etc.	
  
4/18/15 IBM	
  Research	
  -­‐	
  China 7  
4/18/15 IBM	
  Research	
  -­‐	
  China 8  
Architecture Design
Big Data
Dashboard
Keystone
Glance
Neutron
Heat Nova
Nova
Docker
Cinder/
Manila
Spark Cluster
Docker
Image
Spark Master
Spark Worker
Spark Worker
Container 1
Container 2
Container 3
NameNode
Spark Driver
DataNode
DataNode
Billing&Auth
1
2
2 3
Dockerize everything!
•  We	
  use	
  containers	
  to	
  	
  
– run	
  applica<ons	
  
– run	
  other	
  containers	
  
– run	
  OpenStack	
  python	
  daemons	
  
– run	
  OpenStack	
  services	
  inside	
  OpenStack	
  (with	
  a	
  
special	
  trunk	
  link	
  in	
  neutron/OVS)	
  
– run	
  OpenStack	
  inside	
  OpenStack	
  
•  Good	
  for	
  mulN-­‐site	
  expansion	
  
4/18/15 IBM	
  Research	
  -­‐	
  China 9  
Heat template design
Heat is a component for orchestration in OpenStack
•  Parameters	
  
– Cinder/Malina/Neutron	
  uuid	
  
– Size	
  of	
  Cinder/Malina	
  resources	
  
•  Resources	
  
– Master/slave	
  node	
  
– Neutron	
  	
  
– Cinder/Manila	
  
•  Need	
  to	
  modify	
  nova-­‐docker	
  to	
  mount	
  the	
  cinder/
manila	
  resources	
  when	
  booNng	
  the	
  docker	
  container	
  
4/18/15 IBM	
  Research	
  -­‐	
  China 10  
Spark Docker Image
•  Built	
  from	
  Ubuntu	
  14.04.1	
  
•  All	
  Spark	
  nodes	
  use	
  the	
  same	
  image,	
  with	
  
different	
  iniNalizaNon	
  scripts	
  by	
  using	
  cloudinit	
  
•  IniNalizaNon	
  scripts	
  will	
  
– Sync	
  /etc/hosts	
  across	
  all	
  nodes	
  
– Set	
  HDFS	
  and	
  Spark	
  configuraNons	
  accordingly	
  
– Format	
  HDFS	
  and	
  launch	
  HDFS	
  
– Launch	
  Spark
4/18/15 IBM	
  Research	
  -­‐	
  China 11  
Big Data Dashboard Development
•  2	
  Developers	
  (frond	
  end	
  +	
  backend)	
  
– Online	
  in	
  2	
  months	
  
– Separates	
  Heat	
  related	
  stuff	
  and	
  Dashboard	
  
– Reskul	
  API	
  (for	
  billing	
  and	
  authenNcaNon	
  etc)	
  
– Dockerize	
  the	
  big	
  data	
  dashboard	
  server	
  
– Separates	
  development	
  and	
  producNon	
  
environment
4/18/15 IBM	
  Research	
  -­‐	
  China 12  
Where should I put the data?
Shared File System for Cloud and Spark as a Service
4/18/15 IBM	
  Research	
  -­‐	
  China 13  
Docker
(Symphon
y)
Horizon
OpenStack controller
HEAT
NeutronGlance Manila
Nova
Cloud  Infrastructure  
Service  
Big  Data  Service  
•  Select	
  Big	
  data	
  compuNng	
  
framework	
  (Mapreduce,	
  
SPARK	
  
•  Select	
  cluster	
  size	
  
•  Select	
  data	
  folder	
  size
HEAT	
  template	
  for	
  big	
  data	
  cluster
Docker
(Symphon
y)
Docker
(Symphon
y)
Docker
(Symphon
y)
Docker
(Symphon
y)
Docker
(SPARK)
POWER7/POWER8
KVM/
Docker
(Web app)
Folder	
  A
User	
  BUser	
  A
Folder	
  B
User	
  A
•  HEAT	
  will	
  orchestrate	
  docker	
  instances,	
  subnet	
  and	
  data	
  folder	
  based	
  on	
  user’s	
  request	
  
•  Manila	
  provides	
  the	
  NFS	
  service	
  using	
  GPFS	
  as	
  backend,	
  and	
  the	
  folder	
  will	
  be	
  mounted	
  via	
  nova-­‐docker	
  (with	
  -­‐v	
  support)	
  
•  Folder	
  created	
  by	
  Manila	
  could	
  be	
  accessed	
  by	
  the	
  KVM/docker	
  instances	
  created	
  for	
  big	
  data	
  and	
  other	
  purpose	
  
GPFS	
  FPO
POWER7/POWER8
Servers
GPFS	
  FPO
Servers
GPFS	
  FPO
KeyStone
Cinder
SuperVessel Services Roadmap
4/18/15 IBM	
  Research	
  -­‐	
  China 14  
SuperVessel	
  Cloud	
  Infrastructure	
  
SuperVessel	
  
Cloud	
  	
  
Service
SuperVessel	
  	
  
Big	
  Data	
  and	
  HPC	
  
Service
Super	
  	
  
Class	
  
Service
OpenPOWER	
  
Enablement	
  
Service
Super	
  Project	
  
Team	
  
Service
Super	
  Marketplace
1.  VM	
  and	
  container	
  
service	
  
2.  Storage	
  service	
  
3.  Network	
  service	
  
4.  Accelerator	
  as	
  
service	
  
5.  Image	
  service	
  
1.  Big	
  Data:	
  
MapReduce	
  
(Symphony),	
  SPARK	
  
2.  Performance	
  tuning	
  
service
1.  X-­‐to-­‐P	
  migraNon:	
  
AutoPort	
  tool	
  
2.  OpenPOWER	
  new	
  
system	
  test	
  
service
1.  On-­‐line	
  video	
  
courses	
  
2.  Teacher	
  course	
  
management	
  
3.  User	
  contribuNon	
  
management	
  
1.  Project	
  
management	
  
service	
  
2.  DevOps	
  
automaNon
Storage IBM	
  POWER	
  servers OpenPOWER	
  server FPGA/GPU
Docker
(Online)
(Online) (Preparing)
(Online)
Summary
•  Spark	
  +	
  OpenStack	
  +	
  Docker	
  works	
  very	
  well	
  on	
  OpenPOWER	
  
servers	
  
•  Dockerized	
  services	
  made	
  DevOps	
  easier	
  
•  Docker	
  issues	
  
–  Zombie	
  process	
  
–  Can’t	
  dynamically	
  aeach	
  volume	
  
–  Commit,	
  aeach	
  operaNons	
  is	
  not	
  supported	
  on	
  Nova-­‐docker	
  
•  Monitoring	
  everything	
  (API,	
  resources	
  etc)	
  
–  key	
  for	
  operaNng	
  a	
  cloud	
  
•  TODO:	
  Spark	
  1.4,	
  Spark	
  IDE,	
  SWIFT,	
  Data	
  VisualizaNon	
  
4/18/15 IBM	
  Research	
  -­‐	
  China 15  
4/18/15 IBM	
  Research	
  -­‐	
  China 16  
Join Us!
www.ptopenlab.com
QQ	
  group:	
  SuperVessel
SuperVessel	
  
WeChat	
  group
@冠诚  
chengc@cn.ibm.com  

More Related Content

What's hot

Configuration Management Evolution at CERN
Configuration Management Evolution at CERNConfiguration Management Evolution at CERN
Configuration Management Evolution at CERNGavin McCance
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNDataWorks Summit
 
Openshift YARN - strata 2014
Openshift YARN - strata 2014Openshift YARN - strata 2014
Openshift YARN - strata 2014Hortonworks
 
Dok Talks #119 - Cloud-Native Data Pipelines
Dok Talks #119 - Cloud-Native Data PipelinesDok Talks #119 - Cloud-Native Data Pipelines
Dok Talks #119 - Cloud-Native Data PipelinesDoKC
 
Open cloud infrastructure built for the enterprise
Open cloud infrastructure built for the enterpriseOpen cloud infrastructure built for the enterprise
Open cloud infrastructure built for the enterpriseRedHatInc
 
Kubernetes - A Short Ride Throught the project and its ecosystem
Kubernetes - A Short Ride Throught the project and its ecosystemKubernetes - A Short Ride Throught the project and its ecosystem
Kubernetes - A Short Ride Throught the project and its ecosystemMaciej Kwiek
 
The road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceThe road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceSean Cohen
 
Project Solum - OpenStack's Native PaaS
Project Solum - OpenStack's Native PaaSProject Solum - OpenStack's Native PaaS
Project Solum - OpenStack's Native PaaSAlex Baretto
 
Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...
Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...
Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...OpenShift Origin
 
Apache Mesos Overview and Integration
Apache Mesos Overview and IntegrationApache Mesos Overview and Integration
Apache Mesos Overview and IntegrationAlex Baretto
 
"Using Automation Tools To Deploy And Operate Applications In Real World Scen...
"Using Automation Tools To Deploy And Operate Applications In Real World Scen..."Using Automation Tools To Deploy And Operate Applications In Real World Scen...
"Using Automation Tools To Deploy And Operate Applications In Real World Scen...ConSol Consulting & Solutions Software GmbH
 
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherOSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherNETWAYS
 
Mastering Terraform and the Provider for OCI
Mastering Terraform and the Provider for OCIMastering Terraform and the Provider for OCI
Mastering Terraform and the Provider for OCIGregory GUILLOU
 
Open shift 4 infra deep dive
Open shift 4    infra deep diveOpen shift 4    infra deep dive
Open shift 4 infra deep diveWinton Winton
 
Heat - keep the clouds up
Heat - keep the clouds upHeat - keep the clouds up
Heat - keep the clouds upKiran Murari
 
Red Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure TechRed Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure TechProxyServices
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Chris Fregly
 
Red Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom DirectorRed Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom DirectorOrgad Kimchi
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowBringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowDataWorks Summit
 

What's hot (20)

Configuration Management Evolution at CERN
Configuration Management Evolution at CERNConfiguration Management Evolution at CERN
Configuration Management Evolution at CERN
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
 
Openshift YARN - strata 2014
Openshift YARN - strata 2014Openshift YARN - strata 2014
Openshift YARN - strata 2014
 
Dok Talks #119 - Cloud-Native Data Pipelines
Dok Talks #119 - Cloud-Native Data PipelinesDok Talks #119 - Cloud-Native Data Pipelines
Dok Talks #119 - Cloud-Native Data Pipelines
 
Open cloud infrastructure built for the enterprise
Open cloud infrastructure built for the enterpriseOpen cloud infrastructure built for the enterprise
Open cloud infrastructure built for the enterprise
 
Kubernetes - A Short Ride Throught the project and its ecosystem
Kubernetes - A Short Ride Throught the project and its ecosystemKubernetes - A Short Ride Throught the project and its ecosystem
Kubernetes - A Short Ride Throught the project and its ecosystem
 
The road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceThe road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as service
 
Project Solum - OpenStack's Native PaaS
Project Solum - OpenStack's Native PaaSProject Solum - OpenStack's Native PaaS
Project Solum - OpenStack's Native PaaS
 
Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...
Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...
Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl...
 
Apache Mesos Overview and Integration
Apache Mesos Overview and IntegrationApache Mesos Overview and Integration
Apache Mesos Overview and Integration
 
"Using Automation Tools To Deploy And Operate Applications In Real World Scen...
"Using Automation Tools To Deploy And Operate Applications In Real World Scen..."Using Automation Tools To Deploy And Operate Applications In Real World Scen...
"Using Automation Tools To Deploy And Operate Applications In Real World Scen...
 
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherOSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
 
Mastering Terraform and the Provider for OCI
Mastering Terraform and the Provider for OCIMastering Terraform and the Provider for OCI
Mastering Terraform and the Provider for OCI
 
Open shift 4 infra deep dive
Open shift 4    infra deep diveOpen shift 4    infra deep dive
Open shift 4 infra deep dive
 
Heat - keep the clouds up
Heat - keep the clouds upHeat - keep the clouds up
Heat - keep the clouds up
 
Red Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure TechRed Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure Tech
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Ansible + Hadoop
Ansible + HadoopAnsible + Hadoop
Ansible + Hadoop
 
Red Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom DirectorRed Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom Director
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowBringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
 

Similar to Spark China Summit 2015 Guancheng Chen

PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupHolden Karau
 
Exploring a simpler, more portable, less overhead solution to deploy Elastics...
Exploring a simpler, more portable, less overhead solution to deploy Elastics...Exploring a simpler, more portable, less overhead solution to deploy Elastics...
Exploring a simpler, more portable, less overhead solution to deploy Elastics...LetsConnect
 
Pydata 2020 containers meetup
Pydata  2020 containers meetup Pydata  2020 containers meetup
Pydata 2020 containers meetup Walid Shaari
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfAnya Bida
 
CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017Stacy Véronneau
 
OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015Jesse Pretorius
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...NETWAYS
 
Big data analytics and docker the thrilla in manila
Big data analytics and docker  the thrilla in manilaBig data analytics and docker  the thrilla in manila
Big data analytics and docker the thrilla in manilaDean Hildebrand
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Openstack India May Meetup
Openstack India May MeetupOpenstack India May Meetup
Openstack India May MeetupDeepak Garg
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
Dockerizing OpenStack for High Availability
Dockerizing OpenStack for High AvailabilityDockerizing OpenStack for High Availability
Dockerizing OpenStack for High AvailabilityDaniel Krook
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Webinar- Tea for the Tillerman
Webinar- Tea for the TillermanWebinar- Tea for the Tillerman
Webinar- Tea for the TillermanCumulus Networks
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesYousun Jeong
 

Similar to Spark China Summit 2015 Guancheng Chen (20)

PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March Meetup
 
Exploring a simpler, more portable, less overhead solution to deploy Elastics...
Exploring a simpler, more portable, less overhead solution to deploy Elastics...Exploring a simpler, more portable, less overhead solution to deploy Elastics...
Exploring a simpler, more portable, less overhead solution to deploy Elastics...
 
Pydata 2020 containers meetup
Pydata  2020 containers meetup Pydata  2020 containers meetup
Pydata 2020 containers meetup
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
 
CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
 
Big data analytics and docker the thrilla in manila
Big data analytics and docker  the thrilla in manilaBig data analytics and docker  the thrilla in manila
Big data analytics and docker the thrilla in manila
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Openstack India May Meetup
Openstack India May MeetupOpenstack India May Meetup
Openstack India May Meetup
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
Dockerizing OpenStack for High Availability
Dockerizing OpenStack for High AvailabilityDockerizing OpenStack for High Availability
Dockerizing OpenStack for High Availability
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Webinar- Tea for the Tillerman
Webinar- Tea for the TillermanWebinar- Tea for the Tillerman
Webinar- Tea for the Tillerman
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on Kubernetes
 
Santhosh resume
Santhosh resumeSanthosh resume
Santhosh resume
 

Spark China Summit 2015 Guancheng Chen

  • 1. SuperVessel: Enabling Spark as a Service with OpenStack and Docker Guan Cheng (G.C.) Chen IBM Research - China weibo.com/parallellabs
  • 2. SuperVessel Cloud 4/18/15 IBM  Research  -­‐  China 2   •  Public  cloud  built  on  the  POWER7/POWER8  servers  with  OpenStack   •  It  provides  free  access  for  students,  researchers,  developers  across  the  world,  and   helps  grow  OpenPOWER  ecosystem  (used  in  30+  universiNes  now)   •  It  provides  advanced  technology  services  such  as  Spark  as  a  Service,  Docker  Services,   CogniNve  CompuNng  Service,  IoT  Service,  Accelerator  as  a  Service  (FPGA  and  GPU)  
  • 3. Spark as a Service 4/18/15 IBM  Research  -­‐  China 3   Step  1:  Login Step  2:  Create Step  3:  Ready! 3 steps to launch a Spark cluster, easy! Try  it  at:  www.ptopenlab.com
  • 4. Why OpenStack? 4/18/15 IBM  Research  -­‐  China 4   •  Most  popular  IaaS  soXware   •  Supports  Docker*   •  Heat  can  orchestra  docker  containers  easily  –  good  for  provision  a  Spark   cluster   Picture  source:  h-p://www.qyjohn.net/?p=3801
  • 5. Why Docker? •  Less  resource  consumpNon  than   KVM   –  We  can  provision  more  Spark   clusters!   •  Boot  faster  than  KVM   –  Users  like  fast  provision!   •  Incrementally  build,  revert  and   reuse  your  container   –  We  love  Git  and  AUFS!   •  However,  Docker  is  not  officially   supported  in  OpenStack  yet   –  Nova  Docker  is  an  external   component  of  OpenStack   •  Port  Docker  to  POWER   Architecture  (ppc64  and  ppc64le)   –  Ubuntu  15.04  includes  Docker  for   POWER8  ppc64le 4/18/15 IBM  Research  -­‐  China 5   Picture  source:  h-p://www.zdnet.com/ar@cle/what-­‐is-­‐docker-­‐and-­‐why-­‐is-­‐it-­‐so-­‐ darn-­‐popular/
  • 6. Why Spark? •  Fast   •  Unified   •  Ecosystem     PorNng  to  POWER   –  Bugfix  submieed  to  the   community   –  Spark  1.3  works   smoothly!  J   4/18/15 IBM  Research  -­‐  China 6  
  • 7. Why not Sahara? Sahara is a component for Hadoop/Spark as a Service in OpenStack •  We  started  from  OpenStack  Icehouse  …   •  DockerizaNon   – Beeer  service  deployment  and  isolaNon  for  Big   Data  dashboard  server   •  CustomizaNon   – WaiNng  for  Sahara’s  improvements  is  somehow   *SLOW*   •  Docker,  user  analyNcs,  Spark  1.4,  Spark  IDE,  scheduling,   data  visualizaNon  etc.   4/18/15 IBM  Research  -­‐  China 7  
  • 8. 4/18/15 IBM  Research  -­‐  China 8   Architecture Design Big Data Dashboard Keystone Glance Neutron Heat Nova Nova Docker Cinder/ Manila Spark Cluster Docker Image Spark Master Spark Worker Spark Worker Container 1 Container 2 Container 3 NameNode Spark Driver DataNode DataNode Billing&Auth 1 2 2 3
  • 9. Dockerize everything! •  We  use  containers  to     – run  applica<ons   – run  other  containers   – run  OpenStack  python  daemons   – run  OpenStack  services  inside  OpenStack  (with  a   special  trunk  link  in  neutron/OVS)   – run  OpenStack  inside  OpenStack   •  Good  for  mulN-­‐site  expansion   4/18/15 IBM  Research  -­‐  China 9  
  • 10. Heat template design Heat is a component for orchestration in OpenStack •  Parameters   – Cinder/Malina/Neutron  uuid   – Size  of  Cinder/Malina  resources   •  Resources   – Master/slave  node   – Neutron     – Cinder/Manila   •  Need  to  modify  nova-­‐docker  to  mount  the  cinder/ manila  resources  when  booNng  the  docker  container   4/18/15 IBM  Research  -­‐  China 10  
  • 11. Spark Docker Image •  Built  from  Ubuntu  14.04.1   •  All  Spark  nodes  use  the  same  image,  with   different  iniNalizaNon  scripts  by  using  cloudinit   •  IniNalizaNon  scripts  will   – Sync  /etc/hosts  across  all  nodes   – Set  HDFS  and  Spark  configuraNons  accordingly   – Format  HDFS  and  launch  HDFS   – Launch  Spark 4/18/15 IBM  Research  -­‐  China 11  
  • 12. Big Data Dashboard Development •  2  Developers  (frond  end  +  backend)   – Online  in  2  months   – Separates  Heat  related  stuff  and  Dashboard   – Reskul  API  (for  billing  and  authenNcaNon  etc)   – Dockerize  the  big  data  dashboard  server   – Separates  development  and  producNon   environment 4/18/15 IBM  Research  -­‐  China 12  
  • 13. Where should I put the data? Shared File System for Cloud and Spark as a Service 4/18/15 IBM  Research  -­‐  China 13   Docker (Symphon y) Horizon OpenStack controller HEAT NeutronGlance Manila Nova Cloud  Infrastructure   Service   Big  Data  Service   •  Select  Big  data  compuNng   framework  (Mapreduce,   SPARK   •  Select  cluster  size   •  Select  data  folder  size HEAT  template  for  big  data  cluster Docker (Symphon y) Docker (Symphon y) Docker (Symphon y) Docker (Symphon y) Docker (SPARK) POWER7/POWER8 KVM/ Docker (Web app) Folder  A User  BUser  A Folder  B User  A •  HEAT  will  orchestrate  docker  instances,  subnet  and  data  folder  based  on  user’s  request   •  Manila  provides  the  NFS  service  using  GPFS  as  backend,  and  the  folder  will  be  mounted  via  nova-­‐docker  (with  -­‐v  support)   •  Folder  created  by  Manila  could  be  accessed  by  the  KVM/docker  instances  created  for  big  data  and  other  purpose   GPFS  FPO POWER7/POWER8 Servers GPFS  FPO Servers GPFS  FPO KeyStone Cinder
  • 14. SuperVessel Services Roadmap 4/18/15 IBM  Research  -­‐  China 14   SuperVessel  Cloud  Infrastructure   SuperVessel   Cloud     Service SuperVessel     Big  Data  and  HPC   Service Super     Class   Service OpenPOWER   Enablement   Service Super  Project   Team   Service Super  Marketplace 1.  VM  and  container   service   2.  Storage  service   3.  Network  service   4.  Accelerator  as   service   5.  Image  service   1.  Big  Data:   MapReduce   (Symphony),  SPARK   2.  Performance  tuning   service 1.  X-­‐to-­‐P  migraNon:   AutoPort  tool   2.  OpenPOWER  new   system  test   service 1.  On-­‐line  video   courses   2.  Teacher  course   management   3.  User  contribuNon   management   1.  Project   management   service   2.  DevOps   automaNon Storage IBM  POWER  servers OpenPOWER  server FPGA/GPU Docker (Online) (Online) (Preparing) (Online)
  • 15. Summary •  Spark  +  OpenStack  +  Docker  works  very  well  on  OpenPOWER   servers   •  Dockerized  services  made  DevOps  easier   •  Docker  issues   –  Zombie  process   –  Can’t  dynamically  aeach  volume   –  Commit,  aeach  operaNons  is  not  supported  on  Nova-­‐docker   •  Monitoring  everything  (API,  resources  etc)   –  key  for  operaNng  a  cloud   •  TODO:  Spark  1.4,  Spark  IDE,  SWIFT,  Data  VisualizaNon   4/18/15 IBM  Research  -­‐  China 15  
  • 16. 4/18/15 IBM  Research  -­‐  China 16   Join Us! www.ptopenlab.com QQ  group:  SuperVessel SuperVessel   WeChat  group @冠诚   chengc@cn.ibm.com