FUTURE OF DATA PLATFORM IN CLOUD
NATIVE ERA
- Srivatsan Srinivasan
WHO AM I?
Chief Data Scientist at Cognizant
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
https://www.youtube.com/channel/UCwBs8TLOogwyGd0GxHCp-Dw
AIEngineering
Cloud Native Data Application
Edge AI/Analytics
Hybrid Cloud
Prescriptive Analytics (From what to why)
Augmented Analytics
BACKGROUND FOR THIS TALK
Is it really the End of the Hadoop Era?
• It did not live up to the performance needs of organizations
• It was not able to replace existing EDW infrastructure
• It is too hard to maintain, and even harder to make cloud ready
• Cloud killed Hadoop
Is it really the End of the Hadoop Era?
• People failed Hadoop: they did not know which use cases fit Hadoop best
• People were trying to solve technology problems rather than business problems
• The Hadoop architecture needs a refresh in today's world
• The underlying assumptions on which Hadoop was created a decade ago have not been relevant for years
• There is a better way of doing Hadoop on premise
CHALLENGES WITH BIG DATA
PLATFORM
CHALLENGE 1 – Separate Data and Application Infrastructure
Data Infrastructure Application Infrastructure
• Separate infrastructure management
• Separate DevOps/DataOps
• Inefficient use of infrastructure and specialized hardware accelerators
• Applications have to be rewritten when moving from one environment to another
CHALLENGE 2 – Difficult Dependency and Version Management
• Data scientists need access to the latest and greatest versions
• Interdependencies between multiple versions
• YARN does not provide an easy way to isolate dependencies
• Dependencies have to be packaged during spark-submit
• A different conda environment has to be created per project
CHALLENGE 3 – Portability to Hybrid Infrastructure
Pattern 1 – Build On Premise and Deploy on Cloud
On Premise: Application | Public Cloud: Application
Pattern 2 – Primary On Premise and DR on Cloud
On Premise (Primary): Application | Public Cloud (DR): Application | Failover
Pattern 3 – Cloud Bursting
On Premise (Primary Infra): Application | Public Cloud (Extended Infra): Application | Burst on demand
Pattern 4 – Placement based on Data Sensitivity and Data Gravity
On Premise (Sensitive Data): Application | Public Cloud (Non-sensitive Data): Application
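Patterns 3 and 4 can be read together as a placement rule. The following is a minimal Python sketch of that reading, not something taken from the talk: the `Workload` record, the `sensitive_data` flag, the job names and the capacity figure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cores: int
    sensitive_data: bool  # hypothetical flag: this data must stay on premise


def place(workload: Workload, free_on_prem_cores: int) -> str:
    """Return 'on-premise' or 'public-cloud' for a single workload."""
    # Pattern 4: sensitive data stays on premise regardless of capacity.
    if workload.sensitive_data:
        return "on-premise"
    # Pattern 3: burst to the cloud only when on-premise capacity runs out.
    if workload.cores > free_on_prem_cores:
        return "public-cloud"
    return "on-premise"


# Illustrative jobs only.
jobs = [
    Workload("fraud-scoring", cores=16, sensitive_data=True),
    Workload("clickstream-etl", cores=64, sensitive_data=False),
    Workload("daily-report", cores=4, sensitive_data=False),
]
placements = {job.name: place(job, free_on_prem_cores=32) for job in jobs}
print(placements)
```

A rule like this only works if the same application image runs unchanged in both places, which is the portability problem this challenge describes.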
CHALLENGE 4 – Reproducibility from development to production
CHALLENGES – Others
• Spark version upgrade: all tenants are impacted
• Deployment strategies such as Champion/Challenger are difficult to define
• Data locality: storage and compute have to scale linearly together
• All data has to be kept together
FUTURE OF DATA ARCHITECTURE
What Happened?
Moore's law for bandwidth happened, making data locality less important
Containers and Kubernetes happened, leaving YARN exclusive to a few data applications
Cloud storage happened, making Hadoop storage not so cheap by comparison (with caveats, though)
Apache Hadoop and its supporting distributed systems were built in a world where the underlying assumptions were different from today's
What happened today?
What do we really need?
• A common runtime layer across private and public cloud
• Abstraction of dependency and version conflicts
• Efficient usage of existing infrastructure
• Consistent tooling and CI/CD process across environments to increase efficiency
• No vendor lock-in, so workloads stay portable across vendors
• Handling of bursty workloads
• Less time to provision new environments, and the agility to test the latest offerings
Converged Infrastructure and Consistent Tooling
Data Applications Other Application
Kubernetes
Infrastructure
Operator Support for Data Application
Spark Operator
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Kafka Operator
https://www.confluent.io/confluent-operator/
https://github.com/strimzi/strimzi-kafka-operator
Flink Operator
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
Airflow Operator
https://github.com/GoogleCloudPlatform/airflow-operator
Step 1: Decouple compute and storage
S3, HDFS, GPFS, MapR-FS Spark
• Compute is not bound to storage, and existing enterprise data storage can still be used where it exists
• Assumes high network throughput
• Adds 2 to 6% latency, depending on the use case
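To put the 2 to 6% figure in concrete terms, here is a small arithmetic sketch in Python. It assumes the overhead applies to end-to-end job runtime, which the slide does not state, and the one-hour baseline is purely illustrative.

```python
def runtime_range(baseline_seconds: float, low: float = 0.02, high: float = 0.06):
    """Return (best, worst) runtime once the storage overhead is added.

    low and high are the 2% and 6% bounds quoted on the slide.
    """
    return baseline_seconds * (1 + low), baseline_seconds * (1 + high)


# Hypothetical job that takes one hour with data-local storage.
best, worst = runtime_range(3600)
print(f"{best:.0f}s to {worst:.0f}s")
```

Under that assumption a one-hour job grows by roughly one to four minutes, which is the cost being traded for independent scaling of compute and storage.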
Compute nodes can be sized to match compute needs, and storage can scale independently
Step 1: Decouple compute and storage
S3, GCS, Azure Blob Spark
Cloud Ready
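One practical consequence of decoupling is that a job refers to data by URI, so switching storage backends need not change compute code. A minimal Python sketch follows, assuming the URI schemes commonly used by the Hadoop connectors (`hdfs`, `s3a`, `gs`, `wasbs`); the helper name, bucket names and paths are hypothetical.

```python
# Assumed mapping from storage backend to the URI scheme Spark would be given.
SCHEMES = {
    "hdfs": "hdfs",
    "s3": "s3a",
    "gcs": "gs",
    "azure-blob": "wasbs",
}


def dataset_uri(backend: str, location: str, path: str) -> str:
    """Build the URI for a dataset; only this string changes between backends."""
    return f"{SCHEMES[backend]}://{location}/{path.lstrip('/')}"


# The same logical dataset on two different backends (illustrative names).
on_prem = dataset_uri("hdfs", "namenode:8020", "/events/2019")
cloud = dataset_uri("s3", "my-bucket", "/events/2019")
print(on_prem)
print(cloud)
```

The job itself would pass whichever URI applies to its read call, leaving the rest of the application untouched.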
Spark on Kubernetes – Native Support
spark-submit \
  --master k8s://<kubeserver>:<port> \
  --deploy-mode cluster \
  --name spark-tensorflow \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.container.image=pyspark-tf:v2.4.3 \
  --conf spark.kubernetes.namespace=user1 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.pyspark.pythonVersion=3 \
  local:///app/model/train/spark_tf.py
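When the same submission has to be issued from a pipeline rather than typed by hand, the command can be assembled programmatically. This is a minimal Python sketch using only the flags shown on the slide; the `build_spark_submit` helper and the API server address are hypothetical.

```python
import shlex


def build_spark_submit(master: str, name: str, app: str, conf: dict) -> list:
    """Assemble a cluster-mode spark-submit argument list."""
    args = ["spark-submit", "--master", master,
            "--deploy-mode", "cluster", "--name", name]
    # Each configuration entry becomes its own --conf key=value pair.
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


conf = {
    "spark.executor.instances": 4,
    "spark.kubernetes.container.image": "pyspark-tf:v2.4.3",
    "spark.kubernetes.namespace": "user1",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    "spark.kubernetes.pyspark.pythonVersion": 3,
}
cmd = build_spark_submit(
    master="k8s://https://kube.example.com:6443",  # hypothetical API server
    name="spark-tensorflow",
    app="local:///app/model/train/spark_tf.py",
    conf=conf,
)
print(shlex.join(cmd))  # shlex.join requires Python 3.8+
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine that has `spark-submit` on its path and access to the cluster.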
Spark on Kubernetes – Native Support
Source: Google Cloud
Kubernetes Operator
Automates deployment of applications
An Operator is a method of packaging, deploying and managing instances of complex stateful applications
It builds upon the basic Kubernetes resource and controller concepts but includes domain- or application-specific knowledge to automate common tasks
Spark Operator Stack
Infrastructure
Source: cern.ch
Spark Application Definition
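The definition shown on this slide did not survive extraction. As a rough stand-in, the Python sketch below builds a SparkApplication object equivalent to the earlier spark-submit example and prints it as JSON, which `kubectl apply -f` accepts. The `apiVersion`, the `sparkVersion` value and the driver and executor sizes are assumptions and should be checked against the operator version actually installed.

```python
import json

# Sketch of a SparkApplication custom resource for the Spark Operator.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",  # assumed CRD version
    "kind": "SparkApplication",
    "metadata": {"name": "spark-tensorflow", "namespace": "user1"},
    "spec": {
        "type": "Python",
        "pythonVersion": "3",
        "mode": "cluster",
        "image": "pyspark-tf:v2.4.3",
        "mainApplicationFile": "local:///app/model/train/spark_tf.py",
        "sparkVersion": "2.4.3",  # assumed from the image tag
        # Cores and memory are illustrative values, not from the talk.
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"instances": 4, "cores": 1, "memory": "2g"},
    },
}

manifest = json.dumps(spark_app, indent=2)
print(manifest)
```

Compared with the imperative spark-submit call, the same settings become a declarative object that the operator can watch and act on.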
Spark Operator
The Spark Operator controller watches for create/delete/update events of SparkApplication objects
The submission runner runs spark-submit for submissions received from the controller
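The division of labour between the controller and the submission runner can be illustrated with a toy, in-memory Python sketch. It is not the operator's actual code: the class names, the event format and the rule of resubmitting only when the definition changes are all assumptions made for illustration.

```python
class SubmissionRunner:
    """Stand-in for the component that would invoke spark-submit."""

    def __init__(self):
        self.submitted = []

    def submit(self, app: dict) -> None:
        # A real runner would launch spark-submit; here we only record the call.
        self.submitted.append(app["name"])


class Controller:
    """Reacts to create/update/delete events for SparkApplication objects."""

    def __init__(self, runner: SubmissionRunner):
        self.runner = runner
        self.apps = {}

    def handle(self, event_type: str, app: dict) -> None:
        name = app["name"]
        if event_type == "create":
            self.apps[name] = app
            self.runner.submit(app)
        elif event_type == "update":
            # Assumed policy: resubmit only when the definition changed.
            if self.apps.get(name) != app:
                self.apps[name] = app
                self.runner.submit(app)
        elif event_type == "delete":
            self.apps.pop(name, None)


runner = SubmissionRunner()
controller = Controller(runner)
events = [
    ("create", {"name": "job-a", "executors": 2}),
    ("update", {"name": "job-a", "executors": 4}),
    ("update", {"name": "job-a", "executors": 4}),  # unchanged, so ignored
    ("delete", {"name": "job-a", "executors": 4}),
]
for event_type, app in events:
    controller.handle(event_type, app)
print(runner.submitted, controller.apps)
```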
Spark Operator
The Spark Pod Monitor reports updates of pods to the controller
The Mutating Admission WebHook handles customization of Spark driver and executor pods
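A mutating admission webhook customizes a pod by answering the API server's AdmissionReview request with a JSON Patch. The Python sketch below shows the shape of such a response, using a node selector as an illustrative customization; the `mutate` function, the request contents and the label values are hypothetical and do not reproduce the Spark Operator's own webhook.

```python
import base64
import json


def mutate(admission_review: dict, node_selector: dict) -> dict:
    """Build an AdmissionReview response that adds a node selector to a pod."""
    request = admission_review["request"]
    # JSON Patch (RFC 6902) operation applied to the incoming pod object.
    patch = [{"op": "add", "path": "/spec/nodeSelector", "value": node_selector}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            # The API server expects the patch base64-encoded.
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


# Hypothetical incoming review for a Spark executor pod.
review = {"request": {"uid": "1234", "object": {"kind": "Pod", "spec": {}}}}
response = mutate(review, {"accelerator": "gpu"})
decoded = json.loads(base64.b64decode(response["response"]["patch"]))
print(decoded)
```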
IS IT PRIMETIME READY?
QUESTIONS?
