SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Building a cloud
based managed BigData platform
Hemanth Yamijala
Lead Consultant - ThoughtWorks
yhemanth@thoughtworks.com
@yhemanth
Pillars in a BigData Solution
BigData

Infrastructure

Data

Process
A Managed Platform
BigData

Services
Infrastructure

Services
Data

Process
Reuse infrastructure
BigData
•Consolidate cluster
resources

•Saves capacity cost
•Saves operational
cost

•Enforce common
Infrastructure

access control and
security measures

Data

Process
Reuse Data
BigData
•Democratize data
assets

•Assist self service,
discoverability

•Conventions based
organization of data

•Enforce access
policies
Infrastructure

Data

Process
Reuse process
BigData

•Build common processing

frameworks or libraries
•Ingest and Extract can be
centralized services
•Frameworks can be
developed for ETL
processes, workflows, etc.

•Save time in building
Infrastructure

Data

analytical solutions

Process
Other Reasons
• Develop and leverage skill set of people
• Separating concerns of running
applications vs running infrastructure

• Evaluate and adopt new developments in
the space
Flavors of managed BigData platforms

•
•

Physical data centers
Private or Public clouds

•
•

Infrastructure Providers:

•

Amazon Web Services, Google Compute Engine,
Microsoft Azure, IBM, Open Stack, Rackspace

Platform Providers:

•
•

Qubole, Xurmo
In-House: Netflix, ...
Architectural Layers
Enterprise User Data / Workloads

User Data / Workloads

Enterprise Managed BigData
Services (E.g. Netflix Genie)

Managed BigData Services (E.g. EMR, Savanna, Redshift)

Cloud Storage (E.g. S3, Swift)

Virtualized Compute (E.g. EC2,
Nova)
Components in a managed platform
Presentation

Command Line Tools

API

Analytics Workbench

Data analytics

Data Catalog

Query

Aggregates

ETL

Platform

Ingest

FileSystem

Workflow

Provisioning

Scheduler

Job Management

Extract

Access Control

Eventing

Infrastructure

Redshift

Data

S3

EMR

Compute

IAM

Identity

SNS

Infrastructure
Elastic MapReduce - 101
•

Provision a Hadoop cluster of given size, using given type
of instances

•
•
•
•
•
•

Support for most of the ecosystem- Hive, Pig, HBase, etc.
Can scale up and down nodes for a cluster on demand
User submits ‘jobflows’ - a sequence of Hadoop jobs
Integrates with S3 as permanent store of data
Integrates with other Amazon services
Cost = Std. EC2 instance cost + extra + Std s3 ops etc.
Reasons for having Enterprise Tier on EMR

• Improve usability by providing better
abstractions, necessary automation

• Improve cost utilization by reusing
infrastructure

• Improve performance by providing system
level optimizations
Improving Usability
•

EMR API expects some
repetitive setup steps as
part of job submission. E.g.
Hive setup for all Hive jobs

•

Provide a service API with a
simpler interface that
automates the setup.
Improving Usability
{"steps": [
{
"stepActionOnFailure": "CONTINUE",
"stepName": "Setup Hive",
"stepArgs": [
"s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
"--base-path",
"s3://us-east-1.elasticmapreduce/libs/hive/",
"--install-hive",
"--hive-versions",
"latest"
],
"stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
},
{
"stepActionOnFailure": "CONTINUE",
"stepName": "Install Hive Site Configuration",
"stepArgs": [
"s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
"--base-path",
"s3://us-east-1.elasticmapreduce/libs/hive/",
"--install-hive-site",
"--hive-site=s3://com.x.y.z/security/configs/hive-site.xml",
"--hive-versions",
"latest"
],
"stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
},
{
"stepActionOnFailure": "TERMINATE_JOB_FLOW",
"stepName": "Run Hive Script",
"stepArgs": [
"s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
"--base-path",
"s3://us-east-1.elasticmapreduce/libs/hive/",

jobs: [{
"name": "hive-query",
"type": "hive",
"args": [
"-hiveconf",
"hive.cli.print.header=true"
],
"script": "select * from table;"
}
]
Improving Usability
•

Separate cluster management
from job management.

•

EMR expects users to
know the cluster sizes
when launching jobs

•

Have the system (or
administrators) launch
clusters on behalf of users

•

Users will either not
know how to launch
clusters, or will launch
incorrectly sized ones.

•

Have the system submit jobs
to appropriate clusters

•

Scale them according to the
needs of the jobs automatically or
administratively
Improve cost utilization

•

Different cluster types
in EMR: ephemeral
(default) and static

•

Ephemeral clusters can
be a huge cost drain Note: minimum charges
for a hour

•

Static clusters can also
waste money (if unused)

•

Go with a Hybrid model
Launch clusters on demand,
but maximize the cost to
utilization ratio - keep them
alive at least for an hour

•
•

•

Reuse them for other jobs
transparently

•

Shutdown if not used anymore
Saved $3000 in a month with
this strategy
Job Management System Design
Job
Management
Service

Job Executor

Resource
Estimator

Cluster
Manager
Job Management System Design
Manage provisioning,
monitoring and terminating
clusters. Matches job
requests to suitable clusters
based on policy

Job
Management
Service

Job Executor

Resource
Estimator

Cluster
Manager
Job Management System Design
Pool of clusters brought up
either on demand or predetermined, based on
requirements of resource
requirements, longevity, etc.

Job
Management
Service

Job Executor

Resource
Estimator

Cluster
Manager
Job Management System Design
Has knowledge of how to
convert a user jobflow to an
EMR jobflow. Also knows
how to submit jobflows to
clusters identified by cluster
manager

Job
Management
Service

Job Executor

Resource
Estimator

Cluster
Manager
Job Management System Design
Job
Management
Service

Monitors running jobs on
clusters using CloudWatch
(or similar system), and
determines whether to add /
delete more nodes to a
cluster

Job Executor

Resource
Estimator

Cluster
Manager
Job Management System Design
Job
Management
Service

Front-end service API for
users to submit their jobs.

Job Executor

Resource
Estimator

Cluster
Manager
Thank you!
http://www.thoughtworks.com/insights/bigdata-analytics

Contenu connexe

Tendances

PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...
PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...
PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...Sandy Winarko
 
BlueData EPIC 2.0 Overview
BlueData EPIC 2.0 OverviewBlueData EPIC 2.0 Overview
BlueData EPIC 2.0 OverviewBlueData, Inc.
 
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...Lviv Startup Club
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetHostedbyConfluent
 
StackVelocity Overview
StackVelocity OverviewStackVelocity Overview
StackVelocity OverviewStackVelocity
 
Modern data warehouse with Azure
Modern data warehouse with AzureModern data warehouse with Azure
Modern data warehouse with AzureNilesh Gule
 
Webinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCI
Webinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCIWebinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCI
Webinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCIStorage Switzerland
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
Tokyo azure meetup #2 big data made easy
Tokyo azure meetup #2   big data made easyTokyo azure meetup #2   big data made easy
Tokyo azure meetup #2 big data made easyTokyo Azure Meetup
 
Part 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure SynapsePart 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure SynapseNilesh Gule
 
When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)Nati Shalom
 
Choosing the right Cloud Database
Choosing the right Cloud DatabaseChoosing the right Cloud Database
Choosing the right Cloud DatabaseJanakiram MSV
 
Application Centric DevOps
Application Centric DevOpsApplication Centric DevOps
Application Centric DevOpsNati Shalom
 
6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)
6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)
6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)Lucas Jellema
 
IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...
IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...
IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...In-Memory Computing Summit
 
Coursera's Adoption of Cassandra
Coursera's Adoption of CassandraCoursera's Adoption of Cassandra
Coursera's Adoption of CassandraDataStax Academy
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesQubole
 
Unified Data Access with Gimel
Unified Data Access with GimelUnified Data Access with Gimel
Unified Data Access with GimelAlluxio, Inc.
 

Tendances (20)

R in Power BI
R in Power BIR in Power BI
R in Power BI
 
PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...
PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...
PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst...
 
BlueData EPIC 2.0 Overview
BlueData EPIC 2.0 OverviewBlueData EPIC 2.0 Overview
BlueData EPIC 2.0 Overview
 
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
StackVelocity Overview
StackVelocity OverviewStackVelocity Overview
StackVelocity Overview
 
Modern data warehouse with Azure
Modern data warehouse with AzureModern data warehouse with Azure
Modern data warehouse with Azure
 
Webinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCI
Webinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCIWebinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCI
Webinar: Simplifying the Enterprise Hybrid Cloud with Azure Stack HCI
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Tokyo azure meetup #2 big data made easy
Tokyo azure meetup #2   big data made easyTokyo azure meetup #2   big data made easy
Tokyo azure meetup #2 big data made easy
 
Part 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure SynapsePart 3 - Modern Data Warehouse with Azure Synapse
Part 3 - Modern Data Warehouse with Azure Synapse
 
When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)When networks meets apps (open stack atlanta)
When networks meets apps (open stack atlanta)
 
Choosing the right Cloud Database
Choosing the right Cloud DatabaseChoosing the right Cloud Database
Choosing the right Cloud Database
 
Application Centric DevOps
Application Centric DevOpsApplication Centric DevOps
Application Centric DevOps
 
6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)
6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)
6Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020)
 
IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...
IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...
IMC Summit 2016 Breakout - Nikita Ivanov - Shared In-Memory RDDs – Missing Li...
 
Avoiding cloud lock-in
Avoiding cloud lock-inAvoiding cloud lock-in
Avoiding cloud lock-in
 
Coursera's Adoption of Cassandra
Coursera's Adoption of CassandraCoursera's Adoption of Cassandra
Coursera's Adoption of Cassandra
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
 
Unified Data Access with Gimel
Unified Data Access with GimelUnified Data Access with Gimel
Unified Data Access with Gimel
 

Similaire à Building a cloud based managed BigData platform for the enterprise

IBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the CloudIBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the CloudAndrew Coleman
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineCsaba Toth
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseHao Chen
 
Building a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStackBuilding a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStackke4qqq
 
The art of .net deployment automation
The art of .net deployment automationThe art of .net deployment automation
The art of .net deployment automationMidVision
 
DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...
DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...
DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...Amazon Web Services
 
Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...
Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...
Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...Emerson Eduardo Rodrigues Von Staffen
 
WSO2 Quarterly Technical Update
WSO2 Quarterly Technical UpdateWSO2 Quarterly Technical Update
WSO2 Quarterly Technical UpdateWSO2
 
AWS as platform for scalable applications
AWS as platform for scalable applicationsAWS as platform for scalable applications
AWS as platform for scalable applicationsRoman Gomolko
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsIDERA Software
 
Analytics at the Speed of Thought: Actian Express Overview
Analytics at the Speed of Thought: Actian Express Overview Analytics at the Speed of Thought: Actian Express Overview
Analytics at the Speed of Thought: Actian Express Overview Actian Corporation
 
Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...
Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...
Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...Jason Himmelstein
 
Preparing for Data Residency and Custom Domains
Preparing for Data Residency and Custom DomainsPreparing for Data Residency and Custom Domains
Preparing for Data Residency and Custom DomainsAtlassian
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceWilfried Hoge
 
UTF-8'en'IBM_Cloud_SCO_Content_20130702c
UTF-8'en'IBM_Cloud_SCO_Content_20130702cUTF-8'en'IBM_Cloud_SCO_Content_20130702c
UTF-8'en'IBM_Cloud_SCO_Content_20130702cR.gowtham kumar
 
MySQL for Oracle DBAs
MySQL for Oracle DBAsMySQL for Oracle DBAs
MySQL for Oracle DBAsBen Krug
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 

Similaire à Building a cloud based managed BigData platform for the enterprise (20)

IBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the CloudIBM InterConnect 2015 - IIB in the Cloud
IBM InterConnect 2015 - IIB in the Cloud
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App Engine
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Building a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStackBuilding a Dev/Test Cloud with Apache CloudStack
Building a Dev/Test Cloud with Apache CloudStack
 
The art of .net deployment automation
The art of .net deployment automationThe art of .net deployment automation
The art of .net deployment automation
 
DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...
DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...
DevOps, Continuous Integration and Deployment on AWS: Putting Money Back into...
 
Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...
Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...
Devops continuousintegration and deployment onaws puttingmoneybackintoyourmis...
 
WSO2 Quarterly Technical Update
WSO2 Quarterly Technical UpdateWSO2 Quarterly Technical Update
WSO2 Quarterly Technical Update
 
AWS as platform for scalable applications
AWS as platform for scalable applicationsAWS as platform for scalable applications
AWS as platform for scalable applications
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure Environments
 
Analytics at the Speed of Thought: Actian Express Overview
Analytics at the Speed of Thought: Actian Express Overview Analytics at the Speed of Thought: Actian Express Overview
Analytics at the Speed of Thought: Actian Express Overview
 
Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...
Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...
Navigating the turbulence on takeoff: Setting up SharePoint on Azure IaaS the...
 
Preparing for Data Residency and Custom Domains
Preparing for Data Residency and Custom DomainsPreparing for Data Residency and Custom Domains
Preparing for Data Residency and Custom Domains
 
Scale By The Bay | 2020 | Gimel
Scale By The Bay | 2020 | GimelScale By The Bay | 2020 | Gimel
Scale By The Bay | 2020 | Gimel
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experience
 
UTF-8'en'IBM_Cloud_SCO_Content_20130702c
UTF-8'en'IBM_Cloud_SCO_Content_20130702cUTF-8'en'IBM_Cloud_SCO_Content_20130702c
UTF-8'en'IBM_Cloud_SCO_Content_20130702c
 
Azure full
Azure fullAzure full
Azure full
 
MySQL for Oracle DBAs
MySQL for Oracle DBAsMySQL for Oracle DBAs
MySQL for Oracle DBAs
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 

Dernier

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

Building a cloud based managed BigData platform for the enterprise

  • 1. Building a cloud based managed BigData platform Hemanth Yamijala Lead Consultant - ThoughtWorks yhemanth@thoughtworks.com @yhemanth
  • 2. Pillars in a BigData Solution BigData Infrastructure Data Process
  • 4. Reuse infrastructure BigData •Consolidate cluster resources •Saves capacity cost •Saves operational cost •Enforce common Infrastructure access control and security measures Data Process
  • 5. Reuse Data BigData •Democratize data assets •Assist self service, discoverability •Conventions based organization of data •Enforce access policies Infrastructure Data Process
  • 6. Reuse process BigData •Build common processing frameworks or libraries •Ingest and Extract can be centralized services •Frameworks can be developed for ETL processes, workflows, etc. •Save time in building Infrastructure Data analytical solutions Process
  • 7. Other Reasons • Develop and leverage skill set of people • Separating concerns of running applications vs running infrastructure • Evaluate and adopt new developments in the space
  • 8. Flavors of managed BigData platforms • • Physical data centers Private or Public clouds • • Infrastructure Providers: • Amazon Web Services, Google Compute Engine, Microsoft Azure, IBM, Open Stack, Rackspace Platform Providers: • • Qubole, Xurmo In-House: Netflix, ...
  • 9. Architectural Layers Enterprise User Data / Workloads User Data / Workloads Enterprise Managed BigData Services (E.g. Netflix Genie) Managed BigData Services (E.g. EMR, Savanna, Redshift) Cloud Storage (E.g. S3, Swift) Virtualized Compute (E.g. EC2, Nova)
  • 10. Components in a managed platform Presentation Command Line Tools API Analytics Workbench Data analytics Data Catalog Query Aggregates ETL Platform Ingest FileSystem Workflow Provisioning Scheduler Job Management Extract Access Control Eventing Infrastructure Redshift Data S3 EMR Compute IAM Identity SNS Infrastructure
  • 11. Elastic MapReduce - 101 • Provision a Hadoop cluster of given size, using given type of instances • • • • • • Support for most of the ecosystem- Hive, Pig, HBase, etc. Can scale up and down nodes for a cluster on demand User submits ‘jobflows’ - a sequence of Hadoop jobs Integrates with S3 as permanent store of data Integrates with other Amazon services Cost = Std. EC2 instance cost + extra + Std s3 ops etc.
  • 12. Reasons for having Enterprise Tier on EMR • Improve usability by providing better abstractions, necessary automation • Improve cost utilization by reusing infrastructure • Improve performance by providing system level optimizations
  • 13. Improving Usability • EMR API expects some repetitive setup steps as part of job submission. E.g. Hive setup for all Hive jobs • Provide a service API with a simpler interface that automates the setup.
  • 14. Improving Usability {"steps": [ { "stepActionOnFailure": "CONTINUE", "stepName": "Setup Hive", "stepArgs": [ "s3://us-east-1.elasticmapreduce/libs/hive/hive-script", "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/", "--install-hive", "--hive-versions", "latest" ], "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar" }, { "stepActionOnFailure": "CONTINUE", "stepName": "Install Hive Site Configuration", "stepArgs": [ "s3://us-east-1.elasticmapreduce/libs/hive/hive-script", "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/", "--install-hive-site", "--hive-site=s3://com.x.y.z/security/configs/hive-site.xml", "--hive-versions", "latest" ], "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar" }, { "stepActionOnFailure": "TERMINATE_JOB_FLOW", "stepName": "Run Hive Script", "stepArgs": [ "s3://us-east-1.elasticmapreduce/libs/hive/hive-script", "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/", jobs: [{ "name": "hive-query", "type": "hive", "args": [ "-hiveconf", "hive.cli.print.header=true" ], "script": "select * from table;" } ]
  • 15. Improving Usability • Separate cluster management from job management. • EMR expects users to know the cluster sizes when launching jobs • Have the system (or administrators) launch clusters on behalf of users • Users will either not know how to launch clusters, or will launch incorrectly sized ones. • Have the system submit jobs to appropriate clusters • Scale them according to the needs of the jobs automatically or administratively
  • 16. Improve cost utilization • Different cluster types in EMR: ephemeral (default) and static • Ephemeral clusters can be a huge cost drain Note: minimum charges for a hour • Static clusters can also waste money (if unused) • Go with a Hybrid model Launch clusters on demand, but maximize the cost to utilization ratio - keep them alive at least for an hour • • • Reuse them for other jobs transparently • Shutdown if not used anymore Saved $3000 in a month with this strategy
  • 17. Job Management System Design Job Management Service Job Executor Resource Estimator Cluster Manager
  • 18. Job Management System Design Manage provisioning, monitoring and terminating clusters. Matches job requests to suitable clusters based on policy Job Management Service Job Executor Resource Estimator Cluster Manager
  • 19. Job Management System Design Pool of clusters brought up either on demand or predetermined, based on requirements of resource requirements, longevity, etc. Job Management Service Job Executor Resource Estimator Cluster Manager
  • 20. Job Management System Design Has knowledge of how to convert a user jobflow to an EMR jobflow. Also knows how to submit jobflows to clusters identified by cluster manager Job Management Service Job Executor Resource Estimator Cluster Manager
  • 21. Job Management System Design Job Management Service Monitors running jobs on clusters using CloudWatch (or similar system), and determines whether to add / delete more nodes to a cluster Job Executor Resource Estimator Cluster Manager
  • 22. Job Management System Design Job Management Service Front-end service API for users to submit their jobs. Job Executor Resource Estimator Cluster Manager