Architectural Considerations for Big
Data Workloads on OpenStack
OpenStack Summit, Barcelona
October 27, 2016
Jonathan Chiang - Cloud Architect, Comcast
James Saint-Rossy - Principal Engineer, Comcast
Introductions
James Saint-Rossy, Principal Engineer
Jonathan Chiang, Cloud Architect
Chris Power, MIA
Agenda
•What we do
•Comcast’s journey with OpenStack
•Big data use cases at Comcast
•Our application profiles
•Key Objectives for Modern Workloads
•Disaggregated vs Hyper-Converged
•Recommended Approaches for the different use cases
•HDFS and S3 working together
A Fortune 50 Company Uniquely Positioned at
the Intersection of Media and Technology
TV, Internet, Voice and Home
Cable Networks
Film
Broadcast Television
Theme Parks
Stretching the Comcast Elastic Cloud | Our
Journey with OpenStack
•Petabyte of Memory and One Million vCPU Cores in 2016
•Multi-Petabyte Ceph Block and Object Storage
•Multi-Terabyte SSD Block Storage
•Deployed across 34 Regions
• National and Regional Data Centers
•Icehouse Release Today, Moving Directly to Mitaka
Community Contributions
•Lines of code: 95,000
•Commits: 1200
•Core Developers and Reviewers on Multiple Projects
•Since Vancouver Summit (Kilo), Comcast has doubled its
upstream contributions
Big Data Use Cases at Comcast
Real-time
Telemetry Data
Streaming
Image Recognition
Statistical Data Analysis
Machine Learning
NoSQL Databases
Pulsar
Application Profile
• Designed for 100% sequential writes, with reads
served from the OS page cache
• Writes are relatively low IOPS, high throughput, large
block size, and sequential
• Reads from disk are intermittent, depending on whether
latent (lagging) consumers exist; when they occur, they
are typically random, small-block, high-IOPS reads
• Kafka is somewhat latency sensitive, but more tolerant
than, for example, a NoSQL database
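To make the write and read profiles above concrete: throughput is roughly IOPS multiplied by block size, so a large-block sequential writer can move far more data per second than a small-block random reader, even at a much lower IOPS rate. The following is a minimal sketch; the IOPS and block-size figures are illustrative assumptions, not measurements from the talk.

```python
def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Approximate throughput in MiB/s for a given IOPS rate and block size."""
    return iops * block_size_kib / 1024.0


# Illustrative numbers only (assumptions, not figures from the slides).
sequential_write = throughput_mib_s(iops=200, block_size_kib=1024)  # large blocks
random_read = throughput_mib_s(iops=5000, block_size_kib=4)         # small blocks

print(f"sequential write: {sequential_write:.1f} MiB/s")
print(f"random read:      {random_read:.1f} MiB/s")
```

With these assumed values, the writer needs 25 times fewer I/O operations per second yet moves roughly ten times more data, which is why the two access patterns stress storage so differently.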
Application Profile
• Internal cloud NoSQL database
• Medium/high IOPS, small block sizes, random reads
and writes
• Designed to support low latency read and write use
cases, therefore latency sensitive
• Read/write mix and block size are use-case dependent;
the typical observed distribution in a standard
key-value cluster is 70% reads / 30% writes
Pulsar
Application Profile
• HDFS Data Node – low IOPS, very large blocks,
sequential reads and writes, not extremely latency
sensitive
• YARN NodeManager Temp Space – medium IOPS,
higher throughput, tends towards more random write
patterns, slightly more sequential read patterns
• High Performance Admin Nodes (name nodes,
journal/zookeeper nodes): high IOPS, small block size,
random reads and writes; these nodes typically
perform better, and improve overall cluster performance,
when given high-performance storage
Key Objectives for Modern Workloads
• Performance
• Availability, Reliability, Resiliency
• Manageability, APIs, Integrations
• Workload Isolation
• Data Intensive Applications
Disaggregated vs Hyper-Converged ... FIGHT!!
"graffiti, Leake Street" by duncan c, CC BY-NC 2.0
Recommended Approach for Kafka
Divide and Conquer
• Use HDDs for Collectors
• Use SSDs for Aggregates
Recommended Approach for Pulsar
Disaggregated if:
• Storage can handle a high number of IOPS
• Storage meets the capacity requirement
• Network latency issues can be mitigated
Hyper-Converged if:
• Compute nodes have local SSDs/NVMe drives
• Local capacity is sufficient
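The two checklists above can be read as a simple decision rule. The sketch below is one hypothetical way to encode it; the `Environment` fields and the `recommend` function are illustrative names, and checking the disaggregated criteria first is an arbitrary assumption rather than something the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical description of a deployment; all field names are illustrative."""
    remote_storage_meets_iops: bool      # shared storage sustains the required IOPS
    remote_storage_meets_capacity: bool  # shared storage has the required capacity
    network_latency_mitigated: bool      # latency to shared storage is acceptable
    compute_has_local_flash: bool        # compute nodes have local SSD/NVMe drives
    local_capacity_sufficient: bool      # local drives are large enough


def recommend(env: Environment) -> str:
    """Return 'disaggregated', 'hyper-converged', or 'neither'."""
    # Assumption: the disaggregated criteria are evaluated first.
    if (env.remote_storage_meets_iops
            and env.remote_storage_meets_capacity
            and env.network_latency_mitigated):
        return "disaggregated"
    if env.compute_has_local_flash and env.local_capacity_sufficient:
        return "hyper-converged"
    return "neither"


# Example: shared storage cannot hide network latency, but nodes have local NVMe.
example = Environment(True, True, False, True, True)
print(recommend(example))
```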
Pulsar
HDFS Advantages (Hyper-Converged
Storage)
• Native to Hadoop
• Fast
• Data Locality
• Less Network Traffic
• Compatibility
• Large Files
S3 Advantages (Disaggregated)
• Scalability
• Durability
• Persistence
• Price
• Flexibility
Swift
RGW
HDFS and S3 Together
• S3
  • Data Ingest Storage
  • Results Storage
• HDFS
  • Transient Storage
  • Alternative storage formats (Parquet/ORC)
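The split above (S3 for ingest and results, HDFS for transient data) can be sketched as a small path-routing helper. This is only an illustration: the bucket name, the scratch directory, and the `stage_uri` function are hypothetical, and the `s3a://` scheme assumes the cluster uses Hadoop's S3A connector to reach the object store.

```python
# Hypothetical names: the bucket and the HDFS scratch directory are placeholders.
OBJECT_STORE_BUCKET = "example-bucket"
HDFS_SCRATCH_DIR = "/tmp/jobs"

# Stages kept in the S3-compatible object store versus in HDFS.
S3_STAGES = {"ingest", "results"}
HDFS_STAGES = {"transient"}


def stage_uri(stage: str, job: str) -> str:
    """Return the storage URI for one stage of a job."""
    if stage in S3_STAGES:
        # Durable data lives in the object store.
        return f"s3a://{OBJECT_STORE_BUCKET}/{stage}/{job}/"
    if stage in HDFS_STAGES:
        # Intermediate data stays close to compute, on the default HDFS namenode.
        return f"hdfs://{HDFS_SCRATCH_DIR}/{job}/"
    raise ValueError(f"unknown stage: {stage!r}")


for stage in ("ingest", "transient", "results"):
    print(stage, "->", stage_uri(stage, "job1"))
```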
HDFS and S3 Working Together
Stuff...
Credit: NASA
Testing and Validation
Approach
The test plans for each application platform are designed to represent typical use cases for those
applications and test their performance, latency, and storage capacity.
Hadoop Big Data Platform
• Benchmark Tools
• Application Testing
Kafka Stream Data Platform
• Use internally developed automation to deploy and test Kafka clusters.
• Test Configuration and Scenarios
• ZooKeepers
Operational Considerations
"Space Shuttle Endeavour's Control Panels" by Steve Jurvetson, CC BY 2.0
Operations and Support at Scale
• Noisy Neighbor - Which one?
• Where is the handoff between Ops and Engineering?
• Do you have DevOps?
• When things start to break
• Synthetic workloads
Recap
• Application Profiles
• Our solutions
• Storage Recommendations
• Operational Considerations
Fin!!
HDFS Implementation
• 3 replicas
• Ephemeral storage on compute node
• Nothing Fancy
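With 3 replicas, every block is stored three times, so usable HDFS capacity is roughly the raw ephemeral capacity divided by the replication factor (in HDFS this factor is the `dfs.replication` setting, whose default is 3). A minimal sketch follows; the node count and per-node disk size are illustrative assumptions, and the calculation ignores non-HDFS reserved space.

```python
def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Approximate usable capacity given raw capacity and replication factor."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return raw_tb / replication


# Illustrative cluster: 12 compute nodes with 8 TB of ephemeral disk each.
raw_tb = 12 * 8
print(f"raw: {raw_tb} TB, usable: {usable_capacity_tb(raw_tb):.1f} TB")
```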
"A wall of hard drives!" by Scott Schiller
CC BY 2.0
Network Considerations
• Does Hadoop know about your network?
• Keep the S3 implementation as close as possible
• Where is your data coming from?
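One way Hadoop learns about the network is rack awareness: a script configured through `net.topology.script.file.name` receives one or more node addresses as arguments and prints a rack path for each. The following is a minimal sketch of such a script; the subnet-to-rack mapping is entirely hypothetical, and hostnames are deliberately not resolved here.

```python
import ipaddress
import sys

# Hypothetical subnet-to-rack mapping; replace with the real network layout.
SUBNET_TO_RACK = {
    "10.0.1.0/24": "/dc1/rack1",
    "10.0.2.0/24": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host: str) -> str:
    """Map an IP address to a rack path, falling back to the default rack."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (for example a hostname); this sketch does not resolve it.
        return DEFAULT_RACK
    for subnet, rack in SUBNET_TO_RACK.items():
        if addr in ipaddress.ip_network(subnet):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # Print one rack path per argument, in the order the arguments were given.
    for arg in sys.argv[1:]:
        print(resolve_rack(arg))
```

With a mapping like this in place, HDFS can spread replicas across racks instead of treating every node as equally distant.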
Multiple Approaches to Infrastructure
Hyper-Converged Disaggregated
S3 Implementation
• Ceph??
  • Strong Consistency
  • Might already be there
  • Uses a proxy (RGW) between S3 and librados
• Swift??
  • Native performance
  • Under OpenStack's Big Tent
  • Focused on object storage
• AWS
  • No infrastructure setup
  • Reliability
  • Easy scaling and Capacity Planning
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Big data talk barcelona - jsr - jc

  • 1. Architectural Considerations for Big Data Workloads on OpenStack OpenStack Summit, Barcelona October 27, 2016 Jonathan Chiang - Cloud Architect, Comcast James Saint-Rossy - Principal Engineer, Comcast
  • 2. Introductions • James Saint-Rossy, Principal Engineer • Jonathan Chiang, Cloud Architect • Chris Power, MIA
  • 3. Agenda • What we do • Comcast’s journey with OpenStack • Big data use cases at Comcast • Our application profiles • Key Objectives of Modern workloads • Disaggregated vs Hyper-Converged • Recommended Approaches for the different use cases • HDFS and S3 working together
  • 4. A Fortune 50 Company Uniquely Positioned at the Intersection of Media and Technology • TV, Internet, Voice and Home • Cable Networks • Film • Broadcast Television • Theme Parks
  • 5. Stretching the Comcast Elastic Cloud | Our Journey with OpenStack • Petabyte of Memory and One Million vCPU Cores in 2016 • Multi-Petabyte Ceph Block and Object Storage • Multi-Terabyte SSD Block Storage • Deployed across 34 Regions • National and Regional Data Centers • Icehouse Release Today, Moving Directly to Mitaka
  • 6. Community Contributions • Lines of code: 95,000 • Commits: 1200 • Core Developers and Reviewers on Multiple Projects • Since Vancouver Summit (Kilo), Comcast has doubled its upstream contributions
  • 7. Big Data Use Cases at Comcast Real-time Telemetry Data Streaming Image Recognition Statistical Data Analysis Machine Learning NoSQL Databases Pulsar
  • 8. Application Profile • Designed to be 100% sequential writes, with reads served from OS page cache • Writes relatively low IOPS, high throughput, large block size, and sequential • Reads from disk can be intermittent depending on the existence of latent consumers, and when reads occur they are typically random small block high IOPS reads • Kafka is somewhat latency sensitive but more tolerant than a NoSQL database for example
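The contrast on this slide between low-IOPS, large-block writes and high-IOPS, small-block reads follows from the relation throughput = IOPS × block size. A minimal sketch of that arithmetic is shown below; the IOPS figures and block sizes are hypothetical values chosen only for illustration and are not measurements from the talk.

```python
def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Return throughput in MiB/s for a given IOPS rate and block size in KiB."""
    return iops * block_size_kib / 1024.0


# Hypothetical numbers, not taken from the slides.
# Few large sequential writes still move a lot of data:
sequential_writes = throughput_mib_s(iops=200, block_size_kib=1024)
# Many small random reads move comparatively little:
random_reads = throughput_mib_s(iops=5000, block_size_kib=4)

print(f"sequential writes: {sequential_writes:.1f} MiB/s")
print(f"random reads:      {random_reads:.1f} MiB/s")
```

With these assumed inputs, the write path is throughput-bound while the read path is IOPS-bound, which is why the two patterns stress storage differently.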
  • 9. Application Profile (Pulsar) • Internal cloud NoSQL database • Medium/high IOPS, small block sizes, random reads and writes • Designed to support low-latency read and write use cases, and therefore latency sensitive • The mix of reads and writes and the block size depend on the use case; the typical observed distribution in a standard key-value cluster is 70% reads / 30% writes
  • 10. Application Profile • HDFS DataNode – low IOPS, very large blocks, sequential reads and writes, not extremely latency sensitive • YARN NodeManager temp space – medium IOPS, higher throughput, tends towards more random write patterns and slightly more sequential read patterns • High-performance admin nodes (NameNodes, journal/ZooKeeper nodes) – high IOPS, small block size, random reads and writes; these nodes typically perform better, and improve overall cluster performance, when given high-performance storage
  • 11. Key Objectives for Modern Workloads • Performance • Availability, Reliability, Resiliency • Manageability, APIs, Integrations • Workload Isolation • Data Intensive Applications
  • 12. Disaggregated vs Hyper-Converged … FIGHT!! "graffiti, Leake Street" by duncan c CC BY-NC 2.0
  • 13.
  • 14. Recommended Approach for Kafka Divide and Conquer • Use HDDs for Collectors • Use SSDs for Aggregates
  • 15. Recommended Approach for Pulsar Disaggregated if: • Storage can handle a high number of IOPS • Storage meets the capacity requirement • Network latency issues can be mitigated Hyper-Converged if: • Compute has local SSDs/NVMe drives • There is enough capacity
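The two checklists on this slide can be read as a pair of feasibility tests. The sketch below encodes them as such; every class, field, and function name is hypothetical, and the numeric thresholds are placeholders rather than figures from the talk. It reports which options are viable and deliberately does not rank them, since the slide does not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Requirement:
    iops: int                # IOPS the database cluster needs
    capacity_tb: float       # capacity the cluster needs
    max_latency_ms: float    # latency budget for storage access


@dataclass
class SharedStorage:         # disaggregated backend (hypothetical model)
    max_iops: int
    capacity_tb: float
    network_latency_ms: float


@dataclass
class LocalStorage:          # local SSD/NVMe on the compute node (hypothetical model)
    capacity_tb: float


def viable_options(need: Requirement,
                   shared: Optional[SharedStorage],
                   local: Optional[LocalStorage]) -> List[str]:
    """Return the approaches whose slide criteria are all satisfied."""
    options = []
    if (shared is not None
            and shared.max_iops >= need.iops
            and shared.capacity_tb >= need.capacity_tb
            and shared.network_latency_ms <= need.max_latency_ms):
        options.append("disaggregated")
    if local is not None and local.capacity_tb >= need.capacity_tb:
        options.append("hyper-converged")
    return options


need = Requirement(iops=50_000, capacity_tb=10, max_latency_ms=2.0)
print(viable_options(need,
                     SharedStorage(max_iops=20_000, capacity_tb=500, network_latency_ms=1.0),
                     LocalStorage(capacity_tb=12)))
```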
  • 16. HDFS Advantages (Hyper-Converged Storage) • Native to Hadoop • Fast • Data Locality • Less Network Traffic • Compatibility • Large Files
  • 17. S3 Advantages (Disaggregated) • Scalability • Durability • Persistence • Price • Flexibility (Swift, RGW)
  • 18. HDFS and S3 Together • S3 • Data Ingest Storage • Results Storage • HDFS • Transient Storage • Alternative storage formats (Parquet/ORC)
  • 20.
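One way to picture the split on this slide is as a routing rule: ingest and results data live in the object store, while transient data stays on HDFS. The sketch below is only an illustration under that reading. The stage names, the bucket name `example-bucket`, and the HDFS root `/tmp/jobs` are hypothetical; `s3a://` and `hdfs://` are the standard Hadoop filesystem URI schemes.

```python
# Stages kept in the S3-compatible object store versus on HDFS,
# following the split described on the slide.
S3_STAGES = {"ingest", "results"}
HDFS_STAGES = {"transient"}


def storage_uri(stage: str, relative_path: str,
                bucket: str = "example-bucket",    # hypothetical bucket name
                hdfs_root: str = "/tmp/jobs") -> str:  # hypothetical HDFS directory
    """Build the URI a job would read from or write to for a given stage."""
    path = relative_path.lstrip("/")
    if stage in S3_STAGES:
        return f"s3a://{bucket}/{stage}/{path}"
    if stage in HDFS_STAGES:
        # An empty authority (hdfs:///...) means "use the default NameNode".
        return f"hdfs://{hdfs_root}/{path}"
    raise ValueError(f"unknown stage: {stage!r}")


print(storage_uri("ingest", "raw/2016-10-27.json"))
print(storage_uri("transient", "stage1/part-0000.parquet"))
print(storage_uri("results", "report.orc"))
```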
  • 21. Testing and Validation Approach The test plans for each application platform are designed to represent typical use cases for those applications and test their performance, latency, and storage capacity. Hadoop Big Data Platform • Benchmark Tools • Application Testing Kafka Stream Data Platform • Use internally developed automation to deploy and test Kafka clusters. • Test Configuration and Scenarios • ZooKeepers
  • 22. Operational Considerations "Space Shuttle Endeavour's Control Panels" by Steve Jurvetson CC BY 2.0
  • 23. Operations and Support at Scale • Noisy Neighbor - Which one? • Where is the handoff between Ops and Engineering? • Do you have DevOps? • When things start to break • Synthetic workloads
  • 24. Recap • Application Profiles • Our solutions • Storage Recommendations • Operational Considerations
  • 26. HDFS Implementation • 3 replicas • Ephemeral storage on compute node • Nothing Fancy
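With three replicas, each block is stored three times, so usable HDFS capacity is roughly the raw ephemeral capacity divided by the replication factor. A minimal sketch of that sizing arithmetic follows; the node count and per-node disk size are hypothetical, and the calculation ignores space reserved for non-HDFS use.

```python
def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Approximate usable HDFS capacity given raw capacity and replication factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication


# Hypothetical cluster: 30 compute nodes with 4 TB of ephemeral disk each.
raw = 30 * 4.0
print(f"raw: {raw:.0f} TB, usable at 3 replicas: {usable_capacity_tb(raw):.0f} TB")
```

In HDFS the replication factor is set through the `dfs.replication` property, whose default is 3.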
  • 27.
  • 28. "A wall of hard drives!" by Scott Schiller CC BY 2.0
  • 29. Network Considerations • Does Hadoop know about your network? • S3 implementation as close as possible • Where is your data coming from?
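Hadoop learns about the network through rack awareness: the `net.topology.script.file.name` property points at a script that receives host names or IP addresses as arguments and prints a rack path for each. The sketch below shows what such a script might look like. The subnets and rack names are hypothetical placeholders, not Comcast's topology, and a real deployment would load the mapping from its own inventory.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-awareness topology script (hypothetical mapping)."""
import ipaddress
import sys

# Hypothetical subnet-to-rack mapping.
RACKS = {
    ipaddress.ip_network("10.0.1.0/24"): "/dc1/rack1",
    ipaddress.ip_network("10.0.2.0/24"): "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(host: str) -> str:
    """Map one argument (an IP address) to a rack path."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP address (for example a host name): fall back to the default.
        return DEFAULT_RACK
    for network, rack in RACKS.items():
        if addr in network:
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # Hadoop passes one or more hosts; print one rack per host, in order.
    for arg in sys.argv[1:]:
        print(rack_for(arg))
```

Without a script like this, Hadoop places every node in the same default rack, so replica placement cannot take the physical network into account.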
  • 30. Multiple Approaches to Infrastructure Hyper-Converged Disaggregated
  • 31. S3 Implementation • Ceph?? • Strong Consistency • Might already be there • Uses a proxy between S3 and librados • Swift?? • Native performance • Under OpenStack's Big Tent • Focused on object storage • AWS • No infrastructure setup • Reliability • Easy scaling and Capacity Planning