1 © Hortonworks Inc. 2011–2018. All rights reserved.
Containers and Big Data
Sanjay Radia
Chief Architect, Founder Hortonworks
Apache Hadoop PMC
About the Speakers
Sanjay Radia
• Chief Architect, Founder, Hortonworks
• Apache Hadoop PMC and Committer
• Part of the original Hadoop team at Yahoo! since 2007
• Chief Architect of Hadoop Core at Yahoo!
• Prior
• Data center automation, virtualization, Java, HA, OSs, File Systems
• Startup, Sun Microsystems, INRIA…
• Ph.D., University of Waterloo
Agenda
• Our recent extensive experience with containers and Docker
• A quick peek into the future – containerized Big Data
• A detailed look at considerations in containerizing big data frameworks and applications
• Understanding Big Data eco-system
• General considerations
• Considerations for jobs
• Considerations for services
• Considerations for platforms
At Hortonworks, We Run Many, Many Tests...
Dozens of product releases a year
...over 30 open source projects
...across a dozen supported Linux operating systems
…and multiple backend databases
Result: Tens of thousands of tests per release
...on a Container Cloud Powered by Apache Hadoop YARN
Testing HDP and HDF releases in container clusters
[Diagram: Jenkins; YARN running Worker (Docker) and HDP (Docker) containers; HDFS]
That’s right! HDP running in Docker containers on YARN!
2 years and 7 million containers later...
Many real world lessons learned
Let’s Talk About Containers
Machines to VMs to Containers - The Pursuit of Agility, Better, Cheaper
Physical Machines
● Underutilized HW
● Too long to provision for new business ideas
● Expensive
● Inefficient
VMs
● Simplified IT
● Business agility
● Consolidated hardware
● Improved utilization
Containers
● More efficient than VMs
● Cheaper cost for IT
● Business agility
● Hybrid deployment value
● Continued consolidation
● What are the challenges?
*** Older & newer systems coexist, newer tech increasingly taking larger share
Containerization Is Gaining Momentum
• Industry adoption continues
• “Number of containerized applications will rise by 80% in the next two years” [1]
• Patterns emerging
• Multi-cloud and hybrid strategies
• Adoption of microservices
• Exponential ecosystem growth
• Dozens of container orchestrators
• Thousands of plugins
• Market moves
1. http://i.dell.com/sites/doccontent/business/solutions/whitepapers/en/Documents/Containers_Real_Adoption_2017_Dell_EMC_Forrester_Paper.pdf
Why Are Containers Gaining Popularity?
• Improved hardware utilization through increased density
• No virtual machine operating system overhead
• Image layer reuse limits data duplication on disk
• Strong resource isolation
• Namespaces and cgroups
• Better software packaging (Docker)
• Package applications and dependencies together
• Distribution mechanism
• Improved developer self service
• Agility & control over the execution environment
Container Architecture Patterns
• Mix of services and jobs
• Traditionally long-lived services
• But ephemeral jobs emerging
– Different scheduling needs
• Decoupled compute and storage
• Scale independently
• What does this mean for Hadoop?
• Hybrid deployments
• Desire for consistency between cloud and on-prem
• Cloud vendors are adopting Kubernetes
Let’s Talk Big Data
The Road to Big Data: The Pursuit of Faster, Better, Cheaper
Siloed Data Systems (ERP, CRM, DBs, SAN, NAS)
Apache Hadoop Ecosystem
● Massive scale
● More efficient than siloed systems
● Cheaper cost for IT
● Business agility
● A virtualized environment for multiple apps
● Consolidated compute & data
Containerized Hadoop Ecosystem
● What are the benefits & challenges?
*** Older & newer systems coexist, newer tech increasingly taking larger share
What Really Made Big Data and Hadoop Popular?
• Store and process data cost effectively at unprecedented scale
• Batch and interactive workloads on shared infrastructure
• Multi-tenant resource allocation capabilities
• Scale out – data, IO, compute using commodity hardware
• Much more evolved today ...
• Security & Governance systems
• BI tooling
• Operational tooling
• Serving systems
• ML
• AI
• Streaming
• SQL
• ...
At its Core, Big Data is about a Platform & the Workloads
• Big Data = Platform + Workloads
• Workloads
• User inputs resource requirements for work to be done
• Work is horizontally scaled by adding and removing containers
• Platform schedules containers
• Platform
• Manages resources to run workloads
• Includes advanced scheduling capabilities
• Multi-tenant support (Queues, Capacity guarantees, …)
• Fine-grained scheduling
[Diagram: one Platform hosting multiple Workloads]
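The platform/workload split above can be sketched as a toy scheduler. This is only an illustration under stated assumptions: the `Queue` and `Platform` classes, the queue names, and the memory-only resource model are hypothetical and are not YARN's or Kubernetes' actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """A tenant queue with a guaranteed memory share (illustrative model)."""
    guaranteed_mb: int
    used_mb: int = 0


@dataclass
class Platform:
    """Toy platform that grants containers to queues."""
    total_mb: int
    queues: dict = field(default_factory=dict)

    def free_mb(self) -> int:
        return self.total_mb - sum(q.used_mb for q in self.queues.values())

    def request(self, queue_name: str, container_mb: int, count: int) -> int:
        """Grant up to `count` containers of `container_mb`; return the number granted."""
        queue = self.queues[queue_name]
        granted = 0
        for _ in range(count):
            # Memory held back so other queues can still reach their guarantee.
            reserved = sum(
                max(other.guaranteed_mb - other.used_mb, 0)
                for other in self.queues.values()
                if other is not queue
            )
            if container_mb > self.free_mb() - reserved:
                break
            queue.used_mb += container_mb
            granted += 1
        return granted

    def release(self, queue_name: str, container_mb: int, count: int) -> None:
        """Scale a workload down by returning containers to the platform."""
        queue = self.queues[queue_name]
        queue.used_mb = max(queue.used_mb - container_mb * count, 0)


platform = Platform(total_mb=10_000, queues={"etl": Queue(6_000), "adhoc": Queue(4_000)})
adhoc_granted = platform.request("adhoc", container_mb=1_000, count=10)
etl_granted = platform.request("etl", container_mb=2_000, count=3)
```

The workload only states what it needs (container size and count); the platform decides how much to grant while protecting the other tenant's guarantee. Preemption is deliberately left out of this sketch.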
Multiple Classes of Big Data Application Types
• Jobs
• Batch or Interactive, short-lived, ephemeral
• Services
• Long running, persistent
• Platforms
• Schedulers, orchestrators, resource management
• The plumbing for Jobs and Services
• Supports a mix of Jobs and Services
• Security beyond client-server (tokens …)
• Unique: Horizontal scaling
• Unique: Understands locality and can move work closer to data
• Networks getting faster, but speed of light ...
** The lines may be blurred in some cases, e.g. Hive **
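"Move work closer to data" can be illustrated with a small placement function. This is a simplified sketch, not any scheduler's real algorithm; the function name, node names, and the three locality levels are assumptions made for the example.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a node for a task whose input block is stored on `replica_nodes`.

    Preference order: a node holding the data, then a node in the same rack
    as a replica, then any free node.
    """
    for node in replica_nodes:
        if node in free_nodes:
            return node, "node-local"
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in sorted(free_nodes):
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    for node in sorted(free_nodes):
        return node, "off-rack"
    return None, "unschedulable"


# Hypothetical four-node, two-rack cluster.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
choice = pick_node(["n1", "n3"], {"n3", "n4"}, rack_of)
```

A container platform that does not know where the data lives can only ever take the last branch, which is the locality concern raised on this slide.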
Example Systems
• Jobs: MapReduce, Hive + Tez, Spark
• Long Running Services: HBase, Spark Streaming, Storm, Hive LLAP
• Platform: YARN, K8S, Cloud
Containers & Big Data Together
It’s Complicated
Many nuances depending on the workload and systems
• Workloads can be platforms in disguise
• Workloads have varying requirements on collocation
• Different levels, fat containers to micro-services
• Stepping back, it’s similar to Big Data and VMs – what has changed?
Understand your goal for containerizing
• Consolidation, Utilization, Simplify/Unify
• Is it side-by-side or YARN on K8S?
There are tradeoffs – you may achieve one goal but lose on another
Considerations for Containerization
General (Non Big Data) Considerations
- OS Stability
- Fat containers and microservices
- Stateless & Stateful + Persistence
- Networking
System Stability
• Containers are tightly coupled to the OS kernel
• Containers leverage advanced kernel features
• Poor support in many kernels
• One container can cripple the host
• Run the newest kernel possible
• Image Management
• How do you keep up to date with the security fixes for your Docker images?
• DIY or the application vendor?
• Docker storage driver selection matters, especially for root-fs
• Heavy writes lead to panics
• SSDs may be needed
• Use overlay2 if workload allows it
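As a concrete illustration of the storage-driver advice, the sketch below generates a Docker `daemon.json` that selects `overlay2` only on a sufficiently new kernel. The `storage-driver` key is Docker's documented setting; the 4.0 threshold is an assumption for mainline kernels (distribution backports are not handled), and the helper names are made up for this example.

```python
import json
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading 'major.minor' of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def docker_daemon_config(kernel_release: str) -> str:
    """Return daemon.json content; leave the driver unset on older kernels."""
    config = {}
    # Assumption: overlay2 is only chosen on mainline kernels 4.0 or newer.
    if kernel_at_least(kernel_release, 4, 0):
        config["storage-driver"] = "overlay2"
    return json.dumps(config, indent=2)


new_kernel_config = docker_daemon_config("4.15.0-20-generic")
old_kernel_config = docker_daemon_config("3.10.0-327.el7.x86_64")
```

On a real host the release string could come from `platform.release()`, and the output would typically be written to `/etc/docker/daemon.json` before restarting the daemon.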
Fat Containers and Microservices
Fat Containers
• Lift and shift
• “Containers as VMs”
• Need an init system
• systemd in containers comes with gotchas
• Good intermediate step
Microservices
• Full decomposition is a journey
• Requires modification
• What is gained?
Stateless and Stateful plus Persistence
• Recall that shared storage (SANs or NFS) played a critical role in VMs
• To allow VMs to be allocated anywhere and relocated
• Stateless is ideal for containerization
• Greatly simplifies the life cycle when you don’t have to think about data
• Handling state - What kind of state?
• Data persistence? In-memory store?
• How does the application recover state?
• Is checkpointing needed?
• Performance impact?
• Making it remote shared storage seems to solve many problems
• but does it impact performance? (iSCSI for HDFS!!)
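One way to picture the checkpointing question is a container that periodically writes its state to a persistent volume and reloads it after a restart. The sketch below is a minimal example under that assumption; the function names and the JSON state format are illustrative only.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically: write a temp file, flush to disk, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace swaps the file in one step, so readers never see a partial file.
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or `default` on a first start."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

The `fsync` on every save is exactly where the "performance impact?" bullet bites, and more so when `path` sits on remote shared storage.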
Networking
• The great thing about networking is all the options /sarcasm
• Burden is on operations
• Goal: IP-address per container and no NAT
• DNS plays a more critical role as container IPs can change and services are dynamic
• but also have static service-IPs
• Cluster-wide versus corp-wide routing
• Multiple networking options
• No one size fits all, know your network and use cases
• Many plugins available to suit any need, but it’s on you
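Because container IPs can change, a client is safer resolving the service name each time it connects instead of caching an address. A minimal sketch follows, using the standard `socket.getaddrinfo`; the function name and the injectable `resolver` parameter are assumptions added so the behavior can be exercised without a real DNS server.

```python
import socket


def resolve_service(name, port, resolver=socket.getaddrinfo):
    """Look up the current addresses for a service name on every call (no caching)."""
    addresses = []
    for info in resolver(name, port, type=socket.SOCK_STREAM):
        address = info[4][0]  # sockaddr tuple starts with the IP address
        if address not in addresses:
            addresses.append(address)
    return addresses
```

If a container is rescheduled and comes back with a new IP, the next call picks it up, subject to whatever TTL and caching the DNS layer itself applies.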
Considerations for Jobs
Workloads: Considerations for Batch/Ephemeral/Interactive
• Summary: The systems that power these workloads typically run on other platforms/orchestrators. Most commonly these are analytic workloads on other platforms
• Examples: MapReduce, Apache Hive + Tez, Apache Spark
• Benefits of containerization
• Packaging dependent libraries in Docker image
• Challenges
• Data locality and networking considerations
• Security: User identity propagation from base OS
Considerations for Services
Considerations for Long Running Services
• Summary: These are typically serving systems with varying requirements.
• In many cases low latency online serving use cases that have specific resource requirements.
• Examples: Apache HBase, Apache Spark SQL/Streaming, Apache Storm, Apache Hive LLAP
• Benefits of containerization
• Ease of deployment
• Horizontal scaling
• Challenges
• Client access (discovery, relocation, load-balancing)
• Data locality considerations including short circuit reads
• Token/key expiration
• Management: config changes, upgrades, monitoring etc. - not needed for jobs
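Token/key expiration matters for services because they outlive the credentials they started with. A common pattern, sketched here with hypothetical names and an assumed 80% threshold, is to renew once most of the token lifetime has elapsed rather than waiting for a failure.

```python
def should_renew(issued_at: float, expires_at: float, now: float,
                 renew_fraction: float = 0.8) -> bool:
    """Return True once `renew_fraction` of the token lifetime has elapsed.

    All times are seconds on the same clock (for example, time.time()).
    """
    if expires_at <= issued_at:
        raise ValueError("expires_at must be after issued_at")
    return now >= issued_at + renew_fraction * (expires_at - issued_at)
```

A long-running service would call this from a periodic timer and fetch a fresh token when it returns True. A short-lived job usually finishes well inside one token lifetime, which is why this shows up under services and not jobs.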
Considerations for Platforms
• Summary: Workloads run on these platforms.
• Many platforms expect that they “own” the hardware/VM, which differs from workloads.
• => harder to containerize
• Examples: YARN, K8s, Docker Swarm, Nomad, Mesos ...
• Benefits for containerizing these platforms
• Hardware utilization across platforms
• Leverage existing HW investment of these platforms for more apps
• Two choices: containerize the platform or run containers on top of platform (YARN)
• Developer platform clusters without separate hardware
• Challenges
• Resource sharing (next slide)
• Networking considerations
• User propagation
Running platforms on container platforms
• Resource management challenges – Running two schedulers
• Running YARN workloads on top of K8s
• Running K8s as a long running service on YARN
• Side-by-side, but have some elastic CPU that is moved back and forth
• Full scheduler integration? Tracking jobs in YARN.
• How are resources shared?
• CPU is elastic and shareable, but memory is not
• Unused CPU can be used elsewhere, but not memory
– So unused capacity on YARN, Spark, or Hive LLAP nodes may not be used by other containers
• How do the schedulers cooperate to communicate resource consumption?
• Cgroups for resource tracking
• Containers added and removed on demand / possibly resized
• What about storage platforms (HDFS)?
• DataNodes are not moveable, storage is not elastic
• IO bandwidth is shareable, but dedicated IO is critical to HDFS/Big Data apps
• Leads to use of fat containers or bare metal for DataNodes and YARN workloads
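For the "cgroups for resource tracking" point, the sketch below parses the cgroup v2 `cpu.max` and `memory.max` file formats, which is one way two schedulers could read the limits actually applied to a container. The function names are illustrative, and reading the files from `/sys/fs/cgroup/...` is left out so the example runs anywhere.

```python
def parse_cpu_max(text: str):
    """Parse cgroup v2 cpu.max ("<quota> <period>" or "max <period>").

    Returns the CPU limit in cores, or None when the cgroup is unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text: str):
    """Parse cgroup v2 memory.max (bytes, or "max" for unlimited)."""
    value = text.strip()
    return None if value == "max" else int(value)


cpu_limit = parse_cpu_max("200000 100000\n")    # quota 200ms per 100ms period
memory_limit = parse_memory_max("1073741824\n")
```

The CPU quota is a time-slice that the kernel hands out again every period, so idle share can be lent elsewhere. The memory limit is a reservation that other containers cannot safely borrow, which is the asymmetry the slide describes.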
Summary
• Many nuances depending on the workload and systems
• Each application requires its own considerations when running in containers.
• Many benefits to running big data workloads in containers
• But it is important to understand the trade-offs that must be made.
• How and what you do is important
Going Forward and Summary
Our Approach Going Forward
• Every Application and Framework will be containerized
• Can run on K8s or YARN containers
• Address security concerns
• Root, User propagation etc.
• For fine-grained scheduling, use the built-in scheduler (Hive) or YARN
• Let the bottom container layer do the main scheduling
• K8S if you have that, or YARN if you do not
• Hybrid - Cloud and On-Prem have the same Experience & Architecture
• Some of the cloudy experience comes to on-prem
• Elasticity (within limits of on-prem capacity)
• Business agility
[Architecture diagram: Hortonworks DataPlane Services]
• CLOUD: workload-specific clusters, user provisioned; Cloud Storage, Cloud Compute
• CONTAINERS ON PREM: workload-specific clusters, user provisioned; Ozone/HDFS, K8s/YARN
• SHARED: meta-services (Atlas, Ranger, Metastore, Knox) span multiple clouds and colos
• Workloads shown in the clusters: Hive, Spark, NiFi, YARN, Kafka
Thank you
Extra Slides
TODO: Talk about Other Services
Very nuanced
- Kafka
- Solr
- ZooKeeper
More details on our internal container cloud running on YARN
Link to Sunitha’s talk
Motivations for Kubernetes
• Virtualized environment - CPU, Memory,…
• Elasticity for apps/services
• Business agility – a new service up and running
• Improved utilization
• Consolidation on a single platform
• Simpler IT management
• Cheaper than VMware - but only Linux Apps
• Storage is orthogonal
Hadoop & Kubernetes:
Side-by-side or Hadoop on top of Kubernetes?
• How do Big Data workloads compare to other apps typically run on Kubernetes?
• Other apps and services simply use some resources to perform some work
• You size the container to fit the work, add more containers to scale horizontally
• Unused container capacity? Size better and avoid using Horizontal scaling
• Hadoop and some of its apps
• Have their own schedulers for query processing and workloads
• YARN, Hive, Spark
• What to do about unused capacity
• Horizontal scaling helps, but each container is large to allow a potentially large workload
• Can scale down the number of containers but not their size – hence what to do about unused capacity?
• Storage
• Hadoop apps have very, very large storage needs – a large multi-petabyte data lake
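The unused-capacity question can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the example sizes are assumptions, and memory is the only resource considered.

```python
import math


def containers_needed(work_mb: int, container_mb: int) -> int:
    """Number of fixed-size containers required to hold `work_mb` of work."""
    return math.ceil(work_mb / container_mb)


def unused_mb(work_mb: int, container_mb: int) -> int:
    """Capacity that is allocated to the containers but not used by the work."""
    return containers_needed(work_mb, container_mb) * container_mb - work_mb


# The same 10,000 MB of work in small versus large containers.
small_waste = unused_mb(10_000, 4_096)
large_waste = unused_mb(10_000, 65_536)
```

With small containers the slack is at most one container's worth. With one large container sized for a potentially large workload, most of the allocation sits idle, and scaling the container count down cannot reclaim it.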
Full Decomposition Requires Modification
Fat Containers Microservices
Current State of Containerization in Hadoop YARN
Powerful Container Model in YARN
• Support for Docker containers
• Packaging and resource isolation
• Packaging is easier, e.g. TensorFlow
• Complements YARN’s support for long running services
• Examples – ML/AI becomes simpler
• Run Spark as regular YARN jobs or as Docker
• Allows ability to run pre-packaged Spark/ML applications
• Run TensorFlow with or without Docker
• Affinity/anti-affinity for GPUs
• Can securely access HDFS
• HDFS 3.0 will come with TechPreview of TensorFlow framework on YARN
• YARN on YARN …
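As a sketch of "run Spark as regular YARN jobs or as Docker", the helper below builds the `spark-submit` options that ask YARN to launch the application master and executors in a Docker image. It relies on YARN's `YARN_CONTAINER_RUNTIME_TYPE` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE` environment variables and on Spark's `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` settings. The image name and `app.py` are placeholders, and the cluster is assumed to have the Docker runtime enabled.

```python
def spark_docker_conf(image: str) -> list:
    """Build --conf arguments that run Spark's YARN containers in `image`."""
    env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }
    args = []
    # Set the variables for both the application master and the executors.
    for prefix in ("spark.yarn.appMasterEnv", "spark.executorEnv"):
        for key, value in env.items():
            args += ["--conf", f"{prefix}.{key}={value}"]
    return args


# Hypothetical image and application; only the command line is assembled here.
command = [
    "spark-submit", "--master", "yarn",
    *spark_docker_conf("registry.example.com/spark-ml:latest"),
    "app.py",
]
```

Dropping the extra `--conf` arguments submits the same application as a regular YARN job, which is the point of the slide: the packaging choice is made per application, not per cluster.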
YARN on YARN… Towards a Private Cloud …YCloud
[Diagram: Hadoop apps (MR, Tez, Spark, TensorFlow) on YARN, with a second YARN running MR, Tez, and Spark]

 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 

Containers and Big Data

  • 1. Containers and Big Data. Sanjay Radia, Chief Architect and Founder, Hortonworks; Apache Hadoop PMC. © Hortonworks Inc. 2011–2018. All rights reserved.
  • 2. About the Speakers: Sanjay Radia • Chief Architect, Founder, Hortonworks • Apache Hadoop PMC and Committer • Part of the original Hadoop team at Yahoo! since 2007 • Chief Architect of Hadoop Core at Yahoo! • Prior: data center automation, virtualization, Java, HA, OSs, file systems • Startup, Sun Microsystems, INRIA… • Ph.D., University of Waterloo
  • 3. Agenda • Our recent extensive experience with containers and Docker • A quick peek into the future: containerized Big Data • A detailed look at considerations in containerizing big data frameworks and applications • Understanding the Big Data ecosystem • General considerations • Considerations for jobs • Considerations for services • Considerations for platforms
  • 4. At Hortonworks, We Run Many Many Tests... • Dozens of product releases a year • ...over 30 open source projects • ...across a dozen supported Linux operating systems • ...and multiple backend databases • Result: tens of thousands of tests per release
  • 5. ...on a Container Cloud Powered by Apache Hadoop YARN • Testing HDP and HDF releases in container clusters • [Diagram: Jenkins workers (Docker) and HDP clusters (Docker) running on YARN, backed by HDFS] • That’s right! HDP running in Docker containers on YARN!
  • 6. 2 years and 7 million containers later... Many real world lessons learned
  • 7. Let’s Talk About Containers
  • 8. Machines to VMs to Containers: The Pursuit of Agility, Better, Cheaper • Physical machines: underutilized hardware; too long to provision for new business ideas; expensive; inefficient • VMs: simplified IT; business agility; consolidated hardware; improved utilization • Containers: more efficient than VMs; cheaper cost for IT; business agility; hybrid deployment value; continued consolidation; what are the challenges? • Note: older and newer systems coexist, with newer tech increasingly taking a larger share
  • 9. Containerization Is Gaining Momentum • Industry adoption continues • “Number of containerized applications will rise by 80% in the next two years” [1] • Patterns emerging • Multi-cloud and hybrid strategies • Adoption of microservices • Exponential ecosystem growth • Dozens of container orchestrators • Thousands of plugins • Market moves • [1] http://i.dell.com/sites/doccontent/business/solutions/whitepapers/en/Documents/Containers_Real_Adoption_2017_Dell_EMC_Forrester_Paper.pdf
  • 10. Why Are Containers Gaining Popularity? • Improved hardware utilization through increased density • No virtual machine operating system overhead • Image layer reuse limits data duplication on disk • Strong resource isolation • Namespaces and cgroups • Better software packaging (Docker) • Package applications and dependencies together • Distribution mechanism • Improved developer self service • Agility & control over the execution environment
  • 11. Container Architecture Patterns • Mix of services and jobs • Traditionally long lived services, but ephemeral jobs are emerging, with different scheduling needs • Decoupled compute and storage • Scale independently • What does this mean for Hadoop? • Hybrid deployments • Desire for consistency between cloud and on-prem • Cloud vendors are adopting Kubernetes
  • 12. Let’s Talk Big Data
  • 13. The Road to Big Data: The Pursuit of Faster, Better, Cheaper • Siloed data systems (ERP, CRM, DBs, SAN, NAS) • Apache Hadoop ecosystem: massive scale; more efficient than siloed systems; cheaper cost for IT; business agility; a virtualized environment for multiple apps; consolidated compute & data • Containerized Hadoop ecosystem: what are the benefits & challenges? • Note: older and newer systems coexist, with newer tech increasingly taking a larger share
  • 14. What Really Made Big Data and Hadoop Popular? • Store and process data cost effectively at unprecedented scale • Batch and interactive workloads on shared infrastructure • Multi-tenant resource allocation capabilities • Scale out data, IO, and compute using commodity hardware • Much more evolved today ... • Security & governance systems • BI tooling • Operational tooling • Serving systems • ML • AI • Streaming • SQL • ...
  • 15. At its Core, Big Data is about a Platform & the Workloads • Big Data = Platform + Workloads • Workloads • User inputs resource requirements for work to be done • Work is horizontally scaled by adding or removing containers • Platform schedules containers • Platform • Manages resources to run workloads • Includes advanced scheduling capabilities • Multi-tenant support (queues, capacity guarantees, …) • Fine-grained scheduling • [Diagram: one platform running several workloads]
  • 16. Multiple Classes of Big Data Application Types • Jobs • Batch or interactive, short-lived, ephemeral • Services • Long running, persistent • Platforms • Schedulers, orchestrators, resource management • The plumbing for jobs and services • Supports a mix of jobs and services • Security beyond client-server (tokens …) • Unique: horizontal scaling • Unique: understands locality and can move work closer to data • Networks are getting faster, but the speed of light ... • Note: the lines may be blurred in some cases, e.g. Hive
  • 17. Example Systems • Jobs: MapReduce, Hive + Tez, Spark • Long running services: HBase, Spark Streaming, Storm, Hive LLAP • Platforms: YARN, K8S, Cloud
  • 18. Containers & Big Data Together
  • 19. It’s Complicated: Many nuances depending on the workload and systems • Workloads can be platforms in disguise • Workloads have varying requirements on collocation • Different levels, from fat containers to microservices • Stepping back, it’s similar to Big Data and VMs: what has changed? • Understand your goal for containerizing • Consolidation, utilization, simplify/unify • Is it side-by-side or YARN on K8S? • There are tradeoffs: you may achieve one goal but lose on another
  • 20. Considerations for Containerization
  • 21. General (Non Big Data) Considerations - OS stability - Fat containers and microservices - Stateless & stateful + persistence - Networking
  • 22. System Stability • Containers are tightly coupled to the OS kernel • Containers leverage advanced kernel features • Poor support in many kernels • One container can cripple the host • Run the newest kernel possible • Image management • How do you keep up to date with the security fixes for your Docker images? • DIY or the application vendor? • Docker storage driver selection matters, especially for the root filesystem • Heavy writes lead to panics • SSDs may be needed • Use overlay2 if the workload allows it
  • 23. Fat Containers and Microservices • Fat containers: lift and shift; “containers as VMs”; need an init system; systemd in containers comes with gotchas; good intermediate step • Microservices: full decomposition is a journey; requires modification; what is gained?
  • 24. Stateless and Stateful plus Persistence • Recall that shared storage (SANs or NFS) played a critical role in VMs • To allow VMs to be allocated anywhere and relocated • Stateless is ideal for containerization • Greatly simplifies the life cycle when you don’t have to think about data • Handling state: what kind of state? • Data persistence? In-memory store? • How does the application recover state? • Is checkpointing needed? • Performance impact? • Making it remote shared storage seems to solve many problems • But does it impact performance (iSCSI for HDFS!)?
  • 25. Networking • The great thing about networking is all the options /sarcasm • Burden is on operations • Goal: IP address per container and no NAT • DNS plays a more critical role, as container IPs can change and services are dynamic • But static service IPs are also available • Cluster-wide versus corp-wide routing • Multiple networking options • No one size fits all; know your network and use cases • Many plugins are available to suit any need, but it’s on you
  • 26. Considerations for Jobs
  • 27. Workloads: Considerations for Batch/Ephemeral/Interactive • Summary: The systems that power these workloads typically run on other platforms/orchestrators. Most commonly these are analytic workloads on other platforms • Examples: MapReduce, Apache Hive + Tez, Apache Spark • Benefits of containerization • Packaging dependent libraries in the Docker image • Challenges • Data locality and networking considerations • Security: user identity propagation from the base OS
  • 28. Considerations for Services
  • 29. Considerations for Long Running Services • Summary: These are typically serving systems with varying requirements. • In many cases these are low latency online serving use cases that have specific resource requirements. • Examples: Apache HBase, Apache Spark SQL/Streaming, Apache Storm, Apache Hive LLAP • Benefits of containerization • Ease of deployment • Horizontal scaling • Challenges • Client access (discovery, relocation, load-balancing) • Data locality considerations, including short circuit reads • Token/key expiration • Management: config changes, upgrades, monitoring etc. (not needed for jobs)
  • 30. Considerations for Platforms
  • 31. Considerations for Platforms • Summary: Workloads run on these platforms. • Many platforms expect that they “own” the hardware/VM, which differs from workloads. • => harder to containerize • Examples: YARN, K8s, Docker Swarm, Nomad, Mesos ... • Benefits of containerizing these platforms • Hardware utilization across platforms • Leverage existing HW investment of these platforms for more apps • Two choices: containerize the platform or run containers on top of the platform (YARN) • Developer platform clusters without separate hardware • Challenges • Resource sharing (next slide) • Networking considerations • User propagation
  • 32. Running platforms on container platforms • Resource management challenges: running two schedulers • Running YARN workloads on top of K8s • Running K8s as a long running service on YARN • Side-by-side, but with some elastic CPU that is moved back and forth • Full scheduler integration? Tracking jobs in YARN. • How are resources shared? • CPU is elastic and shareable, but memory is not • Unused CPU can be used elsewhere, but unused memory cannot, so unused capacity on YARN, Spark, or Hive LLAP nodes may not be usable by other containers • How do the schedulers cooperate to communicate resource consumption? • Cgroups for resource tracking • Containers added and removed on demand / possibly resized • What about storage platforms (HDFS)? • DataNodes are not moveable, storage is not elastic • IO bandwidth is shareable, but dedicated IO is critical to HDFS/Big Data apps • Leads to use of fat containers or bare metal for DataNodes and YARN workloads
  • 33. Summary • Many nuances depending on the workload and systems • Each application requires its own considerations when running in containers. • Many benefits to running big data workloads in containers • But it is important to understand the trade-offs that must be made. • How and what you do is important
  • 34. Going Forward and Summary
  • 35. Our Approach Going Forward • Every application and framework will be containerized • Can run on K8s or YARN containers • Address security concerns • Root, user propagation etc. • For fine-grained scheduling, use the built-in scheduler (Hive) or YARN • Let the bottom container layer do the main scheduling • K8S if you have that, or YARN if you do not • Hybrid: cloud and on-prem have the same experience & architecture • Some of the cloudy experience comes to on-prem • Elasticity (within limits of on-prem capacity) • Business agility
  • 36. [Diagram: Hortonworks DataPlane Services spanning cloud and on-prem. Cloud: workload-specific, user-provisioned clusters (Hive, Spark, NiFi, YARN, Kafka) on cloud storage and cloud compute. On-prem: workload-specific, user-provisioned clusters (Hive, Spark, NiFi, YARN, Kafka) in containers on Ozone/HDFS and K8s/YARN. Shared meta-services (Atlas, Ranger, Metastore, Knox) span multiple clouds and colos.]
  • 37. Thank you
  • 38. Extra Slides
  • 39. TODO: Talk about Other Services. Very nuanced - Kafka - Solr - ZooKeeper
  • 40. More details on our internal container cloud running on YARN • Link to Sunitha’s talk
  • 41. Motivations for Kubernetes • Virtualized environment: CPU, memory, … • Elasticity for apps/services • Business agility: a new service up and running • Improved utilization • Consolidation on a single platform • Simpler IT management • Cheaper than VMware, but only Linux apps • Storage is orthogonal
  • 42. Hadoop & Kubernetes: Side-by-side or Hadoop on top of Kubernetes? • How do Big Data workloads compare to other apps typically run on Kubernetes? • Other apps/services simply use some resources to perform some work • You size the container to fit the work and add more containers to scale horizontally • Unused container capacity? Size better and avoid using horizontal scaling • Hadoop and some of its apps • Have their own schedulers for processing queries and workloads • YARN, Hive, Spark • What to do about unused capacity • Horizontal scaling helps, but each container is large to allow for a potentially large workload • Can scale down the number of containers but not their size, hence what to do about unused capacity? • Storage • Hadoop apps have very, very large storage needs: a large multi-petabyte data lake
  • 43. Full Decomposition Requires Modification • Fat Containers • Microservices
  • 44. Current State of Containerization in Hadoop YARN
  • 45. Powerful Container Model in YARN • Support for Docker containers • Packaging and resource isolation • Packaging is easier, e.g. TensorFlow • Complements YARN’s support for long running services • Examples: ML/AI becomes simpler • Run Spark as regular YARN jobs or as Docker • Allows running pre-packaged Spark/ML applications • Run TensorFlow with or without Docker • Affinity/anti-affinity for GPUs • Can securely access HDFS • HDFS 3.0 will come with a TechPreview of the TensorFlow framework on YARN • YARN on YARN …
  • 46. YARN on YARN… Towards a Private Cloud … YCloud • [Diagram: Hadoop apps (MR, Tez, Spark, TensorFlow) running on YARN, with a nested YARN running MR, Tez, and Spark]
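
The slide "Running platforms on container platforms" says that CPU is elastic and shareable but memory is not, so idle memory on YARN, Spark, or Hive LLAP nodes may stay unused. As a rough illustration only, the Python sketch below models that asymmetry. The `NodeUsage` fields, the node names, and all numbers are hypothetical; they do not come from the talk or from any YARN or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    """One platform's reservation on a node (all fields are hypothetical)."""
    name: str
    cpu_reserved: float     # vcores reserved for the platform, e.g. YARN
    cpu_used: float         # vcores busy right now
    mem_reserved_gb: float  # memory reserved for the platform
    mem_used_gb: float      # memory in use right now


def lendable_capacity(node: NodeUsage) -> dict:
    """Estimate what a second scheduler could borrow from this node.

    Idle CPU is counted as lendable because CPU shares can be taken back
    quickly. Idle memory is counted as stranded because reclaiming it would
    mean killing or resizing the borrower's containers.
    """
    idle_cpu = max(node.cpu_reserved - node.cpu_used, 0.0)
    idle_mem = max(node.mem_reserved_gb - node.mem_used_gb, 0.0)
    return {"cpu_vcores": idle_cpu, "mem_gb": 0.0, "stranded_mem_gb": idle_mem}


# Made-up numbers: a lightly loaded LLAP node and a busy YARN node.
nodes = [
    NodeUsage("llap-1", 16, 4, 128, 40),
    NodeUsage("yarn-1", 32, 30, 256, 200),
]
report = {n.name: lendable_capacity(n) for n in nodes}
for name, capacity in report.items():
    print(name, capacity)
```

In this toy model the lightly loaded node can lend most of its CPU while its idle memory is reported as stranded, which is the trade-off the slide describes when two schedulers run side by side.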
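
The slide "Powerful Container Model in YARN" mentions running Spark either as regular YARN jobs or as Docker. One way this is commonly wired up is by passing YARN's Docker runtime environment variables through Spark's `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` settings. The sketch below only builds the command line and does not submit anything. It assumes the cluster already allows the Docker runtime on its NodeManagers; the image name, jar path, and class name are hypothetical placeholders, not values from the talk.

```python
import shlex
from typing import List


def spark_submit_docker_args(image: str, app_jar: str, main_class: str) -> List[str]:
    """Build a spark-submit command that asks YARN to run Spark in Docker.

    Assumes the NodeManagers are already configured to allow the Docker
    container runtime; this function does not check or change that.
    """
    runtime_env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
            "--class", main_class]
    for key, value in runtime_env.items():
        # Apply the same runtime settings to the application master and executors.
        args += ["--conf", f"spark.yarn.appMasterEnv.{key}={value}"]
        args += ["--conf", f"spark.executorEnv.{key}={value}"]
    args.append(app_jar)
    return args


# Hypothetical image, jar, and class names for illustration.
cmd = spark_submit_docker_args(
    image="example-registry/spark-ml:latest",
    app_jar="app.jar",
    main_class="com.example.TrainModel",
)
print(shlex.join(cmd))
```

Keeping the runtime selection in per-job configuration, rather than in cluster-wide defaults, lets Docker and non-Docker Spark jobs share the same YARN cluster, which matches the "regular YARN jobs or as Docker" choice on the slide.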