Containers and Big Data

1 © Hortonworks Inc. 2011–2018. All rights reserved.
Sanjay Radia
Chief Architect, Founder Hortonworks
Apache Hadoop PMC

About the Speaker
Sanjay Radia
• Chief Architect, Founder, Hortonworks
• Apache Hadoop PMC and Committer
• Part of the original Hadoop team at Yahoo! since 2007
• Chief Architect of Hadoop Core at Yahoo!
• Prior
• Data center automation, virtualization, Java, HA, OSs, File Systems
• Startup, Sun Microsystems, INRIA…
• Ph.D., University of Waterloo

Agenda
• Our recent extensive experience with containers and Docker
• A quick peek into the future – containerized Big Data
• A detailed look at considerations in containerizing big data frameworks and applications
• Understanding the Big Data ecosystem
• General considerations
• Considerations for jobs
• Considerations for services
• Considerations for platforms

At Hortonworks, We Run Many Many Tests...
Dozens of product releases a year
...over 30 open source projects
...across a dozen supported Linux operating systems
…and multiple backend databases
Result: Tens of thousands of tests per release

...on a Container Cloud Powered by Apache Hadoop YARN
Testing HDP and HDF releases in container clusters
[Diagram: Jenkins, YARN, Worker (Docker) containers, HDP (Docker) containers, HDFS]
That’s right! HDP running in Docker containers on YARN!

2 years and 7 million containers later...
Many real world lessons learned

Let’s Talk About Containers

Machines to VMs to Containers
- The Pursuit of Agility, Better, Cheaper
Physical Machines
● Underutilized HW
● Too long to provision for new business ideas
● Expensive
● Inefficient
VMs
● Simplified IT
● Business agility
● Consolidated hardware
● Improved utilization
Containers
● More efficient than VMs
● Cheaper cost for IT
● Business agility
● Hybrid deployment value
● Continued consolidation
● What are the challenges?
*** Older & newer systems coexist, newer tech increasingly taking larger share

• Industry adoption continues
• “Number of containerized applications will rise by 80% in the next two years” [1]
• Patterns emerging
• Multi-cloud and hybrid strategies
• Adoption of Microservices
• Exponential ecosystem growth
• Dozens of container orchestrators
• Thousands of plugins
• Market moves
Containerization Is Gaining Momentum
1. http://i.dell.com/sites/doccontent/business/solutions/whitepapers/en/Documents/Containers_Real_Adoption_2017_Dell_EMC_Forrester_Paper.pdf

• Improved hardware utilization through increased density
• No virtual machine operating system overhead
• Image layer reuse limits data duplication on disk
• Strong resource isolation
• Namespaces and cgroups
• Better software packaging (Docker)
• Package applications and dependencies together
• Distribution mechanism
• Improved developer self service
• Agility & control over the execution environment
Why Are Containers Gaining Popularity?

• Mix of services and jobs
• Traditionally long-lived services
• But ephemeral jobs are emerging
– Different scheduling needs
• Decoupled compute and storage
• Scale independently
• What does this mean for Hadoop?
• Hybrid deployments
• Desire for consistency between cloud and on-prem
• Cloud vendors are adopting Kubernetes
Container Architecture Patterns

Let’s Talk Big Data

The Road to Big Data—the Pursuit of Faster, Better, Cheaper
Siloed Data Systems (ERP, CRM, DBs, SAN, NAS)
Apache Hadoop Ecosystem
● Massive scale
● More efficient than siloed systems
● Cheaper cost for IT
● Business agility
● A virtualized environment for multiple apps
● Consolidated compute & data
Containerized Hadoop Ecosystem
● What are the benefits & challenges?
*** Older & newer systems coexist, newer tech increasingly taking larger share

• Store and process data cost effectively at unprecedented scale
• Batch and interactive workloads on shared infrastructure
• Multi-tenant resource allocation capabilities
• Scale out – data, IO, compute using commodity hardware
• Much more evolved today ...
• Security & Governance systems
• BI tooling
• Operational tooling
• Serving systems
• ML
• AI
• Streaming
• SQL
• ...
What Really Made Big Data and Hadoop Popular?

• Big Data = Platform + Workloads
• Workloads
• User inputs resource requirements for work to be done
• Work is scaled horizontally by adding and removing containers
• Platform schedules containers
• Platform
• Manages resources to run workloads
• Includes advanced scheduling capabilities
• Multi-tenant support (Queues, Capacity guarantees, …)
• Fine-grained scheduling
At its Core, Big Data is about a Platform & the Workloads
[Diagram: three Workloads running on one Platform]
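The split between workloads and platform described on this slide can be sketched in a few lines. This is only an illustration under simple assumptions: the names `Node`, `ContainerRequest`, and `schedule` are hypothetical, and the first-fit policy stands in for the far richer scheduling (queues, capacity guarantees) that a real platform such as YARN provides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """A machine managed by the platform (hypothetical model)."""
    name: str
    free_cpu: int
    free_mem_gb: int


@dataclass
class ContainerRequest:
    """What a workload tells the platform it needs for one container."""
    workload: str
    cpu: int
    mem_gb: int


def schedule(nodes: List[Node],
             requests: List[ContainerRequest]) -> Dict[int, Optional[str]]:
    """First-fit placement: maps each request index to a node name, or None."""
    placement: Dict[int, Optional[str]] = {}
    for i, req in enumerate(requests):
        placement[i] = None
        for node in nodes:
            if node.free_cpu >= req.cpu and node.free_mem_gb >= req.mem_gb:
                # Reserve the resources on the chosen node.
                node.free_cpu -= req.cpu
                node.free_mem_gb -= req.mem_gb
                placement[i] = node.name
                break
    return placement


nodes = [Node("node-a", 8, 32), Node("node-b", 8, 32)]
# A workload scales horizontally by asking for more containers of the same shape.
requests = [ContainerRequest("etl", cpu=4, mem_gb=16) for _ in range(3)]
placement = schedule(nodes, requests)
```

With these made-up sizes, the first two containers fill node-a and the third spills over to node-b; a request that fits on no node stays unplaced (`None`).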

• Jobs
• Batch or Interactive, short-lived, ephemeral
• Services
• Long running, persistent
• Platforms
• Schedulers, orchestrators, resource management
• The plumbing for Jobs and Services
• Supports a mix of Jobs and Services
• Security beyond client-server (tokens …)
• Unique: Horizontal scaling
• Unique: Understands locality and can move work closer to data
• Networks getting faster, but speed of light ...
Multiple Classes of Big Data Application Types
** The lines may be blurred in some cases, e.g. Hive **
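The "moves work closer to data" point can be illustrated with a small placement function. This is a sketch, not how any particular scheduler implements it: the three-level preference (same node, then same rack, then anywhere) is an assumption for illustration, and `pick_node`, `rack_of`, and the node names are hypothetical.

```python
from typing import Dict, List, Optional


def pick_node(replica_nodes: List[str],
              free_nodes: List[str],
              rack_of: Dict[str, str]) -> Optional[str]:
    """Choose where to run a task that reads a block stored on replica_nodes."""
    # 1. Node-local: run on a node that already holds the data.
    for node in free_nodes:
        if node in replica_nodes:
            return node
    # 2. Rack-local: run in a rack that holds a replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node
    # 3. Anywhere: the data has to cross racks.
    return free_nodes[0] if free_nodes else None


rack_of = {"n1": "rack-1", "n2": "rack-1", "n3": "rack-2", "n4": "rack-2"}
node_local = pick_node(["n1", "n3"], ["n2", "n3"], rack_of)
rack_local = pick_node(["n1"], ["n4", "n2"], rack_of)
off_rack = pick_node(["n1"], ["n4"], rack_of)
```

In a containerized deployment this preference only works if the scheduler can still see which host and rack a container and its data actually live on.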

• Jobs: MapReduce, Hive + Tez, Spark
• Long Running Services: HBase, Spark Streaming, Storm, Hive LLAP
• Platform: YARN, K8S, Cloud
Example Systems

Containers & Big Data Together

• Workloads can be platforms in disguise
• Workloads have varying requirements on collocation
• Different levels, fat containers to micro-services
• Stepping back, it’s similar to Big Data and VMs – what has changed?
Understand your goal for containerizing
• Consolidation, Utilization, Simplify/Unify
• Is it side-by-side or YARN on K8S?
There are tradeoffs – you may achieve one goal but lose on another
It’s Complicated
Many nuances depending on the workload and systems

Considerations for Containerization

General (Non Big Data) Considerations
- OS Stability
- Fat containers and microservices
- Stateless & Stateful + Persistence
- Networking
• Containers are tightly coupled to the OS kernel
• Containers leverage advanced kernel features
• Poor support in many kernels
• One container can cripple the host
• Run the newest kernel possible
• Image Management
• How do you keep up to date with security fixes for your Docker images?
• DIY or the application vendor?
• Docker storage driver selection matters, especially for the root filesystem
• Heavy writes lead to panics
• SSDs may be needed
• Use overlay2 if workload allows it
System Stability

Fat Containers
• Lift and shift
• “Containers as VMs”
• Need an init system
• systemd in containers comes with gotchas
• Good intermediate step
Microservices
• Full decomposition is a journey
• Requires modification
• What is gained?
Fat Containers and Microservices

• Recall that shared storage (SANs or NFS) played a critical role in VMs
• To allow VMs to be allocated anywhere and relocated
• Stateless is ideal for containerization
• Greatly simplifies the life cycle when you don’t have to think about data
• Handling state - What kind of state?
• Data persistence? In-memory store?
• How does the application recover state?
• Is Checkpointing needed?
• Performance impact?
• Making it remote shared storage seems to solve many problems
• but does it impact performance? (iSCSI for HDFS!!)
Stateless and Stateful plus Persistence
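The recovery and checkpointing questions on this slide can be made concrete with a minimal sketch. It assumes the state is small, JSON-serializable, and written to a path on a volume that outlives the container; `save_checkpoint`, `load_checkpoint`, and the file name are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    # The rename is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or the default on a fresh start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def demo() -> dict:
    """Simulate a container that saves progress and a replacement that recovers it."""
    with tempfile.TemporaryDirectory() as volume:
        path = os.path.join(volume, "state.json")
        state = load_checkpoint(path, {"offset": 0})
        state["offset"] += 100
        save_checkpoint(path, state)
        return load_checkpoint(path, {"offset": 0})
```

The performance question from the slide shows up here as the `fsync` call: on remote shared storage every checkpoint pays a network round trip, so checkpoint frequency becomes a trade-off between recovery time and throughput.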

• The great thing about networking is all the options /sarcasm
• Burden is on operations
• Goal: IP-address per container and no NAT
• DNS plays a more critical role as container IP can change and services are dynamic
• but also have static service-IPs
• Cluster-wide versus corp-wide routing
• Multiple networking options
• No one size fits all, know your network and use cases
• Many plugins available to suit any need, but it’s on you
Networking
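Because a container's IP can change, clients should look up a service by name and re-resolve it regularly instead of holding on to an address. The sketch below shows one way to do that; the `ServiceAddress` class, the TTL value, and the service name in the usage example are hypothetical, while `socket.getaddrinfo` is the standard library resolver call.

```python
import socket
import time
from typing import Callable, List, Optional


def dns_lookup(name: str) -> List[str]:
    """Resolve a host name to its current IP addresses via the system resolver."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class ServiceAddress:
    """Caches a name lookup for a short time, then resolves again."""

    def __init__(self, name: str, ttl_seconds: float = 5.0,
                 lookup: Callable[[str], List[str]] = dns_lookup,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._name = name
        self._ttl = ttl_seconds
        self._lookup = lookup
        self._clock = clock
        self._cached: Optional[List[str]] = None
        self._fetched_at = 0.0

    def current(self) -> List[str]:
        """Return the addresses, refreshing them once the TTL has passed."""
        now = self._clock()
        if self._cached is None or now - self._fetched_at >= self._ttl:
            self._cached = self._lookup(self._name)
            self._fetched_at = now
        return self._cached
```

A client would call something like `ServiceAddress("hbase.example.internal").current()` before each connection attempt; the `lookup` and `clock` parameters exist only so the behavior can be exercised without a real DNS server.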

Considerations for Jobs

• Summary: The systems that power these workloads typically run on other platforms/orchestrators. Most commonly these are analytic workloads.
• Examples: MapReduce, Apache Hive + Tez, Apache Spark
• Benefits of containerization
• Packaging dependent libraries in Docker image
• Challenges
• Data locality and networking considerations
• Security: User identity propagation from base OS
Workloads: Considerations for Batch/Ephemeral/Interactive

Considerations for Services

• Summary: These are typically serving systems with varying requirements.
• In many cases low latency online serving use cases that have specific resource requirements.
• Examples: Apache HBase, Apache Spark SQL/Streaming, Apache Storm, Apache Hive LLAP
• Benefits of containerization
• Ease of deployment
• Horizontal scaling
• Challenges
• Client access (discovery, relocation, load-balancing)
• Data locality considerations including short circuit reads
• Token/key expiration
• Management: config changes, upgrades, monitoring etc. - not needed for jobs
Considerations for Long Running Services
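Token/key expiration matters for services because they outlive the credentials they start with, whereas a short job usually finishes first. A minimal sketch of renewing ahead of expiry follows; `Token`, `TokenHolder`, the `fetch` callback, and the 60-second margin are all hypothetical and do not represent the Hadoop delegation token API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # seconds, on the same clock the holder uses


class TokenHolder:
    """Hands out a valid token, fetching a new one shortly before expiry."""

    def __init__(self, fetch: Callable[[], Token],
                 renew_margin: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._renew_margin = renew_margin
        self._clock = clock
        self._token: Optional[Token] = None

    def get(self) -> str:
        """Return the current token value, renewing it when close to expiry."""
        now = self._clock()
        if self._token is None or self._token.expires_at - now <= self._renew_margin:
            self._token = self._fetch()
        return self._token.value
```

The service calls `get()` on every outgoing request instead of reading a token once at startup, so a restart or relocation of the container does not depend on a credential that was baked in at launch.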

Considerations for Platforms

• Summary: Workloads run on these platforms.
• Many platforms expect that they “own” the hardware/VM, which differs from workloads.
• => harder to containerize
• Examples: YARN, K8s, Docker Swarm, Nomad, Mesos ...
• Benefits for containerizing these platforms
• Hardware utilization across platforms
• Leverage existing HW investment of these platforms for more apps
• Two choices: containerize the platform, or run containers on top of the platform (YARN)
• Developer platform clusters without separate hardware
• Challenges
• Resource sharing (next slide)
• Networking considerations
• User propagation
Considerations for Platforms

• Resource management challenges – Running Two schedulers
• Running YARN workloads on top of K8s
• Running K8s as a long running service on YARN
• Side-by-side, but with some elastic CPU that is moved back and forth
• Full scheduler integration? Tracking jobs in YARN.
• How are resources shared?
• CPU is elastic and shareable, but memory is not
• Unused CPU can be used elsewhere, but not memory
– So unused capacity on YARN, Spark, or Hive LLAP nodes may not be used by other containers
• How do the schedulers cooperate to communicate resource consumption?
• Cgroups for resource tracking
• Containers added and removed on demand / possibly resized
• What about storage platforms (HDFS)?
• DataNodes are not moveable, storage is not elastic
• IO bandwidth is shareable, but dedicated IO is critical to HDFS/Big Data apps
• Leads to use of fat containers or bare metal for DataNodes and YARN workloads
Running platforms on container platforms
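The "CPU is elastic, memory is not" point can be shown as an admission check for a second scheduler that wants to place a container on a node already reserved by the first. This is a sketch under simplified assumptions: `NodeUsage`, `can_admit`, and all the numbers are hypothetical, and real schedulers would take these figures from cgroups or their own accounting.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    cpu_total: float        # cores on the node
    mem_total_gb: float
    cpu_used: float         # cores busy right now
    mem_reserved_gb: float  # memory promised to the primary platform (e.g. YARN)


def can_admit(node: NodeUsage, cpu_request: float, mem_request_gb: float) -> bool:
    """Decide whether a guest container may be co-located on this node."""
    # Idle CPU can be lent out and taken back at any moment.
    borrowable_cpu = node.cpu_total - node.cpu_used
    # Reserved memory cannot be lent, even if the owner is not using it yet:
    # reclaiming it later would mean killing the guest.
    unreserved_mem_gb = node.mem_total_gb - node.mem_reserved_gb
    return cpu_request <= borrowable_cpu and mem_request_gb <= unreserved_mem_gb


node = NodeUsage(cpu_total=32, mem_total_gb=256, cpu_used=8, mem_reserved_gb=224)
cpu_heavy_ok = can_admit(node, cpu_request=12, mem_request_gb=16)
mem_heavy_ok = can_admit(node, cpu_request=4, mem_request_gb=64)
```

With these numbers the CPU-heavy guest fits, because 24 cores are idle, while the memory-heavy guest is refused, because only 32 GB lie outside the reservation regardless of how much of the reserved memory is actually in use.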

Summary
• Many nuances depending on the workload and systems
• Each application requires its own considerations when running in containers.
• Many benefits to running big data workloads in containers
• But it is important to understand the trade-offs that must be made.
• How and what you do is important

Our Approach Going Forward
• Every Application and Framework will be containerized
• Can run on K8s or YARN containers
• Address security concerns
• Root, User propagation etc.
• For fine-grained scheduling, use the built-in scheduler (Hive) or YARN
• Let the bottom container layer do the main scheduling
• K8S if you have that, or YARN if you do not
• Hybrid - Cloud and On-Prem have the same Experience & Architecture
• Some of the cloudy experience comes to on-prem
• Elasticity (within limits of on-prem capacity)
• Business agility

[Diagram: Hortonworks DataPlane Services]
• CLOUD: workload specific clusters, user provisioned (Hive, Spark, NiFi, YARN, Kafka) on Cloud Storage and Cloud Compute
• CONTAINERS ON PREM: workload specific clusters, user provisioned (Hive, Spark, NiFi, YARN, Kafka) on Ozone/HDFS and K8s/YARN
• SHARED: Atlas, Ranger, Metastore, Knox; meta-services span multiple clouds and colos

Thank you

Extra Slides

TODO: Talk about Other Services
Very nuanced
- Kafka
- Solr
- ZooKeeper

More details on our internal container cloud running on YARN
Link to Sunitha’s talk

Motivations for Kubernetes
• Virtualized environment - CPU, Memory,…
• Elasticity for apps/services
• Business agility – a new service up and running
• Improved utilization
• Consolidation on a single platform
• Simpler IT management
• Cheaper than VMware - but only Linux Apps
• Storage is orthogonal

Hadoop & Kubernetes:
Side-by-side or Hadoop on top of Kubernetes?
• How do Big Data workloads compare to other apps typically run on Kubernetes
• Other apps/services simply use some resources to perform some work
• You size the container to fit the work, add more containers to scale horizontally
• Unused container capacity? Size better and avoid using horizontal scaling
• Hadoop and some of its apps
• Have their own schedulers for processing queries and workloads
• YARN, Hive, Spark
• What to do about unused capacity?
• Horizontal scaling helps, but each container is large to allow for a potentially large workload
• Can scale down the number of containers but not their size, hence what to do about unused capacity?
• Storage
• Hadoop apps have very very large storage needs – a large multi-petabyte DataLake

Full Decomposition Requires Modification
Fat Containers Microservices

Current State of Containerization in Hadoop YARN

Powerful Container Model in YARN
• Support for Docker containers
• Packaging and resource isolation
• Packaging is easier, e.g. TensorFlow
• Complements YARN’s support for long running services
• Examples – ML/AI becomes simpler
• Run Spark as regular YARN jobs or as Docker
• Allows running pre-packaged Spark/ML applications
• Run TensorFlow with or without Docker
• Affinity/anti-affinity for GPUs
• Can securely access HDFS
• HDFS 3.0 will come with TechPreview of TensorFlow framework on YARN
• YARN on YARN …

YARN on YARN… Towards a Private Cloud …YCloud
[Diagram: Hadoop Apps; YARN running MR, Tez, Spark, TensorFlow; a nested YARN running MR, Tez, Spark]
