3 examples for Big Data analytics containerized:
1. The installation with Docker and Weave for small and medium,
2. Hadoop on Mesos w/ Appache Myriad
3. Spark on Mesos
Big Data in Container; Hadoop Spark in Docker and Mesos
1. 1
Big Data in Container
Heiko Loewe @loeweh
Meetup Big Data Hadoop & Spark NRW 08/24/2016
2. 2
Why
• Fast Deployment
• Test/Dev Cluster
• Better Utilize Hardware
• Learn to manage Hadoop
• Test new Versions
• An appliance for continuous
integration/API testing
3. 3
Design
Master Container
- Name Node
- Secondary Name Node
- Yarn
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
4. 4
More than 1 Hosts needs Overlay Net
Interface Docker0 not routed
Overlay Network
1 Host Config
(almost ) no
Problem
For 2 Hosts
and more
we need an
Overlay Net-
work
5. 5
Choice of the Overlay Network Impl.
Docker Multi-Host Network Weave Net
• Backend: VXLAN, AWS, GCE.
• Fallback: custom UDP-based
tunneling.
• Control plane: built-in, uses Etcd
for shared state.
CoreOS Flanneld
• Backend: VXLAN.
• Fallback: none.
• Control plane: built-in, uses
Zookeeper, Consul or Etcd for
shared state.
• Backend: VXLAN via OVS.
• Fallback: custom UDP-based
tunneling called “sleeve”.
• Control plane: built-in.
6. 6
Normal mode of operations is called FDP – fast
data path – which works via OVS’s data path
kernel module (mainline since 3.12). It’s just
another VXLAN implementation.
Has a sleeve fallback mode, works in userspace
via pcap.
Sleeve supports full encryption.
Weaveworks also has Weave DNS, Weave
Scope and Weave Flux – providing
introspection, service discovery & routing
capabilities on top of Weave Net.
WEAVE NET
7. 7
/etc/sudoers
# at the end:
vuser ALL=(ALL) NOPASSWD: ALL
# secure_path, append /usr/local/bin for weave
Defaults secure_path =
/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
sudo groupadd docker
sudo gpasswd -a ${USER} docker
sudo chgrp docker /var/run/docker.sock
alias docker="sudo /usr/bin/docker"
Docker Adaption (Fedora/Centos/RHEL)
14. 14
• Container (like Docker) are the Foundation for agile
Software Development
• The initial Container Design was stateless (12-factor
App)
• Use-cases are grown in the last few Month
(NoSQL, Stateful Apps)
• Persistence for Container is not easy
The Problem
15. 15
• Enables Persistence of Docker Volumes
• Enables the Implementation of
– Fast Bytes (Performance)
– Data Services (Protection / Snapshots)
– Data Mobility
– Availability
• Operations:
– Create, Remove, Mount, Path, Unmount
– Additonal Option can be passed to the Volume Driver
DOCKER Volume Manager API
16. 16
Persistente Volumes for CONTAINER
Container OS
Storage
/mnt/PersistentData
Container Container
-v /mnt/PersistenData:/mnt/ContainerData
Container Container
Docker Host
17. 17
Docker Host
Persistente Volumes for CONTAINER
Container OS
Storage
/mnt/PersistentData
Container Container
-v /mnt/PersistenData:/mnt/ContainerData
Container Container
20. 20
Overlay Network
Strech Hadoop w/ persisten Volumes
Host A
Host B
Easiely strech
and shrink a
Cluster without
loosing the Data
21. 21
Other similar Projects
• Big Top Provisioner / Apache Foundation
https://github.com/apache/bigtop/tree/master/provisioner/docker
• Building Hortonworks HDP on Docker
http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/
https://hub.docker.com/r/hortonworks/ambari-server/
https://hub.docker.com/r/hortonworks/ambari-agent/
• Building Cloudera CHD on Docker
http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-
apache-hadoop-and-cloudera/
https://hub.docker.com/r/cloudera/quickstart/
Watch out Overlay Network topix
31. 31
What about the Data
Myriad only cares for the Compute
Master Container
- Name Node
- Secondary Name Node
- Yarn
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Myriad/
Mesos
Cares about
Has to be provided
Outside from
Myriad/Mesos
Has to be provided
Outside from
Myriad/Mesos
32. 32
What about the Data
• Myriad only cares for Compute / Map Reduce
• HDFS has to be provided on other Ways
Big Data New Realities
Big Data Traditional
Assumptions
Bare-metal
Data locality
Data on local disks
Big Data
New Realities
Containers and VMs
Compute and storage
separation
In-place access on
remote data stores
New Benefits
and Value
Big-Data-as-a-
Service
Agility and
cost savings
Faster time-to-
insights
33. 33
Options for HDFS Data Layer
• Pure HDFS Cluster (only Data Node running)
– Bare Metal
– Containerized
– Mesos based
• Enterprise HDFS Array
– EMC Isilon
35. 35
• Multi Tenancy
• Multiple HDFS Environments
sharing the same storage
• Quota possible on HDFS
Environments
• Snapshots of HDFS Environemnt
possible
• Remote Replication
• Worm Option for HDFS
• High Avaiable HDFS
Infrastructure (distributed
Namen and Data Nodes)
• Storage efficient (usable/raw 0.8
compared to 0.33 with Hadoop)
• Shared Access HDFS / CIFS /
NFS/SFTP possible
• Maintenance equals Enterprise
Array Standard
• All major Distributions supported
EMC Isilon Advantages over classic
Hadoop HDFS
Ok, so there really is a way to do this, but this means tons of work.
These Dev guys want everything instant and i am just one person.
How should I be able to deliver this?
Ok, so there really is a way to do this, but this means tons of work.
These Dev guys want everything instant and i am just one person.
How should I be able to deliver this?
Ok, so there really is a way to do this, but this means tons of work.
These Dev guys want everything instant and i am just one person.
How should I be able to deliver this?