9. HADOOP PROVISIONING ISSUES
Each cloud provider has a proprietary API
Create images for each provider
Network configuration
Service discovery
Resize, failover, member join support
10. OUR APPROACH – DETAILS
Build your Docker image
Install or pre-install Hadoop services with Ambari
Install Serf and dnsmasq
Build your cloud image
Use Ansible to create an image
Provision the cluster
11. BUILD DOCKER IMAGES
Create the Dockerfile
Have Docker.io build the image
Optionally pre-install services
Use Ambari
Push image to Docker.io
Licensing questions
12. BUILD CLOUD IMAGES
Use a Docker ready base image
Use Ansible to provision the image template
Pull the Docker images
Apply custom infrastructure
Use cloud provider specific playbooks
AWS EC2
Azure
13. ANSIBLE
Configuration as data
Simplest way to automate IT
Secure and agentless
Goal oriented
One playbook – multiple modules
We use it to “burn” cloud images/templates
14. PROVISIONING – ISSUES
FQDN
/etc/hosts is read-only in Docker
Everybody needs to know everybody
DNS
Single point of failure
Dynamic cluster – nodes joining, leaving, failing
Routing
Cloud – ability to route between containers on different hosts
Collision free private IP range for Docker bridge
We need predefined host names/IP addresses
Start a DNS server
Use it as a reference: docker run --dns <IP_OF_DNS>
Nodes need to know each other
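A minimal sketch of how the hostname and DNS settings could be passed when starting a container. The `-h` and `--dns` flags are the Docker options named on these slides; the image name, hostname and IP address are placeholders chosen for illustration, not values from the talk.

```python
import shlex


def docker_run_command(image, hostname, dns_ip, detach=True):
    """Build the argv for `docker run` with a fixed hostname and DNS server."""
    argv = ["docker", "run"]
    if detach:
        argv.append("-d")
    # -h sets the container hostname, --dns points it at our DNS server.
    argv += ["-h", hostname, "--dns", dns_ip, image]
    return argv


# Placeholder values, for illustration only.
cmd = docker_run_command("example/ambari", "node1.example.internal", "172.17.0.2")
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to hand to `subprocess.run` without shell quoting issues.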
15. PROVISIONING – SOLUTION
FQDN
Use -h and --dns Docker params
DNS
dnsmasq runs in each Docker container
Serf member-xxx events trigger dnsmasq reconfiguration
Routing
Docker bridge configuration – follows a convention
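The bridge convention is not spelled out on the slide; the speaker notes only mention the AMI launch index. One way such a convention could work, purely as an assumption, is to carve a private range into per-host subnets and pick the subnet by launch index. The base range `172.18.0.0/16` and the /24 size below are hypothetical.

```python
import ipaddress


def bridge_subnet(launch_index, base="172.18.0.0/16", prefix=24):
    """Return the Docker bridge subnet for a host, selected by launch index.

    Assumption: every host gets its own /24 out of a shared base range,
    so container IPs never collide across hosts.
    """
    subnets = list(ipaddress.ip_network(base).subnets(new_prefix=prefix))
    if not 0 <= launch_index < len(subnets):
        raise ValueError(f"launch index {launch_index} is outside the base range")
    return subnets[launch_index]


def bridge_address(launch_index, **kwargs):
    """Return the first usable address of the subnet, for the bridge itself."""
    return next(bridge_subnet(launch_index, **kwargs).hosts())


for index in range(3):
    print(index, bridge_subnet(index), bridge_address(index))
```

Because each host derives its range from a value it already knows, no coordination is needed before the containers start.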
16. SERF
Gossip based membership
Service discovery
Decentralized
Lightweight, fault tolerant
Highly available
DevOps friendly
Keep an eye on Consul, Open vSwitch, pipework
17. SERF – DECENTRALIZED SERVICE DISCOVERY
Gossip instead of heartbeat
LAN, WAN profiles
Provides membership information
Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
Query
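The membership events above can drive a name-resolution update. The sketch below shows the core of such a handler, assuming Serf's usual handler contract: the event name arrives in the `SERF_EVENT` environment variable and member events list one member per line on stdin, tab-separated, starting with name and address. The hosts-file output format and the function names are illustrative choices, not the handler used in the talk.

```python
ADD_EVENTS = {"member-join", "member-update"}
REMOVE_EVENTS = {"member-leave", "member-failed", "member-reap"}


def parse_members(lines):
    """Extract (name, address) pairs from Serf member event lines."""
    members = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            members.append((fields[0], fields[1]))
    return members


def apply_event(hosts, event, members):
    """Return a new name -> address map after applying one member event."""
    updated = dict(hosts)
    for name, address in members:
        if event in ADD_EVENTS:
            updated[name] = address
        elif event in REMOVE_EVENTS:
            updated.pop(name, None)
    return updated


def render_hosts(hosts):
    """Render the map in hosts-file format, which dnsmasq can load."""
    return "".join(f"{addr} {name}\n" for name, addr in sorted(hosts.items()))


# Sample payload, for illustration only.
hosts = apply_event({}, "member-join", parse_members(["node1\t10.0.0.11\t\t\n"]))
print(render_hosts(hosts), end="")
```

A real handler would read `os.environ["SERF_EVENT"]` and `sys.stdin`, write the rendered file, and then ask dnsmasq to reload it.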
21. AWS EC2 – HADOOP CLUSTER
Use EC2 REST API to provision instances (from Dockerized image)
Start Docker containers
One Ambari server
N-1 Ambari agents connecting to server
Connect ambari-shell to
Define blueprint
Provision the cluster
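As a rough illustration of the "Define blueprint" step, the sketch below builds a very small Ambari blueprint document for an N-node cluster. The two host groups, the component selection and the HDP 2.1 stack version are assumptions made for the example; a usable blueprint needs more components (ZooKeeper and clients, for instance) and is not taken from the talk.

```python
import json


def build_blueprint(name, node_count, stack_name="HDP", stack_version="2.1"):
    """Build a minimal Ambari blueprint: one master host, N-1 slave hosts."""
    if node_count < 2:
        raise ValueError("need at least one master and one slave node")
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": stack_name,
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": "master",
                "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
                "cardinality": "1",
            },
            {
                "name": "slave",
                "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
                "cardinality": str(node_count - 1),
            },
        ],
    }


print(json.dumps(build_blueprint("demo-cluster", 4), indent=2))
```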
23. AWS EC2 - CLOUDFORMATION
Setting up a VPC manually is too complicated
Use CloudFormation
Manage the stack together
Template-based
Environments under version control
Customizable at runtime
No extra charge
"VpcId" : {
  "Type" : "String",
  "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
},
"SubnetId" : {
  "Type" : "String",
  "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
},
"SecondaryIPAddressCount" : {
  "Type" : "Number",
  "Default" : "1",
  "MinValue" : "1",
  "MaxValue" : "5",
  "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
  "ConstraintDescription" : "must be a number from 1 to 5."
},
"SSHLocation" : {
  "Description" : "The IP address range that can be used to SSH to the EC2 instances",
  "Type" : "String",
  "MinLength" : "9",
  "MaxLength" : "18",
  "Default" : "0.0.0.0/0",
  "AllowedPattern" : "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
  "ConstraintDescription" : "must be a valid IP CIDR range of the form x.x.x.x/x."
}
},
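The SSHLocation constraints can be checked locally before a stack is submitted. The sketch below mirrors the length limits from the parameter definition and assumes the pattern is the common dotted-quad CIDR regex with `\d` digit classes; the function name is made up for the example.

```python
import re

# Assumed pattern: four groups of 1-3 digits, a slash, then 1-2 digits.
SSH_LOCATION_PATTERN = re.compile(
    r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})"
)


def is_valid_ssh_location(value, min_length=9, max_length=18):
    """Check a value against the SSHLocation length and pattern constraints."""
    if not min_length <= len(value) <= max_length:
        return False
    # The whole value has to match, not just a substring of it.
    return SSH_LOCATION_PATTERN.fullmatch(value) is not None


for candidate in ["0.0.0.0/0", "10.0.0.0/16", "10.0.0.0", "not-a-cidr/0"]:
    print(candidate, is_valid_ssh_location(candidate))
```

The pattern only checks the shape of the value, so something like `999.999.999.999/99` still passes; Python's `ipaddress.ip_network` gives a stricter check if one is needed.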
24. CLOUDBREAK
Cloudbreak is a powerful left surf break over a coral reef, a mile southwest of the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts provisioning and eases management and monitoring of on-demand clusters.
Provisioning Hadoop has never been easier
25. CLOUDBREAK
Benefits
Elastic
Scalable
Blueprints
Flexible
Main REST resources
/template – specify a cluster infrastructure
/stack – creates a cloud infrastructure built from a template
/blueprint – describes a Hadoop cluster
/cluster – creates a Hadoop cluster
26. RESULTS AND ACHIEVEMENTS
Hadoop as a Service API
Available for EC2 and Azure cloud
OpenStack and bare metal are coming soon
Open source under Apache 2 licence
Same goals as Apache Ambari Launchpad project
What's next?
27. HADOOP SERVICES - AS A SERVICE
Leverage YARN
Slider (Hoya) providers
HBase, Accumulo
SequenceIQ providers - Flume, Tomcat
YARN-1964
QoS for YARN – heuristic scheduler
Platform as a Service API
28. BANZAI PIPELINE
Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful application development platform for building on-demand data and job pipelines running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
29. THANK YOU
Get the code: https://github.com/sequenceiq
Read about: http://blog.sequenceiq.com
Facebook: http://facebook.com/sequenceiq
Twitter: http://twitter.com/sequenceiq
LinkedIn: http://linkedin.com/sequenceiq
Contact: janos.matyas@sequenceiq.com
FEEL FREE TO CONTRIBUTE
Editor's notes
YAML
Dev env: use default Docker bridge (easy)
-h for hostname, --dns to specify the DNS service to use
Convention: AMI launch index
Fire and forget
Waits for an answer – limited response collection