What it takes to run Hadoop at Scale: Yahoo! Perspectives
1. What it Takes to Run Hadoop at Scale: Yahoo Perspectives
PRESENTED BY Sumeet Singh, Rajiv Chittajallu | June 11, 2015
Hadoop Summit 2015, San Jose
2. Introduction
2
Senior Engineer with the Hadoop Operations team at
Yahoo
Involved with Hadoop since 2006, starting with the
early 400-node to over 42,000-node prod env. today
Started with Center for Development of Advanced
Computing in 2002 before joining Yahoo in
BS degree in Computer Science from Osmania
University, India
Rajiv Chittajallu
Sr. Principal Engineer
Hadoop Operations
701 First Avenue,
Sunnyvale, CA 94089 USA
@rajivec
Manages Cloud Storage and Big Data products team
at Yahoo
Responsible for Product Management, Strategy and
Customer Engagements
Managed Cloud Engineering products teams and
headed Strategy functions for the Cloud Platform
Group at Yahoo
MBA from UCLA and MS from RPI
Sumeet Singh
Sr. Director, Product Management
Cloud Storage and Big Data Platforms
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
3. Hadoop a Secure Shared Hosted Multi-tenant Platform
3
TV
PC
Phone
Tablet
Pushed Data
Pulled Data
Web Crawl
Social
Email
3rd Party Content
Data Highway
Hadoop Grid
BI, Reporting, Adhoc Analytics
Data
Content
Ads
No-SQL
Serving Stores
Serving
4. Platform Evolution (2006 – 2015)
[Chart: Raw HDFS (in PB) and # Servers by year, 2006–2015]
Milestones:
- Yahoo! commits to scaling Hadoop for production use
- Research workloads in Search and Advertising
- Production (modeling) with machine learning & WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Nextgen Hadoop (H 0.23, YARN)
- New services (HBase, Storm, Spark, Hive)
- Increased user base with partitioned namespaces
- Apache H2.6 (scalable ML, latency, utilization, productivity)

          Servers  Use Cases
Hadoop     43,000        300
HBase       3,000         70
Storm       2,000         50
5. Top 10 Considerations for Scaling a Hadoop-based Platform
1. On-Premise or Public Cloud
2. Total Cost of Ownership (TCO)
3. Hardware Configuration
4. Network
5. Software Stack
6. Security and Account Management
7. Data Lifecycle Management and BCP
8. Metering, Audit and Governance
9. Integration with External Systems
10. Debunking Myths
6. On-Premise or Public Cloud – Deployment Models
On-Premise
- Private (dedicated) Clusters: large demanding use cases; new technology not
  yet platformized; data movement and regulation issues
- Hosted Multi-tenant (private cloud) Clusters: source of truth for all of the
  org's data; app delivery agility; operational efficiency and cost savings
  through economies of scale
Public Cloud
- Hosted Compute Clusters: when more cost effective than on-premise; time to
  market/results matter; data already in public cloud
- Purpose-built Big Data Clusters: for performance, tighter integration with
  tech stack; value added services such as monitoring, alerts, tuning and
  common tools
7. On-Premise or Public Cloud – Selection Criteria
Cost
  On-Premise: fixed, does not vary with utilization; favors scale and 24x7 centralized ops
  Public Cloud: variable with usage; favors run-and-done, decentralized ops
Data
  On-Premise: aggregated from disparate or distributed sources
  Public Cloud: typically generated and stored in the cloud
SLA
  On-Premise: job queue, capacity scheduler, BCP, catchup; controlled latency and throughput
  Public Cloud: no guarantees (beyond uptime) without provisioning additional resources
Tech Stack
  On-Premise: control over deployed technology; requires platform team/vendor support
  Public Cloud: little to no control over tech stack; no need for platform R&D headcount
Security
  On-Premise: shared env., control over data/movement, PII, ACLs, pluggable security
  Public Cloud: data typically not shared among users in the cloud
Multi-tenancy
  On-Premise: matters, complex to develop and operate
  Public Cloud: does not matter, clusters are dynamic/virtual and dedicated
8. On-Premise or Public Cloud – Evaluation
8
1
On-Premise
Public Cloud
Cost
Data
SLA
Tech Stack
Security
Multi-tenancy
9. On-Premise or Public Cloud – Utilization Matters
[Chart: Cost ($) vs. Utilization/Consumption (Compute and Storage). On-premise
Hadoop as a Service has a high starting cost but scales up gradually; on-demand
and terms-based public cloud services start cheap but grow with usage. Below the
crossover point, utilization favors public cloud service; above it, on-premise
Hadoop as a Service.]
Current and expected or target utilization can provide further insights into
your operations and cost competitiveness.
10. Total Cost Of Ownership (TCO) – Components
Monthly TCO: $2.1M (ILLUSTRATIVE), split across seven components (shares shown
on the slide: 60%, 12%, 10%, 7%, 6%, 3%, 2%):
1. Cluster Hardware — data nodes, name nodes, job trackers, gateways, load
   proxies, monitoring, aggregator, and web servers
2. R&D HC — headcount for platform software development, quality, and release
   engineering
3. Active Use and Operations (Recurring) — recurring datacenter ops cost
   (power, space, labor support, and facility maintenance)
4. Network Hardware — aggregated network component costs, including switches,
   wiring, terminal servers, power strips etc.
5. Acquisition/Install (One-time) — labor, POs, transportation, space, support,
   upgrades, decommissions, shipping/receiving etc.
6. Operations Engineering — headcount for service engineering and data
   operations teams responsible for day-to-day ops and support
7. Network Bandwidth — data transferred into and out of clusters for all colos,
   including cross-colo transfers
11. Total Cost Of Ownership (TCO) – Unit Costs (Hadoop)
Compute (Memory) — container memory where apps perform computation and access
HDFS if needed. Unit: $/GB-Hour (H 2.0+), GBs of memory available for an hour.
Unit cost = Monthly Memory Cost / Avail. Memory Capacity.
Compute (CPU) — container CPU cores used by apps to perform computation/data
processing. Unit: $/vCore-Hour (H 2.6+), vCores of CPU available for an hour.
Unit cost = Monthly CPU Cost / Avail. CPU vCores.
Storage — HDFS (usable) space needed by an app with the default replication
factor of three. Unit: $/GB of data stored, usable storage space (less
replication and overheads). Unit cost = Monthly Storage Cost / Avail. Usable
Storage.
Network Bandwidth — bandwidth needed to move data into/out of the clusters by
the app. Unit: $/GB for inter-region data transfers, inter-region (peak) link
capacity. Unit cost = Monthly BW Cost / Monthly GB In + Out.
Namespace — files and directories used by the apps (to understand/limit the
load on the NN).
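The unit-cost formulas above can be sketched in a few lines. All capacity figures below are made-up placeholders, and splitting the monthly TCO 30/30/40 across memory, CPU, and storage is the assumption the talk mentions, not a fixed rule:

```python
# Illustrative unit-cost calculation for a Hadoop grid (all figures made up).
# Unit cost = monthly cost of a resource pool / monthly available capacity.

HOURS_PER_MONTH = 730  # average hours in a month

def unit_cost_per_hour(monthly_cost_usd, capacity_units):
    """$ per unit-hour, e.g. $/GB-hour for memory or $/vCore-hour for CPU."""
    return monthly_cost_usd / (capacity_units * HOURS_PER_MONTH)

monthly_tco = 2_100_000          # $2.1M illustrative monthly TCO from the slides
mem_cost = 0.30 * monthly_tco    # assumed 30/30/40 memory/CPU/storage split
cpu_cost = 0.30 * monthly_tco
storage_cost = 0.40 * monthly_tco

avail_memory_gb = 1_500_000      # total YARN container memory (hypothetical)
avail_vcores = 400_000           # total YARN vCores (hypothetical)
usable_storage_gb = 300_000_000  # usable (not raw) HDFS space (hypothetical)

print(f"$/GB-hour:    {unit_cost_per_hour(mem_cost, avail_memory_gb):.6f}")
print(f"$/vCore-hour: {unit_cost_per_hour(cpu_cost, avail_vcores):.6f}")
print(f"$/GB stored:  {storage_cost / usable_storage_gb:.6f}")
```

Note the storage unit cost divides by usable capacity with no hour term, matching the $/GB-stored unit on the slide.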
12. Total Cost Of Ownership (TCO) – Consumption Costs
Compute (Memory) — monthly job and task cost:
  Map GB-Hours = GB(M1) x T(M1) + GB(M2) x T(M2) + ...
  Reduce GB-Hours = GB(R1) x T(R1) + GB(R2) x T(R2) + ...
  Cost = (M + R) GB-Hours x $0.002 / GB-Hour / Month = $ for the Job/Month
  Monthly roll-ups: (M+R) GB-Hours for all jobs can be summed up for the month
  for a user, app, BU, or the entire platform
Compute (CPU) — monthly job and task cost:
  Map vCore-Hours = vCores(M1) x T(M1) + vCores(M2) x T(M2) + ...
  Reduce vCore-Hours = vCores(R1) x T(R1) + vCores(R2) x T(R2) + ...
  Cost = (M + R) vCore-Hours x $0.002 / vCore-Hour / Month = $ for the Job/Month
  Monthly roll-ups: (M+R) vCore-Hours for all jobs can be summed up for the
  month for a user, app, BU, or the entire platform
Storage:
  $ / project (app) quota in GB (peak monthly used)
  $ / user quota in GB (peak monthly used)
  $ / data, with each user accountable for their portion of use, e.g.
  GB Read (U1) / (GB Read (U1) + GB Read (U2) + ...)
  Roll-ups through the relationship among user, file ownership, app, and their BU
Network Bandwidth:
  Bandwidth measured at the cluster level and divided among select apps and
  users of data based on average volume In/Out
  Roll-ups through the relationship among user, app, and their BU
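As a concrete sketch of the GB-hour math above — the task sizes and run times are hypothetical; only the $0.002/GB-hour rate comes from the slide:

```python
# Cost of a MapReduce job from its per-task container sizes and run times,
# following the GB-hour formula on the slide. Task values are hypothetical.

RATE_PER_GB_HOUR = 0.002  # $/GB-hour/month, from the slide

def gb_hours(tasks):
    """tasks: list of (container_gb, runtime_hours) tuples."""
    return sum(gb * hours for gb, hours in tasks)

map_tasks = [(2.0, 0.5), (2.0, 0.75), (4.0, 0.25)]   # hypothetical map tasks
reduce_tasks = [(4.0, 1.0), (4.0, 0.5)]              # hypothetical reduce tasks

job_gb_hours = gb_hours(map_tasks) + gb_hours(reduce_tasks)
job_cost = job_gb_hours * RATE_PER_GB_HOUR
print(f"{job_gb_hours} GB-hours -> ${job_cost:.4f} for the month")
# 9.5 GB-hours -> $0.0190 for the month
```

Summing `job_gb_hours` across all jobs for a user, app, or BU gives the monthly roll-ups the slide describes.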
13. Hardware Configuration – Physical Resources
Clusters in Datacenters: Datacenter 1 — Rack 1 ... Rack N
Server Resources: C-nn / 64, 128, 256 G / 4000, 6000 etc.
14. Hardware Configuration – Eventual Heterogeneity
14
3
24 G    8 cores        SATA 0.5 TB
48 G    12 cores       SATA 1.0 TB
64 G    Harpertown     SATA 2.0 TB
128 G   Sandy Bridge   SATA 3.0 TB
192 G   Ivy Bridge     SATA 4.0 TB
256 G   Haswell        SATA 6.0 TB
384 G

Heterogeneous Configurations: 10s of configs of data nodes (collected over the
years) without dictating scheduling decisions – let the framework balance out
the configs
Heterogeneous Storage: HDFS supports heterogeneous storage (HDD, SSD, RAM,
RAID etc.) – HDFS-2832, HDFS-5682
Heterogeneous Scheduling: operate purpose-built hardware in the same cluster
(e.g. GPUs) – YARN-796
15. Network – Common Backplane
Hadoop: NameNode, RM; DataNode, NodeManager
HBase: NameNode, HBase Master; DataNodes, RegionServers
Storm: Nimbus, Supervisor
Administration, Management and Monitoring: ZooKeeper Pools, HTTP/HDFS/GDM
Load Proxies
Applications and Data: Data Feeds, Data Stores, Oozie Server, HS2/HCat
Network Backplane
16. Network – Bottleneck Awareness
1. Large dataset joins or data sharing over the network between Hadoop
   clusters (Data Set 1 / Data Set 2)
2. Large extractions from the HBase cluster (low-latency data store) may
   saturate the network
3. Fast bulk updates from the Storm cluster (real-time/stream processing) may
   saturate the network
4. Large data copies (e.g. between Hadoop and Storm clusters) may not be
   possible
17. Network – 1/10G BAS (Rack Locality Not A Major Issue)
[Diagram: eight BAS pairs (BAS1-1/BAS1-2 ... BAS8-1/BAS8-2), each aggregating
N racks of RSWs over an L3 backplane into a fabric layer of FAB 1–8]
1 Gbps host links; 10 Gbps RSW uplinks; 8 x 10 Gbps into the fabric layer;
2:1 oversubscription; 48 racks, 15,360 hosts; SPOF!
21. Software Stack – Obsess With Use Cases, Not Tech
Stack layers: Common, YARN (Scheduling, Resource Management), HDFS (File System)
Tech categories: platformized tech with production support; in-progress, unmet
needs, or Apache alignment
RHEL6 64-bit, JDK8
22. Security and Account Management – Overview
22
6
Grid
Identity,
Authentication and
Authorization
User Id
SSO
Groups, Netgroups, Roles
RPC (GSSAPI)
UI (SPNEGO)
24. Data Lifecycle Management and BCP
Data Lifecycle: Source -> Acquisition -> Replication (Feeds) -> Retention
(policy-based expiration) -> Archival (tape backup) -> DataOut
Datastore: defines a data source/target (e.g. HDFS)
Dataset: defines the data flow of a feed
Workflow: defines a unit of work carried out by acquisition, replication, and
retention servers for moving an instance of a feed
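Policy-based expiration is simple to sketch. A minimal version, assuming yyyyMMdd-dated feed partitions (the naming convention here is an illustration, not the actual grid's):

```python
# Sketch of policy-based retention: find dated feed partitions older than the
# retention window. The yyyyMMdd partition naming is an assumed convention.
from datetime import date, timedelta

def expired_partitions(partitions, retention_days, today):
    """Return partition date-strings older than `retention_days`."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions
            if date(int(p[:4]), int(p[4:6]), int(p[6:8])) < cutoff]

parts = ["20150601", "20150520", "20150415"]
print(expired_partitions(parts, 30, date(2015, 6, 11)))
# with a 30-day policy, partitions before 2015-05-12 are candidates to drop
```

A retention server would run a scan like this on a schedule and drop the returned partitions.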
25. Data Lifecycle Management and BCP
Grid Data Management (Acquisition, Replication, Retention, Archival, DataOut)
spans Cluster 1 (Colo 1) and Cluster 2 (Colo 2), each with HDFS and a MetaStore:
- Feed acquisition lands feed datasets as partitioned external tables; Growl
  extracts the schema for backfill
- On load, HCatClient.addPartitions(...) is called and LOAD_DONE is marked,
  generating an add_partition event notification; feed replication does the
  same on the second cluster
- Partitions are dropped (HCatClient.dropPartitions(...)) after retention
  expiration, with a drop_partition notification
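The add-partition/notification flow can be sketched with a hypothetical in-memory stand-in for HCatClient. The real API is Java (org.apache.hive.hcatalog.api.HCatClient); everything below is illustrative, showing only the event pattern — register a partition, fire a notification, let downstream consumers react:

```python
# Minimal sketch of the add-partition / notification flow from the slide,
# using a hypothetical in-memory stand-in for HCatClient.

class FakeHCatClient:
    def __init__(self):
        self.partitions = {}   # (db, table) -> set of partition specs
        self.listeners = []    # callbacks fired on add_partition events

    def add_partitions(self, db, table, specs):
        self.partitions.setdefault((db, table), set()).update(specs)
        for notify in self.listeners:  # add_partition event notification
            notify(db, table, specs)

    def drop_partitions(self, db, table, specs):
        # e.g. called by retention after a feed's window expires
        self.partitions[(db, table)] -= set(specs)

client = FakeHCatClient()
loaded = []
client.listeners.append(lambda db, t, s: loaded.extend(s))  # replication hook
client.add_partitions("feeds", "clicks", {"dt=20150611"})
print(loaded)  # the downstream consumer saw the new partition
```

The LOAD_DONE marker in the real flow plays the role of the notification here: consumers only act once a partition instance is complete.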
26. Metering, Audit, and Governance
Starling (log warehouse) collects from log sources:
- Hadoop clusters (Cluster 1 ... Cluster n): FS, Job, and Task logs
- HBase clusters (Cluster 1 ... Cluster n): CF, Region, Action, and Query stats
- MetaStores (MS 1 ... MS n): DB, table, partition, and column access stats
- GDM (F 1 ... F n): data definition, flow, feed, and source
27. Metering, Audit, and Governance
Data Discovery and Access — governance by classification (Public,
Non-sensitive, Financial, Restricted) with controls including LMS integration,
Stock Admin integration, and approval flows; public data carries no additional
requirement.
28. Integration with External Systems
28
9
BI, Reporting, Transactional DBs
Hadoop Customers
…
DH
Cloud Messaging
Serving Systems
Monitoring, Tools, Portals
Infrastructure in Transition
29. Debunking Myths
✗ Hadoop isn't enterprise ready
✗ Hadoop isn't stable, clusters go down
✗ You lose data on HDFS
✗ Data cannot be shared across the org
✗ NameNodes do not scale
✗ Software upgrades are rare
✗ Hadoop use cases are limited
✗ I need expensive servers to get more
✗ Hadoop is so dead
✗ I need Apache this vs. that
(30 secs)
T: 30 secs
Let’s get started. I always wanted to document and talk about some of the top considerations that go into making Hadoop a scalable platform for the entire business, and this was a perfect opportunity to do so.
(30 secs)
T: 1 min
My name is Sumeet Singh, and I am a Sr. Director of Products for Hadoop and Big Data platforms at Yahoo. I have been working at Yahoo for nearly three and a half to four years now, and have had a few different roles in my time there. The top 10 list I am presenting is what I came up with. You may come up with a different list, or think of other things that may be more important for your businesses. Let me set some context first on Hadoop at Yahoo, and then we will jump right into it.
(1 min)
T: 2 min
Hadoop is a secure shared platform at Yahoo; a lot of people call it the public grid, but it's a private cloud. Yahoo products and properties across devices generate a lot of data that is of immense value to us in driving new and interesting user experiences across devices. All that data comes to Hadoop, which acts as a single source of truth for all data at Yahoo. A wide variety of other data also gets pulled into the Hadoop Grid from various sources as shown. The idea is to consolidate data from across the company from disparate sources in one place so that it can be (a) shared (b) enriched (c) de-duped (d) kept up to date. That data, once processed, is applied back or served as value to our consumers in the form of personalized experiences across our products and properties. And of course it is used for reporting and analytics. All of this is done while keeping web-scale economics and cost in mind.
(30 secs)
T: 2 min 30 secs
With improving platform capabilities and robustness, the workloads over the years have systematically moved from research into production and revenue systems. The use cases running on the platform have continued to grow, and we quickly reached a point where we started to depend on it. The growth is evident from the storage we have added at roughly 20-25% CAGR in the last five years.
(1 min)
T: 3 min 30 secs
Ok, so how did we get here? In the rest of the talk, I want to walk you through some of the top considerations that go into making Hadoop a scalable platform to run your entire business on.
(1) First, of course, is where you run and how you should think about it
(2) Second is the cost considerations and knowing who and what costs how much
(3) Third, of course is how to choose and think about hardware configurations
(4) Network is a big piece of the puzzle, and we will talk about that
(5) Software stack that you want to run on your platform
(6) Security is of course a big area, we will not get too detailed into any one of these
(7) Data movement or lifecycle management, including BCP provisions
(8) Then, we will get into audit and governance
(9) How well Hadoop works with other/ external systems
(10) And, finally, talking about some of the issues or non-issues when you make architectural choices on the platform.
(1 min)
T: 4 min 30 secs
Perhaps the first consideration would be where you want to run your workloads: on on-premise infrastructure or in the public cloud. With on-premise, you can go with two different deployment models: (1) private clusters that you would often set up for large demanding use cases, newer technology, or for regulatory reasons, and (2) the shared grid, which is the source of truth for most data at the company.
Public cloud models also come in two flavors. (1) You acquire hosted compute clusters and run Hadoop on them, mostly for setup-time or cost reasons, or if your data is already generated and stored in the public cloud. (2) Purpose-built big data clouds are also becoming popular, with more intense out-of-the-box integration among the offered stack components.
(2 min)
T: 6 min 30 secs
There is no right or wrong answer when you are thinking about which model to go after. It depends on your particular situation, your workloads, your plans for the future, adjacencies, and other investments in cloud. Let's consider some of the aspects that make you sway one way or the other.
(1) Cost – On-premise requires investments in setup and maintenance, public cloud has a convenience aspect to it as well.
(2) Data – You need to think about data movement and data serving. One of the most important aspects to consider
(3) SLA – There are multiple mechanisms to make sure a variety of SLAs are covered, not just uptime
(4) Tech Stack – You can deploy what you need in both places, but on-premise you have better control over the stack and its integration with other aspects of the cloud, such as monitoring and the serving stack (in our case data serving, ad serving, and content serving)
(5) Security – one of the deepest topics you need to think about.
(6) Multi-tenancy – security and multi-tenancy are hard. If your plans are to scale up your infrastructure, you need to think about it. Company policies also impact your software and operations.
(30 secs)
T: 7 min
Cost and SLAs (which is to say price and performance) are something you can evaluate quantitatively, but the other factors require careful consideration and choices. A single factor can sway you in one direction or the other.
(1 min)
T: 8 min
So when does an on-premise infrastructure make sense? When you amortize your investments over a higher usage of resources. In a nutshell, as you benchmark, you will realize that as utilization/consumption increases, you will eventually start to look fairly attractive compared with a public cloud model. Do not stop at the cost; conduct a sensitivity analysis based on your expected/target utilization before you make a decision.
(2 min)
T: 10 min
Knowing your true cost of operating a platform is important not only for metering but also for giving projects running on the platform a good idea of their ROI and profitability.
This is perhaps one of the most difficult exercises, as it requires running around to gather data that is often hard to get. Nonetheless, this is the first step and it is required. We will walk our way up.
Hardware – not just the namenodes and datanodes, but all kinds of other things involved. Also, most Hadoop servers have compute and JBODs, so remember that this is all things hardware.
R&D HC – this is not for writing or developing apps on top of hadoop, this is HC needed for platform development
Ops Cost – power is one of the biggest elements here.
Network – can be substantial depending on the backplane capacity and architectural choices
Acq./ Install – one time costs
Ops – this is where Yahoo excels: the number of people needed to maintain the infra.
Network bandwidth – can be significant depending on your operation.
As you understand your total cost of ownership, you can get to a monthly figure. Remember to properly depreciate assets and bring everything to a monthly run rate (opex + capex).
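The run-rate arithmetic is worth making explicit. A minimal sketch with straight-line depreciation; the figures are illustrative, chosen to land on the $2.1M monthly example from the slides:

```python
# Turning one-time hardware spend plus recurring opex into a monthly TCO run
# rate (straight-line depreciation; all figures illustrative).

def monthly_run_rate(capex_usd, depreciation_months, monthly_opex_usd):
    """Monthly TCO = depreciated capex + recurring opex."""
    return capex_usd / depreciation_months + monthly_opex_usd

# e.g. $36M of hardware depreciated over 36 months, plus $1.1M/month of
# power, network, and headcount (hypothetical split)
print(monthly_run_rate(36_000_000, 36, 1_100_000))  # 2100000.0
```

Whatever depreciation schedule your finance team uses, the point is that unit costs downstream should divide a blended monthly figure, not raw capex.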
(2 min)
T: 12 min
Once you have your monthly TCO, it is time to understand your unit costs. So, let’s talk about resources you are going to consume first.
Compute (Memory) is for the YARN containers where your Map or Reduce tasks run. The unit is $/GB-hr/month. You need to know your monthly compute TCO and total compute capacity
Compute (CPU) is for YARN containers in terms of CPU vCores – 10 vCores per hyper-threaded core, or 20 vCores per physical CPU core.
Storage – usable storage space instead of raw storage space.
Bandwidth – calculated at the cross-colo link level, but you get your monthly bill based on your portion of data in and out. So, your unit cost is $/(data in+out).
Namespace – while we do not cost it out separately for the namenode, it is important to manage it. We have allocations/ quota, simply the sum of all the files and directories.
(2 min)
T: 14 min
Remember the monthly TCO we calculated earlier. I am splitting my monthly TCO in 30/30/40 among compute memory, compute CPU and storage. You can choose based on your situation. If you buy them separately, then you already know the split.
My total slots or memory in the infrastructure is summed up and converted to hours. You need the notion of time as you will associate the units to usage by the users.
So, total cost/ total capacity (in slot hours) or in usable storage (good so that you can benchmark effectively should you go to cloud systems that charge for usable storage of data, not raw storage capacity).
And similarly for the bandwidth.
(30 secs)
T: 15 min
There are only four resources consumed in Hadoop operations when it comes to hardware: memory, CPU, storage/disk, and network. We will talk about networks within the datacenter at length as our fourth consideration. The best way to think of these resources is as configurable when deciding the type of server hardware you want to set up for Hadoop: combine these resources into a configuration you can certify through performance evaluation.
(1 min)
T: 16 min
Here are some of the possible combinations. As you can see, even today our clusters have generations of memory and CPU, as well as disk drives configured as JBODs, giving us well over 10 configurations live in production. Beyond recognizing the configurations in the overall pool, we let the framework make decisions when managing a cluster with heterogeneous configurations. As far as storage is concerned, Hadoop has support for multiple types of storage, although we are chugging along fine with disk drives. We are now also composing clusters with large-memory boxes and GPUs, particularly for machine learning, and let applications choose those configurations for specific needs based on labels assigned to them.
(2 min)
T: 18 min
Beyond servers, the network is another important aspect of the overall setup. We host all shared compute clusters on a dedicated backplane. Within a datacenter, bandwidth between all compute racks is consistent, independent of the clusters. This allows for inter-cluster access at transfer rates similar to intra-cluster ones. Nodes (racks) can be moved between clusters without network reconfiguration or physical moves.
(1 min)
T: 19 min
Big data applications can be demanding on the network, as they often present many-to-one traffic flows, otherwise known as incast. For example, you may be joining or accessing data between two Hadoop clusters. Large data extractions between HBase and Hadoop are common for web applications such as search. Between Storm and HBase, where incremental processing works great, fast bulk updates become an issue. Similarly when moving data between Hadoop and Storm clusters. We have seen saturation, although mostly at TOR switches before anything else gets overwhelmed.
(2 min)
T: 21 min
One way to compose your network among racks is through what we call BAS switches, a.k.a. "big ass switches," or the big-box fabric design (core – distribution – access layers; some call it aggregation and/or core). The host connect is 1G to the TOR; the switch uplink is 10G. The minimum all-to-all guarantee this backplane can support across 15,000+ servers is 500 Gbps with a 2:1 oversubscription. Oversubscription is the ratio of contention should all devices send traffic at the same time (1G copper, 8-way: 200 Mbps; 10G fiber, 2-way: 500 Mbps). This typical datacenter design can get inflexible to scale, and $/server for 10G may be expensive.
Also, given that nodes in Hadoop have IP addresses, you rely on L3/routing protocols and STP, which can again get inflexible and hard.
(2 min)
T: 23 min
A new leaf-and-spine architecture, modeled on old circuit-switched networks (1950s), is becoming popular now for better latency and $/server cost. Here the RSW can be 48x10G with 4x40G uplinks, and the 4x40G uplinks can be broken down into 16x10G (up to 160G). The number of VCs you construct defines the uplinks (e.g. 2 VCs with 2x40G), and the composition of the VCs defines the TORs you can have. Each leaf can support 16 TOR switches, so 32 leaves give 512 TORs, or 20,000+ hosts (20,480) that can be connected. With 2 VCs at 40G each, a rack has 80G of uplink against 400G of host capacity, i.e. 5:1 oversubscription; to decrease oversubscription, build more VCs. The number of leaves is always 2x the number of spines for non-blocking operation: half the ports go to the spine and half to the racks. Each TOR has 4 x 10G up.
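The oversubscription ratios quoted for both designs check out with a one-line calculation. The 40-hosts-per-rack figure below is inferred from the slide totals (15,360 hosts across 384 racks), not stated directly:

```python
# Oversubscription = total possible host traffic into a switch divided by that
# switch's uplink capacity. Reproducing the talk's two topology examples.

def oversubscription(hosts, host_gbps, uplinks, uplink_gbps):
    return (hosts * host_gbps) / (uplinks * uplink_gbps)

# BAS design: 40 hosts/rack at 1G into a TOR with 2x10G of uplink -> 2:1
print(oversubscription(40, 1, 2, 10))   # 2.0

# Leaf-spine: 40 hosts at 10G into a TOR with 2 VCs x 40G of uplink -> 5:1
# (the talk's "5:1 == 400G / 80G")
print(oversubscription(40, 10, 2, 40))  # 5.0
```

Adding VCs raises the denominator, which is exactly the "build more VCs to decrease oversubscription" lever mentioned above.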
(1 min)
T: 24 min
(1 min)
T: 25 min
(1 min)
T: 26 min
(1 min)
T: 27 min
User accounts live in LDAP, with netgroups for access control. Automated jobs/workflows run via headless accounts. There are two Kerberos REALMs with one-way trusts: Active Directory for individual users, and a Hadoop-specific REALM for headless users and service principals. The architecture is SOX compliant. Kerberos provides the Generic Security Service API (GSSAPI) for RPC auth, and the Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) is used for HTTP auth on the web UIs.
(1 min)
T: 28 min
(1 min)
T: 29 min
(2 min)
T: 31 min
Each site has a corresponding prod/non-prod cluster for user applications to fail over to. Feeds are kept in sync using replication. Applications develop a BCP strategy using native Hadoop services (e.g. async replication).