What it Takes to Run Hadoop at Scale: Yahoo Perspectives
PRESENTED BY Sumeet Singh, Rajiv Chittajallu | June 11, 2015
Hadoop Summit 2015, San Jose
Introduction
• Senior Engineer with the Hadoop Operations team at Yahoo
• Involved with Hadoop since 2006, from the early 400-node clusters to today's production environment of over 42,000 nodes
• Started with the Center for Development of Advanced Computing in 2002 before joining Yahoo
• BS degree in Computer Science from Osmania University, India

Rajiv Chittajallu
Sr. Principal Engineer, Hadoop Operations
701 First Avenue, Sunnyvale, CA 94089 USA
@rajivec
• Manages the Cloud Storage and Big Data products team at Yahoo
• Responsible for product management, strategy, and customer engagements
• Managed Cloud Engineering products teams and headed strategy functions for the Cloud Platform Group at Yahoo
• MBA from UCLA and MS from RPI

Sumeet Singh
Sr. Director, Product Management, Cloud Storage and Big Data Platforms
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
Hadoop: a Secure, Shared, Hosted Multi-tenant Platform

[Architecture diagram: data pushed and pulled from devices (TV, PC, phone, tablet) and sources (web crawl, social, email, 3rd-party content) flows over the Data Highway into the Hadoop Grid for BI, reporting, and ad hoc analytics; the resulting data, content, and ads feed NoSQL serving stores for serving.]
Platform Evolution (2006 – 2015)

[Chart: growth from 2006 to 2015 in server count (left axis, up to 50,000 servers) and raw HDFS capacity (right axis, up to 700 PB).]

Milestones:
• Yahoo! commits to scaling Hadoop for production use
• Research workloads in Search and Advertising
• Production (modeling) with machine learning & WebMap
• Revenue systems with security, multi-tenancy, and SLAs
• Open sourced with Apache
• Hortonworks spinoff for enterprise hardening
• Next-gen Hadoop (H 0.23, YARN)
• New services (HBase, Storm, Spark, Hive)
• Increased user base with partitioned namespaces
• Apache H2.6 (scalable ML, latency, utilization, productivity)

Footprint today:
          Servers   Use Cases
Hadoop     43,000         300
HBase       3,000          70
Storm       2,000          50
Top 10 Considerations for Scaling a Hadoop-based Platform

1. On-Premise or Public Cloud
2. Total Cost of Ownership (TCO)
3. Hardware Configuration
4. Network
5. Software Stack
6. Security and Account Management
7. Data Lifecycle Management and BCP
8. Metering, Audit and Governance
9. Integration with External Systems
10. Debunking Myths
On-Premise or Public Cloud – Deployment Models

On-Premise:
• Private (dedicated) clusters: large, demanding use cases; new technology not yet platformized; data movement and regulation issues
• Hosted multi-tenant (private cloud) clusters: source of truth for all of the org's data; app delivery agility; operational efficiency and cost savings through economies of scale

Public Cloud:
• Hosted compute clusters: when more cost-effective than on-premise; time to market / results matter; data already in the public cloud
• Purpose-built big data clusters: for performance and tighter integration with the tech stack; value-added services such as monitoring, alerts, tuning, and common tools
On-Premise or Public Cloud – Selection Criteria

Cost
• On-Premise: fixed, does not vary with utilization; favors scale and 24x7 centralized ops
• Public Cloud: variable with usage; favors run-and-done, decentralized ops

Data
• On-Premise: aggregated from disparate or distributed sources
• Public Cloud: typically generated and stored in the cloud

SLA
• On-Premise: job queues, capacity scheduler, BCP, catch-up; controlled latency and throughput
• Public Cloud: no guarantees (beyond uptime) without provisioning additional resources

Tech Stack
• On-Premise: control over deployed technology; requires platform team / vendor support
• Public Cloud: little to no control over the tech stack; no need for platform R&D headcount

Security
• On-Premise: shared environment with control over data and its movement, PII, ACLs, pluggable security
• Public Cloud: data typically not shared among users in the cloud

Multi-tenancy
• On-Premise: matters; complex to develop and operate
• Public Cloud: does not matter; clusters are dynamic/virtual and dedicated
On-Premise or Public Cloud – Evaluation

[Chart: side-by-side evaluation of on-premise vs. public cloud across the six criteria: cost, data, SLA, tech stack, security, and multi-tenancy.]
On-Premise or Public Cloud – Utilization Matters

[Chart: cost ($) vs. utilization/consumption (compute and storage). On-premise Hadoop as a Service has a high starting cost that scales up gradually; on-demand and terms-based public cloud services start low but grow with usage. Below the crossover points the public cloud service is favored; above them, on-premise Hadoop as a Service wins.]

Current and expected (or target) utilization can provide further insight into your operations and cost competitiveness.
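To make the crossover point concrete, here is a minimal sketch comparing a fixed-cost on-premise service against a usage-priced public cloud service; both rates below are made-up assumptions, not figures from the deck.

```java
// Hypothetical crossover-point calculation: on-premise Hadoop as a Service
// carries a high fixed monthly cost, while a public cloud service is priced
// per consumed unit. Both rates are illustrative assumptions.
public class CrossoverPoint {
    public static void main(String[] args) {
        double onPremFixedMonthly = 500_000.0; // assumed fixed $ / month
        double cloudRatePerUnit = 0.05;        // assumed $ per consumed unit
        // On-premise cost is flat in utilization; cloud cost grows linearly,
        // so the two intersect where utilization x rate equals the fixed cost.
        double crossoverUnits = onPremFixedMonthly / cloudRatePerUnit;
        System.out.printf(
            "Below %,.0f units/month the cloud service is cheaper; above it, on-premise wins.%n",
            crossoverUnits);
    }
}
```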
Total Cost Of Ownership (TCO) – Components

1. Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2. R&D Headcount: headcount for platform software development, quality, and release engineering
3. Active Use and Operations (Recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance)
4. Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips, etc.
5. Acquisition/Install (One-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving, etc.
6. Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support
7. Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers

[Illustrative pie chart: a monthly TCO of $2.1M split across the seven components, with shares of 60%, 12%, 10%, 7%, 6%, 3%, and 2%.]
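As a sketch of how the chart's percentages translate into dollars: the assignment of shares to components below is an assumption made for illustration (the extraction gives the shares and the components without an explicit mapping), so treat the split as indicative only.

```java
// Illustrative dollar split of the $2.1M monthly TCO. The mapping of
// percentages to components is an ASSUMPTION for demonstration; only the
// total and the set of shares come from the slide.
public class TcoSplit {
    public static void main(String[] args) {
        double monthlyTco = 2_100_000.0;
        String[] components = {"Cluster hardware", "R&D headcount",
            "Active use and operations", "Network hardware",
            "Acquisition/install", "Operations engineering", "Network bandwidth"};
        double[] assumedShares = {0.60, 0.12, 0.10, 0.07, 0.06, 0.03, 0.02};
        for (int i = 0; i < components.length; i++) {
            System.out.printf("%-26s $%,10.0f%n",
                components[i], monthlyTco * assumedShares[i]);
        }
    }
}
```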
Total Cost Of Ownership (TCO) – Unit Costs (Hadoop)

Memory: container memory where apps perform computation and access HDFS if needed.
• Unit: $ / GB-hour (H 2.0+)
• Total capacity: GBs of memory available for an hour
• Unit cost: monthly memory cost / available memory capacity

CPU: container CPU cores used by apps to perform computation / data processing.
• Unit: $ / vCore-hour (H 2.6+)
• Total capacity: vCores of CPU available for an hour
• Unit cost: monthly CPU cost / available CPU vCores

Storage: HDFS (usable) space needed by an app with the default replication factor of three; files and directories used by the apps are also tracked to understand and limit the load on the NameNode.
• Unit: $ / GB of data stored
• Total capacity: usable storage space (less replication and overheads)
• Unit cost: monthly storage cost / available usable storage

Bandwidth: network bandwidth needed to move data into/out of the clusters by the app.
• Unit: $ / GB for inter-region data transfers
• Total capacity: inter-region (peak) link capacity
• Unit cost: monthly bandwidth cost / (monthly GB in + out)
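Each of these unit rates is the same division of a monthly pool cost by available capacity. A minimal sketch, assuming illustrative pool costs and capacities (none of these input numbers are from the deck):

```java
// Unit-cost arithmetic from the slide: each rate is a monthly resource-pool
// cost divided by that pool's available capacity. All inputs are assumed;
// memory/CPU capacity is interpreted as capacity x hours in the month.
public class UnitCosts {
    public static void main(String[] args) {
        double hoursPerMonth = 30 * 24;

        double monthlyMemoryCost = 400_000, availMemoryGB = 2_000_000;
        double memoryRate = monthlyMemoryCost / (availMemoryGB * hoursPerMonth);
        System.out.printf("$%.6f per GB-hour%n", memoryRate);

        double monthlyCpuCost = 300_000, availVCores = 500_000;
        double cpuRate = monthlyCpuCost / (availVCores * hoursPerMonth);
        System.out.printf("$%.6f per vCore-hour%n", cpuRate);

        double monthlyStorageCost = 250_000, usableStorageGB = 100_000_000;
        System.out.printf("$%.6f per GB stored%n", monthlyStorageCost / usableStorageGB);

        double monthlyBwCost = 50_000, gbInPlusOut = 10_000_000;
        System.out.printf("$%.6f per GB transferred%n", monthlyBwCost / gbInPlusOut);
    }
}
```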
Total Cost Of Ownership (TCO) – Consumption Costs

Memory (monthly job and task cost):
• Map GB-hours = GB(M1) × T(M1) + GB(M2) × T(M2) + …
• Reduce GB-hours = GB(R1) × T(R1) + GB(R2) × T(R2) + …
• Cost = (M + R) GB-hours × $0.002 / GB-hour / month = $ for the job / month
• Monthly roll-ups: (M + R) GB-hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform

CPU (monthly job and task cost):
• Map vCore-hours = vCores(M1) × T(M1) + vCores(M2) × T(M2) + …
• Reduce vCore-hours = vCores(R1) × T(R1) + vCores(R2) × T(R2) + …
• Cost = (M + R) vCore-hours × $0.002 / vCore-hour / month = $ for the job / month
• Monthly roll-ups: (M + R) vCore-hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform

Storage:
• Charged per project (app) quota in GB (peak monthly used) and per user quota in GB (peak monthly used)
• Charged per data read, with each user accountable for their portion of use, e.g. GB Read(U1) / (GB Read(U1) + GB Read(U2) + …)
• Roll-ups through the relationship among user, file ownership, app, and their BU

Bandwidth:
• Measured at the cluster level and divided among select apps and users of data based on average volume in/out
• Roll-ups through the relationship among user, app, and their BU
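A worked example of the GB-hour job costing above, with made-up task sizes and runtimes; only the $0.002/GB-hour rate comes from the slide.

```java
// Job-level consumption cost: sum container-GB x runtime-hours over all map
// and reduce tasks, then multiply by the platform rate. Task figures are
// invented for illustration; the $0.002/GB-hour rate is from the slide.
public class JobCost {
    public static void main(String[] args) {
        double[][] mapTasks = {{2.0, 0.50}, {2.0, 0.75}}; // {container GB, hours}
        double[][] reduceTasks = {{4.0, 0.25}};           // {container GB, hours}
        double gbHours = 0.0;
        for (double[] t : mapTasks) gbHours += t[0] * t[1];
        for (double[] t : reduceTasks) gbHours += t[0] * t[1];
        double cost = gbHours * 0.002; // $/GB-hour, per the slide
        System.out.printf("Job consumed %.2f GB-hours, costing $%.4f for the month%n",
            gbHours, cost);
        // Monthly roll-ups simply sum these job-level GB-hours per user,
        // app, BU, or the entire platform before applying the rate.
    }
}
```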
Hardware Configuration – Physical Resources

[Diagram: clusters are laid out across racks (Rack 1 … Rack N) within datacenters; server resources are summarized as CPU class (C-nn), memory (64, 128, 256 GB), and storage (4000, 6000, etc.).]
Hardware Configuration – Eventual Heterogeneity

Data-node configurations accumulated over the years span memory from 24 G to 384 G (24, 48, 64, 128, 192, 256, 384 G), CPUs from 8- and 12-core parts through the Harpertown, Sandy Bridge, Ivy Bridge, and Haswell generations, and SATA disks from 0.5 TB to 6.0 TB (0.5, 1.0, 2.0, 3.0, 4.0, 6.0 TB).

• Heterogeneous configurations: tens of data-node configs (collected over the years) without dictating scheduling decisions; let the framework balance out the configs
• Heterogeneous storage: HDFS supports heterogeneous storage (HDD, SSD, RAM, RAID, etc.) – HDFS-2832, HDFS-5682 (see the sketch after this list)
• Heterogeneous scheduling: operate special-purpose hardware in the same cluster (e.g. GPUs) – YARN-796
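As a hedged sketch of the heterogeneous-storage support referenced above (HDFS-2832/HDFS-5682): a path can be tagged with a storage policy so its blocks land on the intended media. The path and policy name below are illustrative; setStoragePolicy is the DistributedFileSystem API from the Hadoop 2.6 timeframe.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: pin part of a latency-sensitive dataset onto SSD with an HDFS
// storage policy. The path and policy name are illustrative.
public class StoragePolicyExample {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        // ONE_SSD keeps one replica on SSD and the rest on disk; the
        // framework, not the operator, then balances block placement.
        dfs.setStoragePolicy(new Path("/projects/hot_dataset"), "ONE_SSD");
        dfs.close();
    }
}
```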
Network – Common Backplane

[Diagram: Hadoop (NameNode/ResourceManager, DataNodes/NodeManagers), HBase (NameNode/HBase Master, DataNodes/RegionServers), and Storm (Nimbus, Supervisors) clusters, ZooKeeper pools, HTTP/HDFS/GDM load proxies, the Oozie server, HS2/HCat, data feeds, and data stores all share a common network backplane, together with administration, management, and monitoring systems.]
Network – Bottleneck Awareness

[Diagram: Hadoop clusters (data set 1 and data set 2), an HBase cluster (low-latency data store), and a Storm cluster (real-time / stream processing) connected over the shared network.]

1. Large dataset joins or data sharing over the network
2. Large extractions may saturate the network
3. Fast bulk updates may saturate the network
4. Large data copies may not be possible
Network – 1/10G BAS (Rack Locality Not A Major Issue)

[Diagram: hosts connect at 1 Gbps to rack switches (RSW) with 2:1 oversubscription on 10 Gbps uplinks to BAS pairs (BAS1-1/BAS1-2 … BAS8-1/BAS8-2); each BAS connects at 8 x 10 Gbps into an L3 backplane across a fabric layer (FAB 1 – FAB 8). 48 racks, 15,360 hosts; a SPOF is flagged in the topology.]
Network – 10G CLOS (Server Placement Not an Issue)

[Diagram: two virtual chassis, each a CLOS fabric of 16 spines (Spine 0 – Spine 15) and 32 leafs (Leaf 0 – Leaf 31); rack switches (RSW) connect hosts at 10 Gbps with 5:1 oversubscription and 2 x 40 Gbps uplinks. 512 racks, 20,480 hosts; a SPOF is still flagged in the topology.]
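The oversubscription ratios on the last two slides are just aggregate host-facing bandwidth divided by uplink bandwidth. A small sketch, assuming 40 hosts per rack (a figure consistent with the host and rack counts shown, e.g. 512 racks x 40 = 20,480 hosts, but not stated explicitly on the slides):

```java
// Rack oversubscription = total host-facing bandwidth / uplink bandwidth.
// The 40 hosts-per-rack figure and the 2 x 10 Gbps BAS uplink are
// assumptions consistent with the ratios quoted on the slides.
public class Oversubscription {
    static double ratio(int hosts, double hostGbps, double uplinkGbps) {
        return (hosts * hostGbps) / uplinkGbps;
    }
    public static void main(String[] args) {
        // 1G BAS design: 1 Gbps hosts, assumed 2 x 10 Gbps uplinks -> 2:1
        System.out.printf("BAS:  %.0f:1%n", ratio(40, 1.0, 2 * 10.0));
        // 10G CLOS design: 10 Gbps hosts, 2 x 40 Gbps uplinks -> 5:1
        System.out.printf("CLOS: %.0f:1%n", ratio(40, 10.0, 2 * 40.0));
    }
}
```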
Network – Gen Next

Source: http://www.opencompute.org
Software Stack – Where are We Today

Compute services: YARN (2.6), MapReduce (2.6), Tez (0.6), Spark (1.3), Storm (0.9.2)
Services: Hive (0.13), Pig (0.11, 0.14), Oozie (4.4), HCatalog (0.13), HDFS Proxy (3.2), GDM (6.4)
Storage: HDFS (2.6), HBase (0.98)
Infrastructure services: Zookeeper, Grid UI (SS/Doppler, Discovery, Hue 3.7), Monitoring, Starling, Messaging Service
Software Stack – Obsess With Use Cases, Not Tech

The stack builds up from common infrastructure (RHEL6 64-bit, JDK8) through HDFS (file system) and YARN (scheduling, resource management). Components fall into two buckets: platformized tech with production support, and tech that is in progress, addressing unmet needs, or tracking Apache alignment.
Security and Account Management – Overview

Grid identity, authentication, and authorization build on:
• User IDs and SSO
• Groups, netgroups, and roles
• RPC secured with GSSAPI (Kerberos)
• Web UIs secured with SPNEGO
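A minimal sketch of how a headless service would authenticate to such a grid using Hadoop's standard UserGroupInformation API; the principal and keytab path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch: Kerberos login for a grid service. Once logged in, Hadoop RPC
// authenticates via GSSAPI and web UIs via SPNEGO, as described above.
public class GridLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Hypothetical principal and keytab path.
        UserGroupInformation.loginUserFromKeytab(
            "svc_example@PROD.EXAMPLE.COM",
            "/etc/security/keytabs/svc_example.keytab");
        System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
    }
}
```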
Security and Account Management – Flexibly Secure

[Diagram: two Kerberos realms, Realm 1 for projects and services (PROD) and Realm 2 for users (CORP), bridged by an IdP/SP pair. Clients authenticate via user SSO and netgroups; within the grid, Hadoop RPC relies on delegation tokens, block tokens, and job tokens.]
Data Lifecycle Management and BCP

Data lifecycle: Source → Acquisition → Replication (feeds) → Retention (policy-based expiration) → Archival (tape backup) → DataOut

• Datastore: defines a data source/target (e.g. HDFS)
• Dataset: defines the data flow of a feed
• Workflow: defines a unit of work carried out by the acquisition, replication, and retention servers for moving an instance of a feed
Data Lifecycle Management and BCP

[Diagram: Grid Data Management (GDM) acquires feeds into Cluster 1 (Colo 1) and replicates them to Cluster 2 (Colo 2), each with its own HDFS and MetaStore, covering acquisition, feed replication, retention, archival, and dataout.]

• Feed datasets are registered as partitioned external tables; Growl extracts the schema for backfill
• On each load, HCatClient.addPartitions(…) registers the new partition and marks LOAD_DONE, emitting an add_partition event notification to downstream consumers
• After retention expiration, partitions are dropped with HCatClient.dropPartitions(…), emitting a drop_partition notification
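A sketch of this partition lifecycle with the HCatalog Java client (the Hive 0.13-era API matching the deck's stack); the database, table, location, and partition spec are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatAddPartitionDesc;
import org.apache.hive.hcatalog.api.HCatClient;

// Sketch: register a replicated feed instance as a partition, then drop it
// after retention expires. All names and paths are hypothetical.
public class FeedPartitionLifecycle {
    public static void main(String[] args) throws Exception {
        HCatClient client = HCatClient.create(new Configuration());
        Map<String, String> spec = new HashMap<>();
        spec.put("load_date", "20150611");
        // Registering the partition fires an add_partition event notification
        // that downstream consumers treat as the LOAD_DONE signal.
        HCatAddPartitionDesc partition = HCatAddPartitionDesc
            .create("feeds_db", "page_views",
                    "/data/feeds/page_views/20150611", spec)
            .build();
        client.addPartitions(Collections.singletonList(partition));
        // After the retention window, dropping the partition emits a
        // drop_partition notification.
        client.dropPartitions("feeds_db", "page_views", spec, true);
        client.close();
    }
}
```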
Metering, Audit, and Governance

Starling, the log warehouse, aggregates from these log sources:
• FS, job, and task logs from every cluster (Cluster 1 … Cluster n)
• CF, region, action, and query stats from the HBase clusters
• DB, table, partition, and column access stats from the metastores (MS 1 … MS n)
• Data definition, flow, feed, and source information from GDM (F 1 … F n)
Metering, Audit, and Governance

Data discovery and access are governed by classification. Levels (public, non-sensitive, financial, restricted) carry controls ranging from no additional requirements through LMS integration and Stock Admin integration to approval flows.
Integration with External Systems

[Diagram: Hadoop customers integrate with BI, reporting, and transactional DBs; Data Highway (DH); cloud messaging; serving systems; and monitoring, tools, and portals, with much of this infrastructure in transition.]
Debunking Myths

All of the following are myths:
✗ Hadoop isn't enterprise ready
✗ Hadoop isn't stable, clusters go down
✗ You lose data on HDFS
✗ Data cannot be shared across the org
✗ NameNodes do not scale
✗ Software upgrades are rare
✗ Hadoop use cases are limited
✗ I need expensive servers to get more
✗ Hadoop is so dead
✗ I need Apache this vs. that
Thank You
@sumeetksingh
@rajivec
Yahoo Kiosk #D5
We are Hiring!

Contenu connexe

Tendances

[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기NAVER D2
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Why I quit Amazon and Build the Next-gen Streaming System
Why I quit Amazon and Build the Next-gen Streaming SystemWhy I quit Amazon and Build the Next-gen Streaming System
Why I quit Amazon and Build the Next-gen Streaming SystemYingjun Wu
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions Yugabyte
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariDataWorks Summit
 
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021StreamNative
 
Sharding MySQL with Vitess
Sharding MySQL with VitessSharding MySQL with Vitess
Sharding MySQL with VitessHarun KÜÇÜK
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineInfluxData
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaDatabricks
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 

Tendances (20)

[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Cassandra ppt 1
Cassandra ppt 1Cassandra ppt 1
Cassandra ppt 1
 
Why I quit Amazon and Build the Next-gen Streaming System
Why I quit Amazon and Build the Next-gen Streaming SystemWhy I quit Amazon and Build the Next-gen Streaming System
Why I quit Amazon and Build the Next-gen Streaming System
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with Ambari
 
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
 
Sharding MySQL with Vitess
Sharding MySQL with VitessSharding MySQL with Vitess
Sharding MySQL with Vitess
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage Engine
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 

Similaire à What it takes to run Hadoop at Scale: Yahoo! Perspectives

Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successDataWorks Summit
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataPentaho
 
Costing your Bug Data Operations
Costing your Bug Data OperationsCosting your Bug Data Operations
Costing your Bug Data OperationsDataWorks Summit
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Sumeet Singh
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRABhadra Gowdra
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Graph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierGraph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierDemai Ni
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 

Similaire à What it takes to run Hadoop at Scale: Yahoo! Perspectives (20)

Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big Data
 
Resume - Narasimha Rao B V (TCS)
Resume - Narasimha  Rao B V (TCS)Resume - Narasimha  Rao B V (TCS)
Resume - Narasimha Rao B V (TCS)
 
Costing your Bug Data Operations
Costing your Bug Data OperationsCosting your Bug Data Operations
Costing your Bug Data Operations
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Graph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierGraph Data: a New Data Management Frontier
Graph Data: a New Data Management Frontier
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 

Dernier (20)

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

What it takes to run Hadoop at Scale: Yahoo! Perspectives

  • 1. W h a t i t Ta k e s t o R u n H a d o o p a t S c a l e : Ya h o o P e r s p e c t i v e s P R E S E N T E D B Y S u m e e t S i n g h , R a j i v C h i t t a j a l l u ⎪ J u n e 1 1 , 2 0 1 5 H a d o o p S u m m i t 2 0 1 5 , S a n J o s e
  • 2. Introduction 2  Senior Engineer with the Hadoop Operations team at Yahoo  Involved with Hadoop since 2006, starting with the early 400-node to over 42,000-node prod env. today  Started with Center for Development of Advanced Computing in 2002 before joining Yahoo in  BS degree in Computer Science from Osmania University, India Rajiv Chittajallu Sr. Principle Engineer Hadoop Operations 701 First Avenue, Sunnyvale, CA 94089 USA @rajivec  Manages Cloud Storage and Big Data products team at Yahoo  Responsible for Product Management, Strategy and Customer Engagements  Managed Cloud Engineering products teams and headed Strategy functions for the Cloud Platform Group at Yahoo  MBA from UCLA and MS from RPI Sumeet Singh Sr. Director, Product Management Cloud Storage and Big Data Platforms 701 First Avenue, Sunnyvale, CA 94089 USA @sumeetksingh
  • 3. Hadoop a Secure Shared Hosted Multi-tenant Platform 3 TV PC Phone Tablet Pushed Data Pulled Data Web Crawl Social Email 3rd Party Content Data Highway Hadoop Grid BI, Reporting, Adhoc Analytics Data Content Ads No-SQL Serving Stores Serving
  • 4. Platform Evolution (2006 – 2015) 4 0 100 200 300 400 500 600 700 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 45,000 50,000 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 RawHDFS(inPB) #Servers Year Servers Storage Yahoo! Commits to Scaling Hadoop for Production Use Research Workloads in Search and Advertising Production (Modeling) with machine learning & WebMap Revenue Systems with Security, Multi- tenancy, and SLAs Open Sourced with Apache Hortonworks Spinoff for Enterprise hardening Nextgen Hadoop (H 0.23 YARN) New Services (HBase, Storm, Spark, Hive) Increased User-base with partitioned namespaces Apache H2.6 (Scalable ML, Latency, Utilization, Productivity) Server s Use Cases Hadoo p 43,000 300 HBase 3,000 70 Storm 2,000 50
  • 5. Top 10 Considerations for Scaling a Hadoop-based Platform 5 On-Premise or Public Cloud Total Cost Of Ownership (TCO) Hardware Configuration 2 3 Network4 Software Stack5 6 7 8 10 Security and Account Management Data Lifecycle Management and BCP Metering, Audit and Governance 9 Integration with External Systems Debunking Myths 1
  • 6. On-Premise or Public Cloud – Deployment Models 6 1 Private (dedicated) Clusters Hosted Multi-tenant (private cloud) Clusters Hosted Compute Clusters  Large demanding use cases  New technology not yet platformized  Data movement and regulation issues  When more cost effective than on- premise  Time to market/ results matter  Data already in public cloud  Source of truth for all of orgs data  App delivery agility  Operational efficiency and cost savings through economies of scale On-Premise Public Cloud Purpose-built Big Data Clusters  For performance, tighter integration with tech stack  Value added services such as monitoring, alerts, tuning and common tools
  • 7. On-Premise or Public Cloud – Selection Criteria 7 1  Fixed, does not vary with utilization  Favors scale and 24x7 centralized ops  Variable with usage  Favors run and done, decentralized ops Cost  Aggregated from disparate or distributed sources  Typically generated and stored in the cloud Data  Job queue, cap. sched., BCP, catchup  Controlled latency and throughput  No guarantees (beyond uptime) without provisioning additional resources SLA  Control over deployed technology  Requires platform team/ vendor support  Little to no control over tech stack  No need for platform R&D headcount Tech Stack  Shared env., control over data /movement, PII, ACLs, pluggable security  Data typically not shared among users in the cloud Security  Matters, complex to develop and operate  Does not matter, clusters are dynamic/ virtual and dedicated Multi- tenancy On-Premise Public CloudCriteria
  • 8. On-Premise or Public Cloud – Evaluation 8 1 On-Premise Public Cloud Cost Data SLA Tech Stack Security Multi-tenancy
  • 9. On-Premise or Public Cloud – Utilization Matters 9 1 Utilization / Consumption (Compute and Storage) Cost($) On-premise Hadoop as a Service On-demand public cloud service Terms-based public cloud service Favors on-premise Hadoop as a Service Favors public cloud service x x Current and expected or target utilization can provide further insights into your operations and cost competitiveness Highstartingcost Scalingup Crossover point 1
  • 10. Total Cost Of Ownership (TCO) – Components 10 2 $2.1 M 60% 12% 7% 6% 3% 2% 6 5 4 3 2 1 7 10% Operations Engineering  Headcount for service engineering and data operations teams responsible for day-to-day ops and support 6 Acquisition/ Install (One-time)  Labor, POs, transportation, space, support, upgrades, decommissions, shipping/ receiving etc. 5 Network Hardware  Aggregated network component costs, including switches, wiring, terminal servers, power strips etc. 4 Active Use and Operations (Recurring)  Recurring datacenter ops cost (power, space, labor support, and facility maintenance 3 R&D HC  Headcount for platform software development, quality, and release engineering 2 Cluster Hardware  Data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers 1 Monthly TCOTCO Components Network Bandwidth  Data transferred into and out of clusters for all colos, including cross-colo transfers 7 ILLUSTRATIVE
  • 11. Total Cost Of Ownership (TCO) – Unit Costs (Hadoop) 11 2 Container memory where apps perform computation and access HDFS if needed Container CPU cores used by apps to perform computation / data processing Network bandwidth needed to move data into/out of the clusters by the app $ / GB-Hour (H 2.0+) GBs of Memory available for an hour Monthly Memory Cost Avail. Memory Capacity $ / vCore-Hour (H 2.6+) vCores of CPU available for an hour Monthly CPU Cost Avail. CPU vCores Unit Total Capacity Unit Cost $ / GB of data stored Usable storage space (less replication and overheads) Monthly Storage Cost Avail. Usable Storage $ / GB for Inter-region data transfers Inter-region (peak) link capacity Monthly BW Cost Monthly GB In + Out Files and directories used by the apps to understand/ limit the load on NN) HFDS (usable) space needed by an app with default replication factor of three
  • 12. Total Cost Of Ownership (TCO) – Consumption Costs 12 2 Map GB-Hours = GB(M1) x T(M1) + GB(M2) x T(M2) + … Reduce GB-Hours = GB(R1) x T(R1) + GB(R2) x T(R2) + … Cost = (M + R) GB-Hour x $0.002 / GB-Hour / Month = $ for the Job/ Month (M+R) GB-Hours for all jobs can summed up for the month for a user, app, BU, or the entire platform Monthly Job and Task Cost Monthly Roll- ups Map vCore-Hours = vCores(M1) x T(M1) + vCores(M2) x T(M2) + … Reduce vCore-Hours = vCores(R1) x T(R1) + vCores(R2) x T(R2) + … Cost = (M + R) vCore-Hour x $0.002 / vCore-Hour / Month = $ for the Job/ Month (M+R) vCore-Hours for all jobs can summed up for the month for a user, app, BU, or the entire platform / project (app) quota in GB (peak monthly used) / user quota in GB (peak monthly used) / data as each user accountable for their portion of use. For e.g. GB Read (U1) GB Read (U1) + GB Read (U2) + … Roll-ups through relationship among user, file ownership, app, and their BU Bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In/Out Roll-ups through relationship among user, app, and their BU
  • 13. Hardware Configuration – Physical Resources 13 3 [Diagram: clusters laid out as racks (Rack 1 … Rack N) in datacenters, with server resources denoted as C-nn / 64, 128, 256 G / 4000, 6000 etc.]
  • 14. Hardware Configuration – Eventual Heterogeneity 14 3 [Table of data-node configurations accumulated over the years: memory from 24 G, 48 G, 64 G, 128 G, 192 G, and 256 G up to 384 G; CPUs from 8-core and 12-core parts through the Harpertown, Sandy Bridge, Ivy Bridge, and Haswell generations; SATA disks of 0.5, 1.0, 2.0, 3.0, 4.0, and 6.0 TB]
   Heterogeneous configurations: 10s of configs of data nodes (collected over the years) without dictating scheduling decisions – let the framework balance out the configs
   Heterogeneous storage: HDFS supports heterogeneous storage (HDD, SSD, RAM, RAID, etc.) – HDFS-2832, HDFS-5682; see the sketch below
   Heterogeneous scheduling: operate multiple-purpose hardware in the same cluster (e.g. GPUs) – YARN-796
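  With heterogeneous storage, an application can pin a hot working set to a faster tier. A minimal sketch against the HDFS Java API, assuming a Hadoop 2.6+ cluster (the path and policy choice here are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class PinToSsd {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site/hdfs-site
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // Keep all replicas of this (hypothetical) hot dataset on SSD;
            // the framework places blocks accordingly on mixed-media nodes.
            dfs.setStoragePolicy(new Path("/projects/ml/hot-features"), "ALL_SSD");
        }
    }
}
```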
  • 15. Network – Common Backplane 15 4 [Diagram: a shared network backplane hosting Hadoop (NameNode, RM, DataNode/NodeManager), HBase (HBase Master, NameNode, DataNodes/RegionServers), and Storm (Nimbus, Supervisor) clusters, plus ZooKeeper pools, HTTP/HDFS/GDM load proxies, Oozie server, HS2/HCat, data feeds and data stores, and administration, management, and monitoring]
  • 16. Network – Bottleneck Awareness 16 4 Traffic among Hadoop clusters (Data Set 1, Data Set 2), an HBase cluster (low-latency data store), and a Storm cluster (real-time / stream processing): 1 large dataset joins or data sharing over the network; 2 large extractions may saturate the network; 3 fast bulk updates may saturate the network; 4 large data copies may not be possible.
  • 17. Network – 1/10G BAS (Rack Locality Not A Major Issue) 17 4 [Diagram: rack switches (RSW, N per backplane) uplink at 10 Gbps to paired BAS switches (BAS1-1/BAS1-2 … BAS8-1/BAS8-2) over an 8 x 10 Gbps fabric layer (FAB 1–8) with an L3 backplane; hosts connect at 1 Gbps with 2:1 oversubscription, covering 48 racks and 15,360 hosts; the design is flagged as having a SPOF]
  • 18. Network – 10G CLOS (Server Placement Not an Issue) 18 4 [Diagram: two virtual chassis, each with 16 spines (Spine 0–15) and 32 leafs (Leaf 0–31); rack switches (RSW, N per fabric) uplink at 2 x 40 Gbps; hosts connect at 10 Gbps with 5:1 oversubscription, covering 512 racks and 20,480 hosts; a SPOF is still flagged in the design]
  • 19. Network – Gen Next 19 4 Source: http://www.opencompute.org
  • 20. Software Stack – Where are We Today 20 5 Compute services: YARN (2.6), MapReduce (2.6), Tez (0.6), Storm (0.9.2), Spark (1.3), Hive (0.13), Pig (0.11, 0.14), HCatalog (0.13), Oozie (4.4). Storage: HDFS (2.6), HBase (0.98), HDFS Proxy (3.2). Infrastructure services: GDM (6.4), ZooKeeper, Grid UI (SS/Doppler, Discovery, Hue 3.7), Monitoring, Starling, Messaging Service.
  • 21. Software Stack – Obsess With Use Cases, Not Tech 21 5 [Stack diagram: HDFS (file system) and YARN (scheduling, resource management) as the common layers on RHEL6 64-bit and JDK8, with a legend distinguishing platformized tech with production support from in-progress work, unmet needs, or Apache alignment]
  • 22. Security and Account Management – Overview 22 6 Grid identity, authentication, and authorization: user IDs via SSO; groups, netgroups, and roles; RPC authentication via GSSAPI; UI authentication via SPNEGO.
  • 23. Security and Account Management – Flexibly Secure 23 6 [Diagram: two Kerberos realms – Realm 1 for projects and services (PROD), Realm 2 for users (CORP) – with IdP/SP SSO authenticating clients, netgroups for authorization, and Hadoop RPC on the grid secured with delegation tokens, block tokens, and job tokens]
  • 24. Data Lifecycle Management and BCP 24 7 Data lifecycle: Acquisition (Source) → Replication (Feeds) → Retention (policy-based expiration) → Archival (tape backup) → DataOut. A Datastore defines a data source/target (e.g. HDFS); a Dataset defines the data flow of a feed; a Workflow defines a unit of work carried out by the acquisition, replication, and retention servers for moving an instance of a feed.
  • 25. Data Lifecycle Management and BCP 25 7 [Diagram: Grid Data Management acquires feeds into Cluster 1 (Colo 1) and Cluster 2 (Colo 2), each with HDFS and a MetaStore; feed datasets are registered as partitioned external tables, and Growl extracts the schema for backfill. Each load calls HCatClient.addPartitions(…) and marks LOAD_DONE, emitting an add_partition event notification; after retention expiration, partitions are dropped with HCatClient.dropPartitions(…), emitting a drop_partition notification. Acquisition, feed replication, retention, and archival/dataout run against both clusters.]
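  A rough sketch of that partition-registration call path via the HCatalog Java API (the database, table, location, and partition values are hypothetical):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatAddPartitionDesc;
import org.apache.hive.hcatalog.api.HCatClient;

public class FeedPartitions {
    public static void main(String[] args) throws Exception {
        HCatClient client = HCatClient.create(new Configuration());

        // Register a newly landed feed instance as a partition of a
        // partitioned external table (hypothetical names and paths).
        Map<String, String> spec = new HashMap<>();
        spec.put("dt", "2015-06-11");
        HCatAddPartitionDesc desc = HCatAddPartitionDesc.create(
                "feeds_db", "page_views",
                "hdfs:///data/feeds/page_views/2015-06-11", spec).build();
        client.addPartitions(Collections.singletonList(desc));

        // After retention expiration, drop the partition; this fires the
        // drop_partition notification that downstream consumers listen for.
        client.dropPartitions("feeds_db", "page_views", spec, true);
        client.close();
    }
}
```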
  • 26. Metering, Audit, and Governance 26 8 [Diagram: Starling, the log warehouse, aggregates log sources – FS, job, and task logs from Hadoop clusters 1…n; column family, region, action, and query stats from HBase clusters 1…n; DB, table, partition, and column access stats from metastores MS 1…MS n; and data definition, flow, feed, and source metadata from GDM instances F 1…F n]
  • 27. Metering, Audit, and Governance 27 8 Data discovery and access are governed by classification: Public (no additional requirement), Non-sensitive (LMS integration), Financial (stock admin integration), and Restricted (approval flow).
  • 28. Integration with External Systems 28 9 [Diagram: Hadoop at the center, integrating with BI, reporting, and transactional DBs, customers, … DH, cloud messaging, serving systems, monitoring/tools/portals, and infrastructure in transition]
  • 29. Debunking Myths 29 10 ✗ Hadoop isn't enterprise ready ✗ Hadoop isn't stable, clusters go down ✗ You lose data on HDFS ✗ Data cannot be shared across the org ✗ NameNodes do not scale ✗ Software upgrades are rare ✗ Hadoop use cases are limited ✗ I need expensive servers to get more ✗ Hadoop is so dead ✗ I need Apache this vs. that

Editor's notes

  1. (30 secs) T: 30 secs Let’s get started. I always wanted to document and talk about some of the top considerations that go into making Hadoop a scalable platform for the entire business, and this was a perfect opportunity to do so.
  2. (30 secs) T: 1 min My name is Sumeet Singh, and I am a Sr. Director of Products for Hadoop and Big Data platforms at Yahoo. I have been working at Yahoo for nearly three and a half / four years now and have had a few different roles in my time there. The top 10 list I am presenting is what I came up with. You may come up with a different list, or think of other things that may be more important for your businesses. Let me set some context first on Hadoop at Yahoo, and then we will jump right into it.
  3. (1 min) T: 2 min Hadoop is a secure shared platform at Yahoo; a lot of people call it the public grid, but it's a private cloud. Yahoo products and properties across devices generate a lot of data that is of immense value to us in driving new and interesting user experiences across devices. All that data comes to Hadoop, which acts as a single source of truth for all data at Yahoo. A wide variety of other data also gets pulled into the Hadoop grid from various sources as shown. The idea is to consolidate data from all over the company, from disparate sources, in one place so that it can be (a) shared, (b) enriched, (c) de-duped, and (d) kept up to date. That data, once processed, is applied back or served as value to our consumers in the form of personalized experiences across our products and properties, and of course is used for reporting and analytics. All of this is done while keeping web-scale economics and cost in mind.
  4. (30 secs) T: 2 min 30 secs With improving platform capabilities and robustness, workloads have systematically moved over the years from research into production and revenue systems. The use cases running on the platform have continued to grow, and we quickly reached a point where we started to depend on it. The growth is evident from the storage we have added at roughly 20-25% CAGR in the last five years.
  5. (1 min) T: 3 min 30 secs Ok, so how did we get here? In the rest of the talk, I want to walk you through some of the top considerations that go into making Hadoop a scalable platform to run your entire business on. (1) The first, of course, is where you run and how you should think about it. (2) Second is cost considerations and knowing who and what costs how much. (3) Third is how to choose and think about hardware configurations. (4) Network is a big piece of the puzzle, and we will talk about that. (5) The software stack that you want to run on your platform. (6) Security is of course a big area; we will not get too detailed into any one of these. (7) Data movement or lifecycle management, including BCP provisions. (8) Then, we will get into metering, audit, and governance. (9) How well Hadoop works with other/external systems. (10) And, finally, some of the issues or non-issues when you make architectural choices on the platform.
  6. (1 min) T: 4 min 30 secs Perhaps the first consideration is where you want to run your workloads: on on-premise infrastructure or in the public cloud. With on-premise, you can go with two different deployment models: (1) a private cluster that you would often set up for large demanding use cases, newer technology, or regulatory reasons, and (2) the shared grid, which is the source of truth for most data at the company. Public cloud models also come in two flavors: (1) you acquire hosted compute clusters and run Hadoop on them, mostly for setup time or cost reasons, or if your data is already generated and stored in the public cloud; (2) purpose-built big data clouds are also becoming popular, with more intense out-of-the-box integration among the offered stack components.
  7. (2 min) T: 6 min 30 secs There is no right or wrong answer when you are thinking about which model to go after. It depends on your particular situation, your workloads, your plans for the future, adjacencies, and other investments in cloud, etc. Let's consider some of the aspects that sway you one way or the other. (1) Cost – on-premise requires investments in setup and maintenance; public cloud has a convenience aspect to it as well. (2) Data – you need to think about data movement and data serving; this is one of the most important aspects to consider. (3) SLA – there are multiple mechanisms to make sure a variety of SLAs are covered, not just uptime. (4) Tech stack – you can deploy what you need in both places, but you have better control over the stack and its integration with other aspects of the cloud, such as monitoring and the serving stack (in our case cloud/data serving, ad serving, and content serving). (5) Security – one of the deepest topics you need to think about. (6) Multi-tenancy – security and multi-tenancy are hard; if your plans are to scale up your infrastructure, you need to think about it. Company policies also impact your software and operations.
  8. (30 secs) T: 7 min Cost and SLAs (which are price and performance) are something you can evaluate quantitatively, but the other factors require careful consideration and choices. A single factor can sway you in one direction or the other.
  9. (1 min) T: 8 min So when does on-premise infrastructure make sense? When you amortize your investments over higher usage of resources. In a nutshell, as you benchmark, you will realize that as utilization/consumption increases, on-premise eventually starts to look fairly attractive compared with a public cloud model. Do not stop at the cost; conduct a sensitivity analysis based on your expected/target utilization before you make a decision.
  10. (2 min) T: 10 min Knowing your true cost of operating a platform is important not only for metering but also for giving projects running on the platform a good idea of their ROI and profitability. This is perhaps one of the most difficult exercises, as it requires running around to gather data that is often hard to get. Nonetheless, it is the required first step. We will walk our way up. Hardware – not just the namenodes and datanodes, but all kinds of other things involved; also, most Hadoop servers combine compute with JBODs, so remember this covers all things hardware. R&D HC – this is not for writing or developing apps on top of Hadoop; this is the headcount needed for platform development. Ops cost – power is one of the biggest elements here. Network – can be substantial depending on the backplane capacity and architectural choices. Acq./Install – one-time costs. Ops – this is where Yahoo excels: the number of people needed to maintain the infra. Network bandwidth – can be significant depending on your operation. As you understand your total cost of ownership, you can get to a monthly figure. Remember to properly depreciate assets and bring everything to a monthly run rate (opex + capex).
  11. (2 min) T: 12 min Once you have your monthly TCO, it is time to understand your unit costs. So, let's talk about the resources you are going to consume first. Compute (memory) is for the YARN containers where your map or reduce tasks run; the unit is $/GB-hour/month, so you need to know your monthly compute TCO and total compute capacity. Compute (CPU) is for YARN containers in terms of CPU vCores – 10 vCores per hyper-threaded core, or 20 vCores per physical CPU core. Storage – usable storage space instead of raw storage space. Bandwidth – calculated at the cross-colo link level, but you get your monthly bill based on your portion of data in and out, so your unit cost is $/(data in+out). Namespace – while we do not cost it out separately for the namenode, it is important to manage it; we have allocations/quota, simply the sum of all the files and directories.
  12. (2 min) T: 14 min Remember the monthly TCO we calculated earlier. I am splitting my monthly TCO 30/30/40 among compute memory, compute CPU, and storage. You can choose based on your situation; if you buy them separately, then you already know the split. The total slots or memory in the infrastructure are summed up and converted to hours – you need the notion of time, as you will associate the units to usage by the users. So, unit cost = total cost / total capacity (in slot-hours), or per GB of usable storage (good so that you can benchmark effectively should you go to cloud systems that charge for usable storage of data, not raw storage capacity). And similarly for the bandwidth.
  13. (30 secs) T: 15 min There are only four resources consumed in Hadoop operations when it comes to hardware: memory, CPU, storage/disk, and network. We will talk about networks within the datacenter at length as our fourth consideration. The best way to think of these resources is as configurable choices when deciding the type of server hardware you want to set up for Hadoop – you combine these resources into a configuration you can certify through performance evaluation.
  14. (1 min) T: 16 min Here are some of the possible combinations. As you can see, even today our clusters have generations of memory and CPU as well as disk drives configured as JBODs, giving us well over 10 configurations live in production. Beyond recognizing the configuration for the overall pool, we let the framework make decisions when managing a cluster with heterogeneous configurations. As far as storage is concerned, Hadoop has support for multiple types of storage, although we are chugging along fine with disk drives. We are now also composing clusters with large-memory boxes and GPUs, particularly for machine learning, and letting applications choose those configurations for specific needs based on labels assigned to them.
  15. (2 min) T: 18 min Beyond servers, the network is another important aspect of the overall setup. We host all shared compute clusters on a dedicated backplane. Within a datacenter, bandwidth between all compute racks is consistent, independent of the clusters. This allows for inter-cluster access at transfer rates similar to intra-cluster ones. Nodes (racks) can be moved between clusters without network reconfiguration or physical moves.
  16. (1 min) T: 19 min Big data applications can be demanding on the network, as they often present many-to-one traffic flows, otherwise known as in-cast. For example, you may be joining or accessing data between two Hadoop clusters. Large data extractions between HBase and Hadoop are common for web applications such as search. Between Storm and HBase, where incremental processing works great, fast bulk updates become an issue; similarly when moving data between Hadoop and Storm clusters. We have seen saturation, although mostly at TOR switches before anything else gets overwhelmed.
  17. (2 min) T: 21 min One way to compose your network among racks is through what we call BAS switches, a.k.a. big ass switches, or the big box fabric design (core – distribution – access layers); some call it aggregation and/or core. The host connect is 1G to the TOR; the switch uplink is 10G. The minimum all-to-all guarantee across the 15,000+ servers this backplane can support is 500 Gbps with a 2:1 oversubscription. Oversubscription is the ratio of contention should all devices send traffic at the same time (1G copper – 8-way – 200 Mbps; 10G fiber – 2-way – 500 Mbps). This typical datacenter design can get inflexible to scale, and $/server for 10G may be expensive. Also, given that nodes in Hadoop have IP addresses, you rely on L3/routing protocols and STP, which can again get inflexible and hard.
  18. (2 min) T: 23 min A new leaf-and-spine architecture, modeled on old circuit-switched networks (1950s), is becoming popular now for better latency and $/server cost. Now the RSW can be 48 x 10G, with 4 x 40G uplinks; the 4 x 40G uplinks can be broken down into 16 x 10G (up to 160G). The number of VCs you construct defines the uplinks, e.g. 2 VCs with 2 x 40G, and the composition of the VCs defines the TORs you can have. Each leaf here can support 16 TOR switches; with 32 leafs that is 16 x 32 = 512 racks, i.e. 20,480 hosts that can be connected. Each VC uplink is 40G; with 2 VCs that is 80G up against 400G of host capacity down, hence 5:1 oversubscription (400G / 80G) – to decrease oversubscription, build more VCs. The number of leafs is always 2x the number of spines for non-blocking operation: half the ports go to the spines and half to the racks. Each TOR has 4 x 10G up.
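  To sanity-check those oversubscription ratios, here is a small worked example; the CLOS figures follow from the slide (20,480 hosts / 512 racks = 40 hosts per rack at 10G, with 2 x 40G up), while the BAS per-rack host and uplink counts are assumed for illustration:

```java
// Oversubscription = host-facing (downlink) capacity / uplink capacity,
// i.e. the contention ratio if every host sent traffic at line rate at once.
public class Oversubscription {
    static double ratio(int hosts, double hostGbps, int uplinks, double uplinkGbps) {
        return (hosts * hostGbps) / (uplinks * uplinkGbps);
    }

    public static void main(String[] args) {
        // BAS design (assumed rack): 40 hosts x 1G down vs. 2 x 10G up => 2:1
        System.out.println("BAS rack:  " + ratio(40, 1.0, 2, 10.0) + ":1");
        // CLOS design: 40 hosts x 10G down vs. 2 x 40G VC uplinks => 5:1
        System.out.println("CLOS rack: " + ratio(40, 10.0, 2, 40.0) + ":1");
    }
}
```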
  19. (1 min) T: 24 min
  20. (1 min) T: 25 min
  21. (1 min) T: 26 min
  22. (1 min) T: 27 min User accounts are in LDAP, with netgroups for access control. Automated jobs/workflows run via headless accounts. Two Kerberos REALMs with one-way trust – Active Directory for individual users and a Hadoop-specific REALM for headless users and service principals. The architecture is SOX compliant. Kerberos provides the Generic Security Service API (GSSAPI) for RPC auth, and the Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) is used for HTTP auth on the web UIs.
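  A minimal sketch of how a headless (automated) account would authenticate on a secure cluster using Hadoop's UserGroupInformation API; the principal, realm, and keytab path are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class HeadlessLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Headless accounts log in from a keytab rather than a password,
        // against the Hadoop-specific realm (hypothetical principal/path).
        UserGroupInformation.loginUserFromKeytab(
                "svc_feedloader@GRID.EXAMPLE.COM",
                "/etc/keytabs/svc_feedloader.keytab");

        // Subsequent RPCs carry the Kerberos credentials via GSSAPI.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/projects")));
    }
}
```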
  23. (1 min) T: 28 min
  24. (1 min) T: 29 min
  25. (2 min) T: 31 min Each site has a corresponding prod / non-prod cluster for user applications to fail over to. Feeds are kept in sync using replication. Applications develop a BCP strategy using native Hadoop services (e.g. async replication); a copy-based sketch follows below.
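  One way applications keep a feed in sync across clusters is DistCp, Hadoop's bulk-copy tool; a rough sketch via its Java driver in Hadoop 2.x (the namenode addresses and feed paths are hypothetical):

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class FeedSync {
    public static void main(String[] args) throws Exception {
        // Copy one feed instance from the primary to the BCP cluster
        // (hypothetical namenode addresses and feed paths).
        DistCpOptions opts = new DistCpOptions(
                Collections.singletonList(
                        new Path("hdfs://colo1-nn/data/feed/2015/06/11")),
                new Path("hdfs://colo2-nn/data/feed/2015/06/11"));
        opts.setSyncFolder(true);  // only copy what changed (-update semantics)

        DistCp distcp = new DistCp(new Configuration(), opts);
        distcp.execute();  // submits the copy as a MapReduce job and waits
    }
}
```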
  26. (1 min) T: 32
  27. 1 min T: 33 min
  28. 1 min T: 34 min
  29. 1 min T: 35 mins
  30. (5 mins) T: 40 mins