2. Global Information Platforms
Evolving the Data Warehouse
Jeff Hammerbacher
VP of Products, Cloudera
March 9, 2009
3. What We’ll Cover Today
Oh Crap, He’s Gonna Ramble
▪ WARNING: highly speculative talk ahead
▪ What did we build at Facebook, and what did it accomplish?
▪ How will that infrastructure evolve further?
▪ There better be some cloud in there.
▪ In the language of this course, we’ll talk mostly about infrastructure; some thoughts on services and applications at the end.
▪ Please be skeptical and ask questions throughout
4. Facebook: Stage Two
You Don’t Even Wanna See Stage One
[Diagram: Scribe tier and MySQL tier feeding a data collection server, which loads an Oracle database server]
5. Facebook: Stage Four
Shades of an Information Platform
[Diagram: Scribe tier and MySQL tier feeding a Hadoop tier, alongside Oracle RAC servers]
6. Data Points: Single Organization
▪ 3 data centers: two on west coast, one on east coast
▪ Around 10K web servers, 1.5K database servers, 0.5K Memcache servers
▪ Around 0.7K Hadoop nodes, and growing quickly
▪ Relative data volumes
  ▪ Around 40 TB in Cassandra tier
  ▪ Around 60 TB in MySQL tier
  ▪ Around 1 PB in photos tier
  ▪ Around 2 PB in Hadoop tier
▪ 10 TB per day ingested into Hadoop, 15 TB generated
▪ IMPORTANT: Hadoop tier not retiring data!
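A tier that never retires data compounds quickly. A back-of-envelope sketch from the figures above (assuming the ~15 TB/day of generated data all sticks around, and ignoring replication and compression):

```python
# Back-of-envelope: how fast does a non-retiring Hadoop tier grow?
# Figures come from the slide; replication and compression are ignored.
current_pb = 2.0    # ~2 PB in the Hadoop tier today
daily_tb = 15.0     # ~15 TB of generated data lands per day
tb_per_pb = 1000.0

days_to_double = current_pb * tb_per_pb / daily_tb
print(f"Days until the tier doubles at today's rate: {days_to_double:.0f}")
```

At this rate the tier doubles in well under half a year, which is why capacity planning shows up on the management-challenges slide.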
7. Data Points: All Organizations
▪ 8 million servers shipped per year (IDC)
  ▪ 20% go to web companies (Rick Rashid)
  ▪ 33% go to HPC (Andy Bechtolsheim)
▪ 2.5 exabytes of external storage shipped per year (IDC)
▪ Data center costs (James Hamilton)
  ▪ 45% servers
  ▪ 25% power and cooling hardware
  ▪ 15% power draw
  ▪ 15% network
▪ Jim Gray
  ▪ “Disks will replace tapes, and disks will have infinite capacity. Period.”
  ▪ “Processors are going to migrate to where the transducers are.”
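Combining the IDC shipment figure with the quoted shares gives a rough sense of scale (simple arithmetic on the numbers above; the two estimates come from different sources and need not partition the total):

```python
# Rough scale of annual server shipments by destination,
# using only the figures quoted on the slide.
total_servers = 8_000_000          # shipped per year (IDC)
web_share, hpc_share = 0.20, 0.33  # Rashid and Bechtolsheim estimates

print(f"To web companies: {total_servers * web_share:,.0f}")  # 1,600,000
print(f"To HPC:           {total_servers * hpc_share:,.0f}")  # 2,640,000
```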
8. Information Platform Workloads
▪ Data collection: event logs, persistent storage, web
▪ Regular processing pipelines of varying granularity
  ▪ Summaries consumed by external users (e.g. Google Analytics)
  ▪ Summaries for internal reporting
  ▪ Ad optimization pipeline
  ▪ Experimentation platform pipeline
▪ Ad hoc analyses
▪ Data transformations and data integrity enforcement
▪ Document indexing
▪ Storage system bulk loading
▪ Model building
▪ Reports
▪ Internal storage system workloads: replication, CRC checks, rebalancing, archiving, short stroking
9. Management Challenges
Stuff I Didn’t Want to Worry About
▪ Pricing: how much should I be paying for my hardware?
▪ Physical management: rack and stack, disk replacement, etc.
▪ Backup/restore, archive, capacity planning
▪ Optimal node configuration: CPU, memory, disk, network
▪ Optimal software configuration
▪ Geographic diversity for data availability and low latency
▪ Access control, encryption, and other security measures
▪ Tiered storage: separation of “hot” and “cold” data
10. Okay, Let’s Get to the Cloud Stuff
▪ The cloud can help in removing management challenges
  ▪ Replicate highly valuable data into the cloud
  ▪ Archive cold data into the cloud
  ▪ Knit global data centers together with the cloud
▪ See “Watch for Goats in the Cloud” from David Slik of Bycast
  ▪ http://tr.im/h9LK
11. Cloud Challenges
▪ Current clouds are not optimized for data-intensive workloads
▪ Organizations own significant hardware assets
▪ Identity management
▪ Privacy and security
▪ Cloud seeding
  ▪ Moving data from the customer’s data center to the cloud
▪ Moving data within a mega-datacenter
▪ Moving data between clouds
12. Bare Metal Cloud (Hosting?) Providers
▪ OpSource: integrated billing
▪ SoftLayer: data center API
▪ 3tera: “virtual private data center”
▪ GoGrid Cloud Connect
▪ Rackspace Platform Hosting
▪ The Planet
▪ Liquid Web
▪ Layered Tech
▪ Internap
▪ Terremark Enterprise Cloud
13. Optimizing Hardware for DISC
We Need Less Power, Captain?
▪ “FAWN: A Fast Array of Wimpy Nodes”
  ▪ DHT built at CMU with XScale chips and flash storage
▪ “Low Power Amdahl Blades for Data Intensive Computing”
  ▪ Couple low-power CPUs to flash SSDs for DISC workloads
▪ “Seeding the Clouds”, Dan Reed
  ▪ Also “Microsoft Builds Atomic Cloud”
  ▪ Microsoft’s Cloud Computing Futures (CCF) team exploring clusters built from nodes using low-power Atom chips
14. Cloud Residue
What Happens to Existing Hardware?
▪ Cloud pricing is not competitive when a company already owns excess server capacity and employs a significant operations team
▪ How can we speed the transition to the cloud?
  ▪ Consolidate existing secondary market for hardware: purchase from companies with declining pageviews, e.g. MySpace
  ▪ Two birds, one stone: ship existing servers with initial data load to cloud provider (“cloud seeding”, see later slide)
  ▪ Wait it out: servers generally considered to have a three-year lifespan
▪ Where do servers go when they die?
15. Identity Across Clouds
▪ Configuring your LDAP server to speak to each new cloud utility is a pain
▪ Authentication and authorization systems being built by every new cloud provider
▪ Every organization imposes different standards on cloud providers
▪ Consumer identity platforms
  ▪ Facebook Connect
  ▪ OpenID + OAuth
▪ I don’t have a good answer here--any thoughts appreciated!
16. Privacy and Security
▪ Every organization must reinvent and build expertise in these mechanisms
▪ Components
  ▪ Physical security
  ▪ Cloud connection: authentication, authorization, encryption
  ▪ Audit logging
  ▪ Data obfuscation
  ▪ Separation from other customers in a multi-tenant environment
  ▪ Segregation of individual users within a customer’s cloud
  ▪ Storage retirement (disk shredding!)
  ▪ Controlling access of cloud provider employees
  ▪ Compliance, certification, and legislation
  ▪ Ramifications of a security breach
17. Cloud Seeding
Let’s Get This Party Started
▪ Freedom OSS offers an AWS-certified “Cloud Data Transfer Service”
  ▪ See http://www.freedomoss.com/clouddataingestion
▪ Bycast puts two or more “edge servers” on premise to perform initial data ingestion, then ships those servers to their cloud data center
  ▪ See http://tr.im/h9PH
▪ If you can’t physically ship the disks, leverage Metro Ethernet or a dedicated link
▪ Investigate modified network stacks (see following slide)
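To see why seeding by shipping hardware is attractive, compare wire time against courier time for a bulk load. A sketch; the 100 TB payload and the link speeds are illustrative assumptions, not figures from the talk:

```python
# Compare bulk-transfer time over a dedicated link vs. shipping disks.
# The 100 TB payload and link speeds are illustrative assumptions.
payload_tb = 100
payload_bits = payload_tb * 8e12  # 1 TB = 10^12 bytes

for name, bits_per_sec in [("100 Mbps link", 100e6),
                           ("1 Gbps link", 1e9),
                           ("10 Gbps link", 10e9)]:
    days = payload_bits / bits_per_sec / 86_400
    print(f"{name}: {days:,.1f} days")

# An overnight courier moving the same 100 TB in ~1 day achieves an
# effective throughput of roughly 9.3 Gbps -- and the wire numbers
# above assume the link is saturated, which TCP rarely manages alone.
```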
18. Bulk Data Transfer Between Data Centers
▪ Companies
  ▪ WAM!NET (bought by SAVVIS)
  ▪ Aspera Software
▪ Protocols
  ▪ GridFTP
  ▪ UDT
▪ Unix utility: bbcp
▪ Modify congestion control
▪ WAN optimization tricks: compress, transfer deltas, cache, etc.
▪ Peering, transit, OC levels, all that good stuff
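Protocols like GridFTP and UDT, and congestion-control tweaks, matter because a single TCP stream is capped by its window divided by the round-trip time. A sketch of the bandwidth-delay product arithmetic; the link speed and RTT are illustrative assumptions:

```python
# Why stock TCP struggles on long fat pipes: one stream's throughput
# is bounded by window_size / RTT. Link speed and RTT are illustrative.
rtt_s = 0.080               # 80 ms coast-to-coast round trip
link_bps = 10e9             # 10 Gbps dedicated link
default_window = 64 * 1024  # bytes; a classic default TCP window

# Window needed to keep the pipe full (bandwidth-delay product):
bdp_bytes = link_bps / 8 * rtt_s
print(f"BDP: {bdp_bytes / 1e6:.0f} MB of data must be in flight")

# Throughput achievable with the default window:
throughput_bps = default_window * 8 / rtt_s
print(f"Default-window throughput: {throughput_bps / 1e6:.2f} Mbps")
```

A 100 MB in-flight requirement against a 64 KB window is the gap that parallel streams, window scaling, and UDP-based protocols like UDT try to close.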
19. Data Transfer Within a Data Center
▪ Hierarchical topology
  ▪ Border routers, core switches, and top-of-rack switches
  ▪ Top-of-rack switches usually oversubscribed
▪ Diversity of protocols
  ▪ Ethernet, InfiniBand, Fibre Channel, PCI Express, etc.
▪ Networking companies working to flatten topology and unify protocols
  ▪ Cisco: Data Center Ethernet (DCE)
  ▪ Juniper: Stratus, a.k.a. Data Center Fabric (DCF)
▪ MapReduce architected to push computation to the data; will such logic be necessary in the near future?
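Top-of-rack oversubscription is easy to quantify: aggregate server bandwidth versus uplink capacity. The rack size and link speeds below are illustrative assumptions:

```python
# Top-of-rack oversubscription: aggregate server bandwidth vs. uplink.
# Rack size and link speeds are illustrative assumptions.
servers_per_rack = 40
server_link_gbps = 1       # each server has a 1 Gbps NIC
uplink_gbps = 2 * 10       # two 10 Gbps uplinks to the core

downlink_gbps = servers_per_rack * server_link_gbps
ratio = downlink_gbps / uplink_gbps
print(f"Oversubscription ratio: {ratio:.1f}:1")  # 2.0:1
```

Any ratio above 1:1 means cross-rack transfers can't all run at line rate, which is exactly why MapReduce schedules computation next to the data.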
20. Data Transfer Between Clouds
▪ Most cloud providers present novel APIs for data retrieval
  ▪ e.g. S3, SimpleDB, App Engine data store, etc.
▪ It’s usually cheaper (or free) to transfer data within a cloud
▪ Standards and organizations are emerging
  ▪ Open Virtualization Format (OVF)
  ▪ Open Cloud Consortium (OCC)
  ▪ Cloud Computing Interoperability Forum (CCIF)
  ▪ “Unified Cloud Interface” (UCI)
    ▪ Their diagrams scare me, a little
21. Service and Application Changes
▪ “Pay as you go” is the shared motto of Dataspaces and the Cloud
  ▪ Not a coincidence
▪ Persisting data into the information platform should be trivial
▪ Layer storage and processing capabilities onto the platform
  ▪ Catalog
  ▪ Search
  ▪ Query
  ▪ Statistics and Machine Learning
▪ Materialize data into the storage system best suited to the workload
▪ Leverage workload metadata to get better over time
22. Future Stages
Potential Evolutions, pt. 1
▪ Global snapshots of the distributed file system
▪ Tiered storage to accommodate “cold” data
▪ Streaming computations over live data
▪ Higher-level libraries for text mining, linear algebra, etc.
▪ Tighter coupling between data collection, job scheduling, and reporting via a single metadata repository
▪ Testing and debugging frameworks
▪ Proliferation of data marts/sandboxes
▪ Accommodate compute-intensive workloads
23. Future Stages
Potential Evolutions, pt. 2
▪ Seamless collection of data sets from the web
▪ Wider variety of physical operators (cf. System R* through Dryad)
▪ Separate access APIs for different classes of users
  ▪ Infrastructure engineers
  ▪ Product engineers
  ▪ Data scientists
  ▪ Business analysts
▪ DSLs for domain-specific work
▪ Utilize browser as client (AJAX, Comet, Gears, etc.)
24. Future Stages
Potential Evolutions, pt. 3
▪ Workflow cloning
▪ Recommended analyses based on workload and user metadata
▪ Automatic keyword search
▪ Integrity constraint checking and enforcement
▪ Granular access controls
▪ Metadata evolution history
▪ Table statistics and Hive query optimization
▪ Utilization optimization regularized by customer satisfaction
▪ Currency-based scheduling (cf. Thomas Sandholm’s work)
25. Random Set of References
▪ For a more complete bibliography, just ask
▪ “The Cost of a Cloud”
▪ “Above the Clouds”
▪ “A Conversation with Jim Gray”
▪ “Rules of Thumb in Data Engineering”
▪ “Distributed Computing Economics”
▪ “From Databases to Dataspaces”
▪ Dryad and SPC papers
26. (c) 2009 Cloudera, Inc. or its licensors. “Cloudera” is a registered trademark of Cloudera, Inc. All rights reserved. 1.0