2. Global Information Platforms
Evolving the Data Warehouse
Jeff Hammerbacher
VP of Products, Cloudera
March 9, 2009
3. What We’ll Cover Today
Oh Crap, He’s Gonna Ramble
▪ WARNING: highly speculative talk ahead
▪ What did we build at Facebook, and what did it accomplish?
▪ How will that infrastructure evolve further?
▪ There better be some cloud in there.
▪ In the language of this course, we’ll talk mostly about infrastructure; some thoughts on services and applications at the end.
▪ Please be skeptical and ask questions throughout
4. Facebook: Stage Two
You Don’t Even Wanna See Stage One
[Diagram: Scribe tier and MySQL tier feeding a data collection server, which loads an Oracle database server]
5. Facebook: Stage Four
Shades of an Information Platform
[Diagram: Scribe tier and MySQL tier feeding a Hadoop tier, alongside Oracle RAC servers]
6. Data Points: Single Organization
▪ 3 data centers: two on west coast, one on east coast
▪ Around 10K web servers, 1.5K database servers, 0.5K Memcache servers
▪ Around 0.7K Hadoop nodes, and growing quickly
▪ Relative data volumes
  ▪ Around 40 TB in Cassandra tier
  ▪ Around 60 TB in MySQL tier
  ▪ Around 1 PB in photos tier
  ▪ Around 2 PB in Hadoop tier
▪ 10 TB per day ingested into Hadoop, 15 TB generated
▪ IMPORTANT: Hadoop tier not retiring data!
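A tier that never retires data compounds quickly. A back-of-envelope sketch from the figures above (assuming the ~15 TB/day of generated data all sticks around, and ignoring replication and compression):

```python
# Back-of-envelope: how fast does a non-retiring Hadoop tier grow?
# Figures come from the slide; replication and compression are ignored.
current_pb = 2.0    # ~2 PB in the Hadoop tier today
daily_tb = 15.0     # ~15 TB of generated data lands per day
tb_per_pb = 1000.0

days_to_double = current_pb * tb_per_pb / daily_tb
print(f"Days until the tier doubles at today's rate: {days_to_double:.0f}")
```

At this rate the tier doubles in well under half a year, which is why capacity planning shows up on the management-challenges slide.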
7. Data Points: All Organizations
▪ 8 million servers shipped per year (IDC)
  ▪ 20% go to web companies (Rick Rashid)
  ▪ 33% go to HPC (Andy Bechtolsheim)
▪ 2.5 exabytes of external storage shipped per year (IDC)
▪ Data center costs (James Hamilton)
  ▪ 45% servers
  ▪ 25% power and cooling hardware
  ▪ 15% power draw
  ▪ 15% network
▪ Jim Gray
  ▪ “Disks will replace tapes, and disks will have infinite capacity. Period.”
  ▪ “Processors are going to migrate to where the transducers are.”
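Combining the IDC shipment figure with the quoted shares gives a rough sense of scale (simple arithmetic on the numbers above; the two estimates come from different sources and need not partition the total):

```python
# Rough scale of annual server shipments by destination,
# using only the figures quoted on the slide.
total_servers = 8_000_000          # shipped per year (IDC)
web_share, hpc_share = 0.20, 0.33  # Rashid and Bechtolsheim estimates

print(f"To web companies: {total_servers * web_share:,.0f}")  # 1,600,000
print(f"To HPC:           {total_servers * hpc_share:,.0f}")  # 2,640,000
```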
8. Information Platform Workloads
▪ Data collection: event logs, persistent storage, web
▪ Regular processing pipelines of varying granularity
  ▪ Summaries consumed by external users (e.g. Google Analytics)
  ▪ Summaries for internal reporting
  ▪ Ad optimization pipeline
  ▪ Experimentation platform pipeline
▪ Ad hoc analyses
▪ Data transformations and data integrity enforcement
▪ Document indexing
▪ Storage system bulk loading
▪ Model building
▪ Reports
▪ Internal storage system workloads: replication, CRC checks, rebalancing, archiving, short stroking
9. Management Challenges
Stuff I Didn’t Want to Worry About
▪ Pricing: how much should I be paying for my hardware?
▪ Physical management: rack and stack, disk replacement, etc.
▪ Backup/restore, archive, capacity planning
▪ Optimal node configuration: CPU, memory, disk, network
▪ Optimal software configuration
▪ Geographic diversity for data availability and low latency
▪ Access control, encryption, and other security measures
▪ Tiered storage: separation of “hot” and “cold” data
10. Okay, Let’s Get to the Cloud Stuff
▪ The cloud can help in removing management challenges
  ▪ Replicate highly valuable data into the cloud
  ▪ Archive cold data into the cloud
  ▪ Knit global data centers together with the cloud
▪ See “Watch for Goats in the Cloud” from David Slik of Bycast
  ▪ http://tr.im/h9LK
11. Cloud Challenges
▪ Current clouds are not optimized for data-intensive workloads
▪ Organizations own significant hardware assets
▪ Identity management
▪ Privacy and security
▪ Cloud seeding
  ▪ Moving data from the customer’s data center to the cloud
▪ Moving data within a mega-datacenter
▪ Moving data between clouds
12. Bare Metal Cloud (Hosting?) Providers
▪ OpSource: integrated billing
▪ SoftLayer: data center API
▪ 3tera: “virtual private data center”
▪ GoGrid Cloud Connect
▪ Rackspace Platform Hosting
▪ The Planet
▪ Liquid Web
▪ Layered Tech
▪ Internap
▪ Terremark Enterprise Cloud
13. Optimizing Hardware for DISC
We Need Less Power, Captain?
▪ “FAWN: A Fast Array of Wimpy Nodes”
  ▪ DHT built at CMU with XScale chips and flash storage
▪ “Low Power Amdahl Blades for Data Intensive Computing”
  ▪ Couple low-power CPUs to flash SSDs for DISC workloads
▪ “Seeding the Clouds”, Dan Reed
  ▪ Also “Microsoft Builds Atomic Cloud”
  ▪ Microsoft’s Cloud Computing Futures (CCF) team exploring clusters built from nodes using low-power Atom chips
14. Cloud Residue
What Happens to Existing Hardware?
▪ Cloud pricing is not competitive when a company already owns excess server capacity and employs a significant operations team
▪ How can we speed the transition to the cloud?
  ▪ Consolidate existing secondary market for hardware: purchase from companies with declining pageviews, e.g. MySpace
  ▪ Two birds, one stone: ship existing servers with initial data load to cloud provider (“cloud seeding”, see later slide)
  ▪ Wait it out: servers generally considered to have a three-year lifespan
▪ Where do servers go when they die?
15. Identity Across Clouds
▪ Configuring your LDAP server to speak to each new cloud utility is a pain
▪ Authentication and authorization systems being built by every new cloud provider
▪ Every organization imposes different standards on cloud providers
▪ Consumer identity platforms
  ▪ Facebook Connect
  ▪ OpenID + OAuth
▪ I don’t have a good answer here--any thoughts appreciated!
16. Privacy and Security
▪ Every organization must reinvent and build expertise in these mechanisms
▪ Components
  ▪ Physical security
  ▪ Cloud connection: authentication, authorization, encryption
  ▪ Audit logging
  ▪ Data obfuscation
  ▪ Separation from other customers in a multi-tenant environment
  ▪ Segregation of individual users within a customer’s cloud
  ▪ Storage retirement (disk shredding!)
  ▪ Controlling access of cloud provider employees
  ▪ Compliance, certification, and legislation
  ▪ Ramifications of a security breach
17. Cloud Seeding
Let’s Get This Party Started
▪ Freedom OSS offers an AWS-certified “Cloud Data Transfer Service”
  ▪ See http://www.freedomoss.com/clouddataingestion
▪ Bycast puts two or more “edge servers” on premise to perform initial data ingestion, then ships those servers to their cloud data center
  ▪ See http://tr.im/h9PH
▪ If you can’t physically ship the disks, leverage Metro Ethernet or a dedicated link
▪ Investigate modified network stacks (see following slide)
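To see why seeding by shipping hardware is attractive, compare wire time against courier time for a bulk load. A sketch; the 100 TB payload and the link speeds are illustrative assumptions, not figures from the talk:

```python
# Compare bulk-transfer time over a dedicated link vs. shipping disks.
# The 100 TB payload and link speeds are illustrative assumptions.
payload_tb = 100
payload_bits = payload_tb * 8e12  # 1 TB = 10^12 bytes

for name, bits_per_sec in [("100 Mbps link", 100e6),
                           ("1 Gbps link", 1e9),
                           ("10 Gbps link", 10e9)]:
    days = payload_bits / bits_per_sec / 86_400
    print(f"{name}: {days:,.1f} days")

# An overnight courier moving the same 100 TB in ~1 day achieves an
# effective throughput of roughly 9.3 Gbps -- and the wire numbers
# above assume the link is saturated, which TCP rarely manages alone.
```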
18. Bulk Data Transfer Between Data Centers
▪ Companies
  ▪ WAM!NET (bought by SAVVIS)
  ▪ Aspera Software
▪ Protocols
  ▪ GridFTP
  ▪ UDT
▪ Unix utility: bbcp
▪ Modify congestion control
▪ WAN optimization tricks: compress, transfer deltas, cache, etc.
▪ Peering, transit, OC levels, all that good stuff
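Protocols like GridFTP and UDT, and congestion-control tweaks, matter because a single TCP stream is capped by its window divided by the round-trip time. A sketch of the bandwidth-delay product arithmetic; the link speed and RTT are illustrative assumptions:

```python
# Why stock TCP struggles on long fat pipes: one stream's throughput
# is bounded by window_size / RTT. Link speed and RTT are illustrative.
rtt_s = 0.080               # 80 ms coast-to-coast round trip
link_bps = 10e9             # 10 Gbps dedicated link
default_window = 64 * 1024  # bytes; a classic default TCP window

# Window needed to keep the pipe full (bandwidth-delay product):
bdp_bytes = link_bps / 8 * rtt_s
print(f"BDP: {bdp_bytes / 1e6:.0f} MB of data must be in flight")

# Throughput achievable with the default window:
throughput_bps = default_window * 8 / rtt_s
print(f"Default-window throughput: {throughput_bps / 1e6:.2f} Mbps")
```

A 100 MB in-flight requirement against a 64 KB window is the gap that parallel streams, window scaling, and UDP-based protocols like UDT try to close.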
19. Data Transfer Within a Data Center
▪ Hierarchical topology
  ▪ Border routers, core switches, and top-of-rack switches
  ▪ Top-of-rack switches usually oversubscribed
▪ Diversity of protocols
  ▪ Ethernet, InfiniBand, Fibre Channel, PCI Express, etc.
▪ Networking companies working to flatten topology and unify protocols
  ▪ Cisco: Data Center Ethernet (DCE)
  ▪ Juniper: Stratus, a.k.a. Data Center Fabric (DCF)
▪ MapReduce architected to push computation to the data; will such logic be necessary in the near future?
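Top-of-rack oversubscription is easy to quantify: aggregate server bandwidth versus uplink capacity. The rack size and link speeds below are illustrative assumptions:

```python
# Top-of-rack oversubscription: aggregate server bandwidth vs. uplink.
# Rack size and link speeds are illustrative assumptions.
servers_per_rack = 40
server_link_gbps = 1       # each server has a 1 Gbps NIC
uplink_gbps = 2 * 10       # two 10 Gbps uplinks to the core

downlink_gbps = servers_per_rack * server_link_gbps
ratio = downlink_gbps / uplink_gbps
print(f"Oversubscription ratio: {ratio:.1f}:1")  # 2.0:1
```

Any ratio above 1:1 means cross-rack transfers can't all run at line rate, which is exactly why MapReduce schedules computation next to the data.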
20. Data Transfer Between Clouds
▪ Most cloud providers present novel APIs for data retrieval
  ▪ e.g. S3, SimpleDB, App Engine data store, etc.
▪ It’s usually cheaper (or free) to transfer data within a cloud
▪ Standards and organizations are emerging
  ▪ Open Virtualization Format (OVF)
  ▪ Open Cloud Consortium (OCC)
  ▪ Cloud Computing Interoperability Forum (CCIF)
  ▪ “Unified Cloud Interface” (UCI)
    ▪ Their diagrams scare me, a little
21. Service and Application Changes
▪ “Pay as you go” is the shared motto of Dataspaces and the Cloud
  ▪ Not a coincidence
▪ Persisting data into the information platform should be trivial
▪ Layer storage and processing capabilities onto the platform
  ▪ Catalog
  ▪ Search
  ▪ Query
  ▪ Statistics and Machine Learning
▪ Materialize data into the storage system best suited to the workload
▪ Leverage workload metadata to get better over time
22. Future Stages
Potential Evolutions, pt. 1
▪ Global snapshots of the distributed file system
▪ Tiered storage to accommodate “cold” data
▪ Streaming computations over live data
▪ Higher-level libraries for text mining, linear algebra, etc.
▪ Tighter coupling between data collection, job scheduling, and reporting via a single metadata repository
▪ Testing and debugging frameworks
▪ Proliferation of data marts/sandboxes
▪ Accommodate compute-intensive workloads
23. Future Stages
Potential Evolutions, pt. 2
▪ Seamless collection of data sets from the web
▪ Wider variety of physical operators (cf. System R* through Dryad)
▪ Separate access APIs for different classes of users
  ▪ Infrastructure engineers
  ▪ Product engineers
  ▪ Data scientists
  ▪ Business analysts
▪ DSLs for domain-specific work
▪ Utilize browser as client (AJAX, Comet, Gears, etc.)
24. Future Stages
Potential Evolutions, pt. 3
▪ Workflow cloning
▪ Recommended analyses based on workload and user metadata
▪ Automatic keyword search
▪ Integrity constraint checking and enforcement
▪ Granular access controls
▪ Metadata evolution history
▪ Table statistics and Hive query optimization
▪ Utilization optimization regularized by customer satisfaction
▪ Currency-based scheduling (cf. Thomas Sandholm’s work)
25. Random Set of References
▪ For a more complete bibliography, just ask
▪ “The Cost of a Cloud”
▪ “Above the Clouds”
▪ “A Conversation with Jim Gray”
▪ “Rules of Thumb in Data Engineering”
▪ “Distributed Computing Economics”
▪ “From Databases to Dataspaces”
▪ Dryad and SPC papers
26. (c) 2009 Cloudera, Inc. or its licensors. “Cloudera” is a registered trademark of Cloudera, Inc. All rights reserved. 1.0