Monday, March 9, 2009
Global Information Platforms
         Evolving the Data Warehouse


         Jeff Hammerbacher
         VP of Products, Cloudera
         March 9, 2009



What We’ll Cover Today
         Oh Crap, He’s Gonna Ramble

         ▪ WARNING: highly speculative talk ahead

         ▪ What did we build at Facebook, and what did it accomplish?
         ▪ How will that infrastructure evolve further?
         ▪ There better be some cloud in there.
         ▪ In the language of this course, we’ll talk mostly about infrastructure; some thoughts on services and applications at the end.
         ▪ Please be skeptical and ask questions throughout




Facebook: Stage Two
         You Don’t Even Wanna See Stage One

         [Diagram, top to bottom: Scribe Tier and MySQL Tier; Data Collection Server; Oracle Database Server]




Facebook: Stage Four
         Shades of an Information Platform

         [Diagram, top to bottom: Scribe Tier and MySQL Tier; Hadoop Tier; Oracle RAC Servers]




Data Points: Single Organization
         ▪ 3 data centers: two on west coast, one on east coast
         ▪ Around 10K web servers, 1.5K Database servers, 0.5K Memcache
         ▪ Around 0.7K Hadoop nodes, and growing quickly
         ▪ Relative data volumes
             ▪ Around 40 TB in Cassandra tier
             ▪ Around 60 TB in MySQL tier
             ▪ Around 1 PB in photos tier
             ▪ Around 2 PB in Hadoop tier
         ▪ 10 TB per day ingested into Hadoop, 15 TB generated
         ▪ IMPORTANT: Hadoop tier not retiring data!
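
A back-of-the-envelope reading of the Hadoop numbers above, as a minimal sketch. If the tier adds roughly 10 TB ingested plus 15 TB generated per day and retires nothing, the footprint grows linearly. The sketch assumes constant daily rates, decimal units (1 PB = 1,000 TB), and that the 2 PB total and the daily figures are measured the same way (the slide does not say whether replication is included). The function name `days_until` is hypothetical.

```python
def days_until(target_tb, current_tb=2000.0, ingested_tb_per_day=10.0,
               generated_tb_per_day=15.0):
    """Days until the tier reaches target_tb, assuming linear growth and no retirement."""
    daily_growth = ingested_tb_per_day + generated_tb_per_day
    if daily_growth <= 0:
        raise ValueError("daily growth must be positive")
    # A target at or below the current size is already reached.
    return max(0.0, (target_tb - current_tb) / daily_growth)


if __name__ == "__main__":
    # Time for the tier to double from 2 PB to 4 PB at 25 TB/day.
    print(days_until(4000.0))  # 80.0
```

Under these assumptions the tier would double in under three months, which is one way to read the "not retiring data" warning.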




Data Points: All Organizations
         ▪ 8 million servers shipped per year (IDC)
             ▪ 20% go to web companies (Rick Rashid)
             ▪ 33% go to HPC (Andy Bechtolsheim)
         ▪ 2.5 exabytes of external storage shipped per year (IDC)
         ▪ Data center costs (James Hamilton)
             ▪ 45% servers
             ▪ 25% power and cooling hardware
             ▪ 15% power draw
             ▪ 15% network
         ▪ Jim Gray
             ▪ “Disks will replace tapes, and disks will have infinite capacity. Period.”
             ▪ “processors are going to migrate to where the transducers are.”




Information Platform Workloads
         ▪ Data collection: event logs, persistent storage, web
         ▪ Regular processing pipelines of varying granularity
             ▪ Summaries consumed by external users (e.g. Google Analytics)
             ▪ Summaries for internal reporting
             ▪ Ad optimization pipeline
             ▪ Experimentation platform pipeline
         ▪ Ad hoc analyses
         ▪ Data transformations and data integrity enforcement
         ▪ Document indexing
         ▪ Storage system bulk loading
         ▪ Model building
         ▪ Reports
         ▪ Internal storage system workloads: replication, CRC checks, rebalancing, archiving, short stroking




Management Challenges
         Stuff I Didn’t Want to Worry About

         ▪ Pricing: how much should I be paying for my hardware?
         ▪ Physical management: rack and stack, disk replacement, etc.
         ▪ Backup/restore, archive, capacity planning
         ▪ Optimal node configuration: CPU, memory, disk, network
         ▪ Optimal software configuration
         ▪ Geographic diversity for data availability and low latency
         ▪ Access control, encryption, and other security measures
         ▪ Tiered storage: separation of “hot” and “cold” data




Okay, Let’s Get to the Cloud Stuff
         ▪ The cloud can help in removing management challenges
             ▪ Replicate highly valuable data into the cloud
             ▪ Archive cold data into the cloud
             ▪ Knit global data centers together with the cloud

         ▪ See “Watch for Goats in the Cloud” from David Slik of Bycast
             ▪ http://tr.im/h9LK




Cloud Challenges
         ▪ Current clouds are not optimized for data intensive workloads
         ▪ Organizations own significant hardware assets
         ▪ Identity management
         ▪ Privacy and security
         ▪ Cloud seeding
         ▪ Moving data from the customer’s data center to the cloud
         ▪ Moving data within a mega-datacenter
         ▪ Moving data between clouds




Bare Metal Cloud (Hosting?) Providers
         ▪ OpSource: integrated billing
         ▪ SoftLayer: data center API
         ▪ 3tera: “virtual private data center”
         ▪ GoGrid Cloud Connect
         ▪ Rackspace Platform Hosting
         ▪ The Planet
         ▪ Liquid Web
         ▪ Layered Tech
         ▪ Internap
         ▪ Terremark Enterprise Cloud



Optimizing Hardware for DISC
         We Need Less Power, Captain?

         ▪ “FAWN: A Fast Array of Wimpy Nodes”
             ▪ DHT built at CMU with XScale chips and flash storage
         ▪ “Low Power Amdahl Blades for Data Intensive Computing”
             ▪ Couple low-power CPUs to flash SSDs for DISC workloads
         ▪ “Seeding the Clouds”, Dan Reed
             ▪ Also “Microsoft Builds Atomic Cloud”
             ▪ Microsoft’s Cloud Computing Futures (CCF) team exploring clusters built from nodes using low-power Atom chips



Cloud Residue
         What Happens to Existing Hardware?

         ▪ Cloud pricing is not competitive when a company already owns excess server capacity and employs a significant operations team
         ▪ How can we speed the transition to the cloud?
             ▪ Consolidate existing secondary market for hardware
                 ▪ purchase from companies with declining pageviews, e.g. MySpace
             ▪ Two birds, one stone: ship existing servers with initial data load to cloud provider (“cloud seeding”, see later slide)
             ▪ Wait it out: servers generally considered to have a three year lifespan
         ▪ Where do servers go when they die?




Identity Across Clouds
         ▪ Configuring your LDAP server to speak to each new cloud utility is a pain
         ▪ Authentication and authorization systems being built by every new cloud provider
         ▪ Every organization imposes different standards on cloud providers
         ▪ Consumer identity platforms
             ▪ Facebook Connect
             ▪ OpenID + OAuth
         ▪ I don’t have a good answer here--any thoughts appreciated!




Privacy and Security
         ▪ Every organization must reinvent and build expertise in these mechanisms
         ▪ Components
             ▪ Physical security
             ▪ Cloud connection: authentication, authorization, encryption
             ▪ Audit logging
             ▪ Data obfuscation
             ▪ Separation from other customers in multi-tenant environment
             ▪ Segregation of individual users within a customer’s cloud
             ▪ Storage retirement (disk shredding!)
             ▪ Controlling access of cloud provider employees
             ▪ Compliance, certification, and legislation
         ▪ Ramifications of security breach



Cloud Seeding
         Let’s Get This Party Started

         ▪ Freedom OSS offers AWS-certified “Cloud Data Transfer Service”
             ▪ See http://www.freedomoss.com/clouddataingestion
         ▪ Bycast puts two or more “edge servers” on premise to perform initial data ingestion, then ships those servers to their cloud data center
             ▪ See http://tr.im/h9PH
         ▪ If you can’t physically ship the disks, leverage Metro Ethernet or a dedicated link
         ▪ Investigate modified network stacks (see following slide)
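
The ship-versus-link trade-off above comes down to simple arithmetic, sketched here. The data size, link speed, link efficiency, and shipping time are all illustrative assumptions and not figures from the talk; `transfer_days` and `faster_to_ship` are hypothetical helper names.

```python
def transfer_days(data_tb, link_gbps, efficiency=0.8):
    """Days to move data_tb terabytes over a link_gbps link at the given utilization."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400.0


def faster_to_ship(data_tb, link_gbps, shipping_days, efficiency=0.8):
    """True if physically shipping the servers beats the network transfer."""
    return shipping_days < transfer_days(data_tb, link_gbps, efficiency)


if __name__ == "__main__":
    # Assumed scenario: 100 TB initial load, 1 Gbps dedicated link, 3 days to ship.
    print(round(transfer_days(100, 1), 1))         # 11.6
    print(faster_to_ship(100, 1, shipping_days=3)) # True
```

With these assumed numbers shipping wins comfortably; a faster link or a smaller initial load flips the answer.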




Bulk Data Transfer Between Data Centers
         ▪ Companies
             ▪ WAM!NET (bought by SAVVIS)
             ▪ Aspera Software
         ▪ Protocols
             ▪ GridFTP
             ▪ UDT
         ▪ Unix utility: bbcp
         ▪ Modify congestion control
         ▪ WAN optimization tricks: compress, transfer deltas, cache, etc.
         ▪ Peering, transit, OC levels, all that good stuff
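
To make "compress, transfer deltas" concrete, here is a minimal sketch of one simple approach: hash fixed-size blocks, send only the blocks whose hash changed, and compress what is sent. This is an illustration under stated assumptions, not how any of the products or protocols above work; it uses fixed block boundaries rather than the rolling checksums real tools often use, and the 4 KB block size and all function names are hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size; real tools tune this


def block_hashes(data, block_size=BLOCK_SIZE):
    """SHA-256 digest of each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def make_delta(new, old_hashes, block_size=BLOCK_SIZE):
    """Return (total_length, {block_index: compressed_block}) for changed blocks."""
    changed = {}
    for index, start in enumerate(range(0, len(new), block_size)):
        block = new[start:start + block_size]
        is_new_block = index >= len(old_hashes)
        if is_new_block or hashlib.sha256(block).digest() != old_hashes[index]:
            changed[index] = zlib.compress(block)
    return len(new), changed


def apply_delta(old, delta, block_size=BLOCK_SIZE):
    """Rebuild the new data from the old copy plus the delta."""
    total_length, changed = delta
    out = bytearray()
    for index, start in enumerate(range(0, total_length, block_size)):
        if index in changed:
            out += zlib.decompress(changed[index])
        else:
            out += old[start:start + block_size]
    return bytes(out[:total_length])


if __name__ == "__main__":
    old = b"a" * 10000
    new = b"a" * 4096 + b"b" * 4096 + b"a" * 1808
    delta = make_delta(new, block_hashes(old))
    print(sorted(delta[1]))                # [1]
    print(apply_delta(old, delta) == new)  # True
```

The receiver sends `block_hashes` of its copy, the sender replies with the delta, and only one of three blocks crosses the WAN in the example.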




Data Transfer Within a Data Center
         ▪ Hierarchical topology
             ▪ Border routers, core switches, and top of rack switches
             ▪ Top of rack switches usually oversubscribed
         ▪ Diversity of protocols
             ▪ Ethernet, Infiniband, Fibre Channel, PCI Express, etc.
         ▪ Networking companies working to flatten topology and unify protocols
             ▪ Cisco: Data Center Ethernet (DCE)
             ▪ Juniper: Stratus, a.k.a. Data Center Fabric (DCF)
         ▪ MapReduce architected to push computation to the data; will such logic be necessary in the near future?
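
Oversubscription at the top of rack switch can be put in numbers with a short sketch. The rack layout used in the example (40 servers at 1 Gbps, two 10 Gbps uplinks) is an assumption for illustration and not a figure from the talk; the function names are hypothetical.

```python
def oversubscription_ratio(servers, server_gbps, uplinks, uplink_gbps):
    """Ratio of total server-facing bandwidth to total uplink bandwidth."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("need at least one uplink with positive bandwidth")
    return (servers * server_gbps) / (uplinks * uplink_gbps)


def worst_case_gbps_per_server(servers, server_gbps, uplinks, uplink_gbps):
    """Off-rack bandwidth per server when every server sends off-rack at once."""
    fair_share = (uplinks * uplink_gbps) / servers
    return min(server_gbps, fair_share)


if __name__ == "__main__":
    # Assumed rack: 40 servers at 1 Gbps each, 2 uplinks at 10 Gbps each.
    print(oversubscription_ratio(40, 1, 2, 10))      # 2.0
    print(worst_case_gbps_per_server(40, 1, 2, 10))  # 0.5
```

A ratio above 1 means reading from a node in another rack can be slower than reading within the rack, which is the kind of gap that scheduling computation near the data works around.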

Data Transfer Between Clouds
         ▪ Most cloud providers present novel APIs for data retrieval
             ▪ e.g. S3, SimpleDB, App Engine data store, etc.
         ▪ It’s usually cheaper (or free) to transfer data within a cloud
         ▪ Standards and organizations are emerging
             ▪ Open Virtualization Format (OVF)
             ▪ Open Cloud Consortium (OCC)
             ▪ Cloud Computing Interoperability Forum (CCIF)
                 ▪ “Unified Cloud Interface” (UCI)
                 ▪ Their diagrams scare me, a little




Service and Application Changes
         ▪ “Pay as you go” is shared motto of Dataspaces and the Cloud
             ▪ Not a coincidence
         ▪ Persisting data into information platform should be trivial
         ▪ Layer storage and processing capabilities onto platform
             ▪ Catalog
             ▪ Search
             ▪ Query
             ▪ Statistics and Machine Learning
         ▪ Materialize data into storage system best suited to workload
         ▪ Leverage workload metadata to get better over time




Future Stages
         Potential Evolutions, pt. 1

         ▪ Global snapshots of the distributed file system
         ▪ Tiered storage to accommodate “cold” data
         ▪ Streaming computations over live data
         ▪ Higher-level libraries for text mining, linear algebra, etc.
         ▪ Tighter coupling between data collection, job scheduling, and reporting via a single metadata repository
         ▪ Testing and debugging frameworks
         ▪ Proliferation of data marts/sandboxes
         ▪ Accommodate compute-intensive workloads




Future Stages
         Potential Evolutions, pt. 2

         ▪ Seamless collection of data sets from the web
         ▪ Wider variety of physical operators (cf. System R* through Dryad)
         ▪ Separate access APIs for different classes of users
             ▪ Infrastructure engineers
             ▪ Product engineers
             ▪ Data scientists
             ▪ Business analysts
             ▪ DSLs for domain-specific work
         ▪ Utilize browser as client (AJAX, Comet, Gears, etc.)



Future Stages
         Potential Evolutions, pt. 3

         ▪ Workflow cloning
         ▪ Recommended analyses based on workload and user metadata
         ▪ Automatic keyword search
         ▪ Integrity constraint checking and enforcement
         ▪ Granular access controls
         ▪ Metadata evolution history
         ▪ Table statistics and Hive query optimization
         ▪ Utilization optimization regularized by customer satisfaction
         ▪ Currency-based scheduling (cf. Thomas Sandholm’s work)
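
One simple form that currency-based scheduling can take is proportional share: each user bids some amount of currency and receives cluster capacity in proportion to that bid. The sketch below is a generic illustration under that assumption and is not a description of the scheme in the cited work; the user names, bid values, and function name are hypothetical.

```python
def proportional_shares(bids, total_slots):
    """Split total_slots among users in proportion to their bids."""
    total_bid = sum(bids.values())
    if total_bid <= 0:
        # Nobody is bidding, so nobody is allocated anything.
        return {user: 0.0 for user in bids}
    return {user: total_slots * bid / total_bid for user, bid in bids.items()}


if __name__ == "__main__":
    # Assumed example: two teams bidding for 100 task slots.
    print(proportional_shares({"ads": 3.0, "adhoc": 1.0}, 100))
    # {'ads': 75.0, 'adhoc': 25.0}
```

Raising a bid buys a larger share only relative to what others are bidding, so idle capacity is cheap and contended capacity is expensive.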



Random Set of References
         ▪ For a more complete bibliography, just ask

         ▪ “The Cost of a Cloud”
         ▪ “Above the Clouds”
         ▪ “A Conversation with Jim Gray”
         ▪ “Rules of Thumb in Data Engineering”
         ▪ “Distributed Computing Economics”
         ▪ “From Databases to Dataspaces”
         ▪ Dryad and SPC papers




(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0




Monday, March 9, 2009

Contenu connexe

Tendances

Building A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionBuilding A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionPhil Cryer
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
Bitrix Site Manager v11.0 Presentation
Bitrix Site Manager v11.0 PresentationBitrix Site Manager v11.0 Presentation
Bitrix Site Manager v11.0 PresentationBitrix, Inc.
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesDataWorks Summit
 
What Your CDN Won't Tell You: Optimizing a News Website for Speed and Stability
What Your CDN Won't Tell You: Optimizing a News Website for Speed and StabilityWhat Your CDN Won't Tell You: Optimizing a News Website for Speed and Stability
What Your CDN Won't Tell You: Optimizing a News Website for Speed and StabilityJulian Dunn
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Portrait of the Developer as the Artist - OpenTour Sofia
Portrait of the Developer as the Artist - OpenTour SofiaPortrait of the Developer as the Artist - OpenTour Sofia
Portrait of the Developer as the Artist - OpenTour SofiaPatrick Chanezon
 
10 Do's and Don'ts for MySQL Cluster
10 Do's and Don'ts for MySQL Cluster10 Do's and Don'ts for MySQL Cluster
10 Do's and Don'ts for MySQL Clusterelliando dias
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011GlusterFS
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java DevelopersRichard McDougall
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopDataWorks Summit
 
Presentation introduction to cloud computing and technical issues
Presentation   introduction to cloud computing and technical issuesPresentation   introduction to cloud computing and technical issues
Presentation introduction to cloud computing and technical issuesxKinAnx
 
GlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 MeetupGlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 MeetupGlusterFS
 
Gluster Webinar: Introduction to GlusterFS v3.3
Gluster Webinar: Introduction to GlusterFS v3.3Gluster Webinar: Introduction to GlusterFS v3.3
Gluster Webinar: Introduction to GlusterFS v3.3GlusterFS
 
DbB 10 Webcast #3 The Secrets Of Scalability
DbB 10 Webcast #3   The Secrets Of ScalabilityDbB 10 Webcast #3   The Secrets Of Scalability
DbB 10 Webcast #3 The Secrets Of ScalabilityLaura Hood
 
Hybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouseHybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouseLaine Campbell
 

Tendances (20)

Building A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionBuilding A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage Solution
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
Bitrix Site Manager v11.0 Presentation
Bitrix Site Manager v11.0 PresentationBitrix Site Manager v11.0 Presentation
Bitrix Site Manager v11.0 Presentation
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual Machines
 
What Your CDN Won't Tell You: Optimizing a News Website for Speed and Stability
What Your CDN Won't Tell You: Optimizing a News Website for Speed and StabilityWhat Your CDN Won't Tell You: Optimizing a News Website for Speed and Stability
What Your CDN Won't Tell You: Optimizing a News Website for Speed and Stability
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Portrait of the Developer as the Artist - OpenTour Sofia
Portrait of the Developer as the Artist - OpenTour SofiaPortrait of the Developer as the Artist - OpenTour Sofia
Portrait of the Developer as the Artist - OpenTour Sofia
 
10 Do's and Don'ts for MySQL Cluster
10 Do's and Don'ts for MySQL Cluster10 Do's and Don'ts for MySQL Cluster
10 Do's and Don'ts for MySQL Cluster
 
Hadoop & Hep
Hadoop & HepHadoop & Hep
Hadoop & Hep
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011Introduction to GlusterFS Webinar - September 2011
Introduction to GlusterFS Webinar - September 2011
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java Developers
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for Hadoop
 
Drupal And The Non Profit Agency
Drupal And The Non Profit Agency  Drupal And The Non Profit Agency
Drupal And The Non Profit Agency
 
Presentation introduction to cloud computing and technical issues
Presentation   introduction to cloud computing and technical issuesPresentation   introduction to cloud computing and technical issues
Presentation introduction to cloud computing and technical issues
 
GlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 MeetupGlusterFS Architecture - June 30, 2011 Meetup
GlusterFS Architecture - June 30, 2011 Meetup
 
The Web Scale
The Web ScaleThe Web Scale
The Web Scale
 
Gluster Webinar: Introduction to GlusterFS v3.3
Gluster Webinar: Introduction to GlusterFS v3.3Gluster Webinar: Introduction to GlusterFS v3.3
Gluster Webinar: Introduction to GlusterFS v3.3
 
DbB 10 Webcast #3 The Secrets Of Scalability
DbB 10 Webcast #3   The Secrets Of ScalabilityDbB 10 Webcast #3   The Secrets Of Scalability
DbB 10 Webcast #3 The Secrets Of Scalability
 
Hybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouseHybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouse
 

En vedette

Hact presentation anna james
Hact presentation  anna jamesHact presentation  anna james
Hact presentation anna jamesHACThousing
 
Online Masterclass Learning Analytics
Online Masterclass Learning Analytics Online Masterclass Learning Analytics
Online Masterclass Learning Analytics Hendrik Drachsler
 
ODI Overview 201306 iCity
ODI Overview 201306 iCityODI Overview 201306 iCity
ODI Overview 201306 iCitytheODI
 
North huyton communities future 2
North huyton communities future 2North huyton communities future 2
North huyton communities future 2HACThousing
 
The Essential Toolkit for Your: EDRM Renovation Australia 2017
The Essential Toolkit for Your: EDRM Renovation Australia 2017  The Essential Toolkit for Your: EDRM Renovation Australia 2017
The Essential Toolkit for Your: EDRM Renovation Australia 2017 Ark Group Australia Pty Ltd
 

En vedette (8)

Sexta evaluacion latini
Sexta evaluacion latiniSexta evaluacion latini
Sexta evaluacion latini
 
Steve Bennett
Steve BennettSteve Bennett
Steve Bennett
 
Hact presentation anna james
Hact presentation  anna jamesHact presentation  anna james
Hact presentation anna james
 
Online Masterclass Learning Analytics
Online Masterclass Learning Analytics Online Masterclass Learning Analytics
Online Masterclass Learning Analytics
 
ODI Overview 201306 iCity
ODI Overview 201306 iCityODI Overview 201306 iCity
ODI Overview 201306 iCity
 
North huyton communities future 2
North huyton communities future 2North huyton communities future 2
North huyton communities future 2
 
The Essential Toolkit for Your: EDRM Renovation Australia 2017
The Essential Toolkit for Your: EDRM Renovation Australia 2017  The Essential Toolkit for Your: EDRM Renovation Australia 2017
The Essential Toolkit for Your: EDRM Renovation Australia 2017
 
рус партнер
рус партнеррус партнер
рус партнер
 

Similaire à 20090309berkeley

John Landry at Mass TLC Feb09
John Landry at Mass TLC Feb09John Landry at Mass TLC Feb09
John Landry at Mass TLC Feb09John Landry
 
Best Practices in Migrating to MySQL - Part 1
Best Practices in Migrating to MySQL - Part 1Best Practices in Migrating to MySQL - Part 1
Best Practices in Migrating to MySQL - Part 1Ronald Bradford
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computingchrismik
 
From Grids To Clouds Guy Tel Zur May 2009
From Grids To Clouds Guy Tel Zur May 2009From Grids To Clouds Guy Tel Zur May 2009
From Grids To Clouds Guy Tel Zur May 2009Guy Tel-Zur
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Challenges Embracing Cloud Storage
Challenges Embracing Cloud StorageChallenges Embracing Cloud Storage
Challenges Embracing Cloud StorageRandy Bias
 
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)Todd Deshane
 
Cloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdfCloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdfAtaulAzizIkram
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1Joe Stein
 
A scalable server environment for your applications
A scalable server environment for your applicationsA scalable server environment for your applications
A scalable server environment for your applicationsGigaSpaces
 
Semantic Web Landscape 2009
Semantic Web Landscape 2009Semantic Web Landscape 2009
Semantic Web Landscape 2009LeeFeigenbaum
 
Controlling cloud costs with analytics
Controlling cloud costs with analyticsControlling cloud costs with analytics
Controlling cloud costs with analyticsRightScale
 

Similaire à 20090309berkeley (20)

John Landry at Mass TLC Feb09
John Landry at Mass TLC Feb09John Landry at Mass TLC Feb09
John Landry at Mass TLC Feb09
 
20080611accel
20080611accel20080611accel
20080611accel
 
20081022cca
20081022cca20081022cca
20081022cca
 
20080528dublinpt1
20080528dublinpt120080528dublinpt1
20080528dublinpt1
 
Best Practices in Migrating to MySQL - Part 1
Best Practices in Migrating to MySQL - Part 1Best Practices in Migrating to MySQL - Part 1
Best Practices in Migrating to MySQL - Part 1
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
From Grids To Clouds Guy Tel Zur May 2009
From Grids To Clouds Guy Tel Zur May 2009From Grids To Clouds Guy Tel Zur May 2009
From Grids To Clouds Guy Tel Zur May 2009
 
Cloudy Ajax 08 10
Cloudy Ajax 08 10Cloudy Ajax 08 10
Cloudy Ajax 08 10
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Challenges Embracing Cloud Storage
Challenges Embracing Cloud StorageChallenges Embracing Cloud Storage
Challenges Embracing Cloud Storage
 
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)
 
Brandon
BrandonBrandon
Brandon
 
Cloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdfCloud Computing Berkeley.pdf
Cloud Computing Berkeley.pdf
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
 
A scalable server environment for your applications
A scalable server environment for your applicationsA scalable server environment for your applications
A scalable server environment for your applications
 
Semantic Web Landscape 2009
Semantic Web Landscape 2009Semantic Web Landscape 2009
Semantic Web Landscape 2009
 
JOSA TechTalks - Downgrade your Costs
JOSA TechTalks - Downgrade your CostsJOSA TechTalks - Downgrade your Costs
JOSA TechTalks - Downgrade your Costs
 
NATO IST Symposium 2013
NATO IST Symposium 2013NATO IST Symposium 2013
NATO IST Symposium 2013
 
Controlling cloud costs with analytics
Controlling cloud costs with analyticsControlling cloud costs with analytics
Controlling cloud costs with analytics
 

Plus de Jeff Hammerbacher (20)

20120223keystone
20120223keystone20120223keystone
20120223keystone
 
20100714accel
20100714accel20100714accel
20100714accel
 
20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
20100513brown
20100513brown20100513brown
20100513brown
 
20100423sage
20100423sage20100423sage
20100423sage
 
20100418sos
20100418sos20100418sos
20100418sos
 
20100301icde
20100301icde20100301icde
20100301icde
 
20100201hplabs
20100201hplabs20100201hplabs
20100201hplabs
 
20100128ebay
20100128ebay20100128ebay
20100128ebay
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20091110startup2startup
20091110startup2startup20091110startup2startup
20091110startup2startup
 
20091030nasajpl
20091030nasajpl20091030nasajpl
20091030nasajpl
 
20091027genentech
20091027genentech20091027genentech
20091027genentech
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
 
20090622 Velocity
20090622 Velocity20090622 Velocity
20090622 Velocity
 
20090422 Www
20090422 Www20090422 Www
20090422 Www
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 

20090309berkeley

  • 2. Global Information Platforms: Evolving the Data Warehouse. Jeff Hammerbacher, VP of Products, Cloudera. March 9, 2009
  • 3. What We’ll Cover Today: Oh Crap, He’s Gonna Ramble
      ▪ WARNING: highly speculative talk ahead
      ▪ What did we build at Facebook, and what did it accomplish?
      ▪ How will that infrastructure evolve further?
      ▪ There better be some cloud in there.
      ▪ In the language of this course, we’ll talk mostly about infrastructure; some thoughts on services and applications at the end.
      ▪ Please be skeptical and ask questions throughout
  • 4. Facebook: Stage Two (You Don’t Even Wanna See Stage One). Diagram components: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server
  • 5. Facebook: Stage Four (Shades of an Information Platform). Diagram components: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers
  • 6. Data Points: Single Organization
      ▪ 3 data centers: two on west coast, one on east coast
      ▪ Around 10K web servers, 1.5K Database servers, 0.5K Memcache
      ▪ Around 0.7K Hadoop nodes, and growing quickly
      ▪ Relative data volumes
          ▪ Around 40 TB in Cassandra tier
          ▪ Around 60 TB in MySQL tier
          ▪ Around 1 PB in photos tier
          ▪ Around 2 PB in Hadoop tier
      ▪ 10 TB per day ingested into Hadoop, 15 TB generated
      ▪ IMPORTANT: Hadoop tier not retiring data!
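The growth figures on this slide imply a rough doubling time for the Hadoop tier. The sketch below is a back-of-the-envelope calculation only: it assumes decimal units (1 PB = 1000 TB), that the ingested and generated volumes add together, and that the daily rate stays constant. None of these assumptions is stated in the talk.

```python
# Figures taken from the slide; decimal units (1 PB = 1000 TB) are an assumption.
HADOOP_TIER_TB = 2 * 1000
INGESTED_TB_PER_DAY = 10
GENERATED_TB_PER_DAY = 15


def days_to_double(current_tb: float, growth_tb_per_day: float) -> float:
    """Days until a tier that never retires data doubles in size."""
    if growth_tb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return current_tb / growth_tb_per_day


# Assumption: ingested and generated data are separate volumes that add up.
daily_growth_tb = INGESTED_TB_PER_DAY + GENERATED_TB_PER_DAY
doubling_days = days_to_double(HADOOP_TIER_TB, daily_growth_tb)
print(f"{daily_growth_tb} TB/day -> tier doubles in about {doubling_days:.0f} days")
```

Under these assumptions the tier doubles in under three months, which is why the note about not retiring data matters.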
  • 7. Data Points: All Organizations
      ▪ 8 million servers shipped per year (IDC)
          ▪ 20% go to web companies (Rick Rashid)
          ▪ 33% go to HPC (Andy Bechtolsheim)
      ▪ 2.5 exabytes of external storage shipped per year (IDC)
      ▪ Data center costs (James Hamilton)
          ▪ 45% servers
          ▪ 25% power and cooling hardware
          ▪ 15% power draw
          ▪ 15% network
      ▪ Jim Gray
          ▪ “Disks will replace tapes, and disks will have infinite capacity. Period.”
          ▪ “processors are going to migrate to where the transducers are.”
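To make the percentages above concrete, the following sketch converts them into absolute numbers. The shipment and cost shares come from the slide; the monthly budget is a hypothetical figure chosen for illustration. The two shipment shares are attributed to different sources, so they may not describe disjoint segments.

```python
# Shares quoted on the slide, kept as integer percentages for exact arithmetic.
SERVERS_SHIPPED_PER_YEAR = 8_000_000
SHIPMENT_SHARE_PERCENT = {"web companies": 20, "HPC": 33}
COST_SHARE_PERCENT = {
    "servers": 45,
    "power and cooling hardware": 25,
    "power draw": 15,
    "network": 15,
}


def servers_for(segment: str) -> int:
    """Servers per year going to a segment, from the quoted share."""
    return SERVERS_SHIPPED_PER_YEAR * SHIPMENT_SHARE_PERCENT[segment] // 100


def split_budget(total: float) -> dict[str, float]:
    """Split a data center budget according to the quoted cost shares."""
    return {item: total * pct / 100 for item, pct in COST_SHARE_PERCENT.items()}


for segment in SHIPMENT_SHARE_PERCENT:
    print(f"{segment}: {servers_for(segment):,} servers per year")

# Hypothetical budget of 1,000,000 (any currency), for illustration only.
for item, amount in split_budget(1_000_000).items():
    print(f"{item}: {amount:,.0f}")
```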
  • 8. Information Platform Workloads
      ▪ Data collection: event logs, persistent storage, web
      ▪ Regular processing pipelines of varying granularity
      ▪ Summaries consumed by external users (e.g. Google Analytics)
      ▪ Summaries for internal reporting
      ▪ Ad optimization pipeline
      ▪ Experimentation platform pipeline
      ▪ Ad hoc analyses
      ▪ Data transformations and data integrity enforcement
      ▪ Document indexing
      ▪ Storage system bulk loading
      ▪ Model building
      ▪ Reports
      ▪ Internal storage system workloads: replication, CRC checks, rebalancing, archiving, short stroking
  • 9. Management Challenges: Stuff I Didn’t Want to Worry About
      ▪ Pricing: how much should I be paying for my hardware?
      ▪ Physical management: rack and stack, disk replacement, etc.
      ▪ Backup/restore, archive, capacity planning
      ▪ Optimal node configuration: CPU, memory, disk, network
      ▪ Optimal software configuration
      ▪ Geographic diversity for data availability and low latency
      ▪ Access control, encryption, and other security measures
      ▪ Tiered storage: separation of “hot” and “cold” data
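One simple way to separate “hot” and “cold” data is by time since last access. The sketch below is illustrative only: the 90-day threshold, the function names, and the example paths are hypothetical and do not describe any system mentioned in the talk.

```python
from datetime import datetime, timedelta

# Hypothetical policy: data untouched for more than 90 days counts as cold.
COLD_AFTER = timedelta(days=90)


def storage_tier(last_access: datetime, now: datetime,
                 cold_after: timedelta = COLD_AFTER) -> str:
    """Classify one data set as 'hot' or 'cold' by time since last access."""
    return "cold" if now - last_access > cold_after else "hot"


def partition_by_tier(last_access_by_path: dict[str, datetime],
                      now: datetime) -> dict[str, list[str]]:
    """Group paths into hot and cold lists, sorted by path within each tier."""
    tiers: dict[str, list[str]] = {"hot": [], "cold": []}
    for path in sorted(last_access_by_path):
        tiers[storage_tier(last_access_by_path[path], now)].append(path)
    return tiers


now = datetime(2009, 3, 9)
example = {
    "/logs/2009-03-01": datetime(2009, 3, 8),   # hypothetical paths
    "/logs/2008-06-01": datetime(2008, 7, 1),
}
print(partition_by_tier(example, now))
```

A real policy would likely weigh access frequency and retrieval cost as well; recency alone is the simplest possible criterion.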
  • 10. Okay, Let’s Get to the Cloud Stuff
      ▪ The cloud can help in removing management challenges
      ▪ Replicate highly valuable data into the cloud
      ▪ Archive cold data into the cloud
      ▪ Knit global data centers together with the cloud
      ▪ See “Watch for Goats in the Cloud” from David Slik of Bycast
          ▪ http://tr.im/h9LK
  • 11. Cloud Challenges
      ▪ Current clouds are not optimized for data intensive workloads
      ▪ Organizations own significant hardware assets
      ▪ Identity management
      ▪ Privacy and security
      ▪ Cloud seeding
      ▪ Moving data from the customer’s data center to the cloud
      ▪ Moving data within a mega-datacenter
      ▪ Moving data between clouds
  • 12. Bare Metal Cloud (Hosting?) Providers
      ▪ OpSource: integrated billing
      ▪ SoftLayer: data center API
      ▪ 3tera: “virtual private data center”
      ▪ GoGrid Cloud Connect
      ▪ Rackspace Platform Hosting
      ▪ The Planet
      ▪ Liquid Web
      ▪ Layered Tech
      ▪ Internap
      ▪ Terremark Enterprise Cloud
  • 13. Optimizing Hardware for DISC: We Need Less Power, Captain?
      ▪ “FAWN: A Fast Array of Wimpy Nodes”
          ▪ DHT built at CMU with XScale chips and flash storage
      ▪ “Low Power Amdahl Blades for Data Intensive Computing”
          ▪ Couple low-power CPUs to flash SSDs for DISC workloads
      ▪ “Seeding the Clouds”, Dan Reed
          ▪ Also “Microsoft Builds Atomic Cloud”
          ▪ Microsoft’s Cloud Computing Futures (CCF) team exploring clusters built from nodes using low-power Atom chips
  • 14. Cloud Residue: What Happens to Existing Hardware?
      ▪ Cloud pricing is not competitive when a company already owns excess server capacity and employs a significant operations team
      ▪ How can we speed the transition to the cloud?
          ▪ Consolidate existing secondary market for hardware
          ▪ Purchase from companies with declining pageviews, e.g. MySpace
          ▪ Two birds, one stone: ship existing servers with initial data load to cloud provider (“cloud seeding”, see later slide)
          ▪ Wait it out: servers generally considered to have a three year lifespan
      ▪ Where do servers go when they die?
  • 15. Identity Across Clouds
      ▪ Configuring your LDAP server to speak to each new cloud utility is a pain
      ▪ Authentication and authorization systems being built by every new cloud provider
      ▪ Every organization imposes different standards on cloud providers
      ▪ Consumer identity platforms
          ▪ Facebook Connect
          ▪ OpenID + OAuth
      ▪ I don’t have a good answer here--any thoughts appreciated!
  • 16. Privacy and Security
      ▪ Every organization must reinvent and build expertise in these mechanisms
      ▪ Components
          ▪ Physical security
          ▪ Cloud connection: authentication, authorization, encryption
          ▪ Audit logging
          ▪ Data obfuscation
          ▪ Separation from other customers in multi-tenant environment
          ▪ Segregation of individual users within a customer’s cloud
          ▪ Storage retirement (disk shredding!)
          ▪ Controlling access of cloud provider employees
      ▪ Compliance, certification, and legislation
      ▪ Ramifications of security breach
  • 17. Cloud Seeding: Let’s Get This Party Started
      ▪ Freedom OSS offers AWS-certified “Cloud Data Transfer Service”
          ▪ See http://www.freedomoss.com/clouddataingestion
      ▪ Bycast puts two or more “edge servers” on premise to perform initial data ingestion, then ships those servers to their cloud data center
          ▪ See http://tr.im/h9PH
      ▪ If you can’t physically ship the disks, leverage Metro Ethernet or a dedicated link
      ▪ Investigate modified network stacks (see following slide)
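The case for shipping disks rather than using the network comes down to simple arithmetic. The sketch below estimates network transfer time and compares it with a shipping time. The data size, link speed, efficiency factor, and shipping duration are all hypothetical inputs, not figures from the talk.

```python
def transfer_days(data_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Days needed to move data_tb terabytes (decimal) over a link_mbps link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    """
    if link_mbps <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400


def prefer_shipping(data_tb: float, link_mbps: float, shipping_days: float,
                    efficiency: float = 0.8) -> bool:
    """True when physically shipping the disks is faster than the network."""
    return shipping_days < transfer_days(data_tb, link_mbps, efficiency)


# Hypothetical scenario: 100 TB over a 1 Gbps link versus 3 days of shipping.
print(f"network: {transfer_days(100, 1000):.1f} days")
print("ship the disks" if prefer_shipping(100, 1000, 3) else "use the network")
```

Even at full nominal speed, 100 TB over a 1 Gbps link takes more than nine days, so shipping stays competitive for initial loads of this size.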
  • 18. Bulk Data Transfer Between Data Centers
      ▪ Companies
          ▪ WAM!NET (bought by SAVVIS)
          ▪ Aspera Software
      ▪ Protocols
          ▪ GridFTP
          ▪ UDT
      ▪ Unix utility: bbcp
      ▪ Modify congestion control
      ▪ WAN optimization tricks: compress, transfer deltas, cache, etc.
      ▪ Peering, transit, OC levels, all that good stuff
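Two of the WAN optimization tricks listed above, compression and delta transfer, can be sketched with the Python standard library. This is a minimal illustration using fixed-offset chunks, not the rolling-checksum approach of tools such as rsync, and it does not describe any of the products named on the slide. The function names and the tiny chunk size are hypothetical.

```python
import hashlib
import zlib


def chunk_digests(data: bytes, chunk_size: int) -> list[str]:
    """SHA-256 digest of each fixed-size chunk; the receiver sends these."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def build_delta(new: bytes, remote_digests: list[str],
                chunk_size: int) -> dict[int, bytes]:
    """Compressed chunks of `new` that the receiver lacks or that changed."""
    delta = {}
    for index, start in enumerate(range(0, len(new), chunk_size)):
        chunk = new[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if index >= len(remote_digests) or remote_digests[index] != digest:
            delta[index] = zlib.compress(chunk)
    return delta


def apply_delta(old: bytes, delta: dict[int, bytes],
                chunk_size: int, new_length: int) -> bytes:
    """Rebuild the new version on the receiver from its old copy plus the delta."""
    chunks = [old[i:i + chunk_size] for i in range(0, len(old), chunk_size)]
    n_chunks = -(-new_length // chunk_size)  # ceiling division
    chunks = chunks[:n_chunks] + [b""] * (n_chunks - len(chunks))
    for index, payload in delta.items():
        chunks[index] = zlib.decompress(payload)
    return b"".join(chunks)[:new_length]


old = b"aaaabbbbcccc"
new = b"aaaaXXXXccccdd"
delta = build_delta(new, chunk_digests(old, 4), 4)
print(sorted(delta))  # indexes of the chunks that had to be sent
print(apply_delta(old, delta, 4, len(new)) == new)
```

Only the changed and appended chunks cross the link. An insertion near the start of a file would shift every later chunk, which is the weakness that rolling checksums address.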
  • 19. Data Transfer Within a Data Center
      ▪ Hierarchical topology
          ▪ Border routers, core switches, and top of rack switches
          ▪ Top of rack switches usually oversubscribed
      ▪ Diversity of protocols
          ▪ Ethernet, Infiniband, Fibre Channel, PCI Express, etc.
      ▪ Networking companies working to flatten topology and unify protocols
          ▪ Cisco: Data Center Ethernet (DCE)
          ▪ Juniper: Stratus, a.k.a. Data Center Fabric (DCF)
      ▪ MapReduce architected to push computation to the data; will such logic be necessary in the near future?
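Oversubscription of a top of rack switch is the ratio of the bandwidth its servers can offer to the bandwidth of its uplinks. The sketch below computes that ratio for a hypothetical rack; the server count and port speeds are illustrative assumptions, not figures from the talk.

```python
def oversubscription_ratio(servers: int, server_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Ratio of total server-facing bandwidth to total uplink bandwidth."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("the switch needs at least one uplink with positive speed")
    return (servers * server_gbps) / (uplinks * uplink_gbps)


def cross_rack_gbps_per_server(servers: int, uplinks: int,
                               uplink_gbps: float) -> float:
    """Uplink bandwidth per server if every server sends off-rack at once."""
    if servers <= 0:
        raise ValueError("the rack needs at least one server")
    return uplinks * uplink_gbps / servers


# Hypothetical rack: 40 servers at 1 Gbps each, 2 uplinks at 10 Gbps each.
print(oversubscription_ratio(40, 1, 2, 10))
print(cross_rack_gbps_per_server(40, 2, 10))
```

A ratio above 1 means off-rack traffic can be throttled at the switch, which is one reason to schedule computation close to the data.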
  • 20. Data Transfer Between Clouds
      ▪ Most cloud providers present novel APIs for data retrieval
          ▪ e.g. S3, SimpleDB, App Engine data store, etc.
      ▪ It’s usually cheaper (or free) to transfer data within a cloud
      ▪ Standards and organizations are emerging
          ▪ Open Virtualization Format (OVF)
          ▪ Open Cloud Consortium (OCC)
          ▪ Cloud Computing Interoperability Forum (CCIF)
          ▪ “Unified Cloud Interface” (UCI)
          ▪ Their diagrams scare me, a little
  • 21. Service and Application Changes
      ▪ “Pay as you go” is shared motto of Dataspaces and the Cloud
          ▪ Not a coincidence
      ▪ Persisting data into information platform should be trivial
      ▪ Layer storage and processing capabilities onto platform
          ▪ Catalog
          ▪ Search
          ▪ Query
          ▪ Statistics and Machine Learning
      ▪ Materialize data into storage system best suited to workload
      ▪ Leverage workload metadata to get better over time
  • 22. Future Stages: Potential Evolutions, pt. 1
      ▪ Global snapshots of the distributed file system
      ▪ Tiered storage to accommodate “cold” data
      ▪ Streaming computations over live data
      ▪ Higher-level libraries for text mining, linear algebra, etc.
      ▪ Tighter coupling between data collection, job scheduling, and reporting via a single metadata repository
      ▪ Testing and debugging frameworks
      ▪ Proliferation of data marts/sandboxes
      ▪ Accommodate compute-intensive workloads
  • 23. Future Stages: Potential Evolutions, pt. 2
      ▪ Seamless collection of data sets from the web
      ▪ Wider variety of physical operators (cf. System R* through Dryad)
      ▪ Separate access APIs for different classes of users
          ▪ Infrastructure engineers
          ▪ Product engineers
          ▪ Data scientists
          ▪ Business analysts
      ▪ DSLs for domain-specific work
      ▪ Utilize browser as client (AJAX, Comet, Gears, etc.)
  • 24. Future Stages: Potential Evolutions, pt. 3
      ▪ Workflow cloning
      ▪ Recommended analyses based on workload and user metadata
      ▪ Automatic keyword search
      ▪ Integrity constraint checking and enforcement
      ▪ Granular access controls
      ▪ Metadata evolution history
      ▪ Table statistics and Hive query optimization
      ▪ Utilization optimization regularized by customer satisfaction
      ▪ Currency-based scheduling (cf. Thomas Sandholm’s work)
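One common form of currency-based scheduling gives each user a share of cluster capacity proportional to what they are willing to spend. The sketch below illustrates that general idea only. It is not a reproduction of Sandholm’s algorithms or of any Hadoop scheduler, and the function name, user names, and spending rates are hypothetical.

```python
def allocate_slots(total_slots: int, spending_rates: dict[str, int]) -> dict[str, int]:
    """Split task slots in proportion to each user's non-negative spending rate.

    Leftover slots from integer rounding go to the largest remainders,
    with ties broken by user name so the result is deterministic.
    """
    total_rate = sum(spending_rates.values())
    if total_rate <= 0:
        return {user: 0 for user in spending_rates}
    allocation: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for user, rate in spending_rates.items():
        allocation[user], remainders[user] = divmod(total_slots * rate, total_rate)
    leftover = total_slots - sum(allocation.values())
    by_remainder = sorted(remainders, key=lambda u: (-remainders[u], u))
    for user in by_remainder[:leftover]:
        allocation[user] += 1
    return allocation


# Hypothetical spending rates, in credits per scheduling interval.
print(allocate_slots(100, {"ads": 5, "reporting": 3, "adhoc": 2}))
```

A user who raises their spending rate gains capacity at the expense of the others, so priorities are expressed through the budget rather than through a fixed queue order.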
  • 25. Random Set of References
      ▪ For a more complete bibliography, just ask
      ▪ “The Cost of a Cloud”
      ▪ “Above the Clouds”
      ▪ “A Conversation with Jim Gray”
      ▪ “Rules of Thumb in Data Engineering”
      ▪ “Distributed Computing Economics”
      ▪ “From Databases to Dataspaces”
      ▪ Dryad and SPC papers
  • 26. (c) 2009 Cloudera, Inc. or its licensors. “Cloudera” is a registered trademark of Cloudera, Inc. All rights reserved. 1.0