A Behind-the-Scenes Tour


        Jeff Dean
      Google Fellow
    jeff@google.com
User’s View of Google
        Organizing the world’s information and making it
               universally accessible and useful
A Computer Scientist’s View of Google

Problems span a wide range of areas:

  Product design
  User interfaces
  Machine learning, Statistics, Information retrieval, AI
  Data structures, Algorithms
  Compilers, Programming languages
  Networking, Distributed systems, Fault tolerance
  Hardware, Mechanical engineering
  …and much, much more!
Hardware Design Philosophy
• Prefer low-end server/PC-class designs
 – Build lots of them!

• Why?
 – Single machine performance is not interesting
    • Even smaller problems are too large for any single system
    • Large problems have lots of available parallelism
 – Lots of commodity machines give best performance/$
“Google” Circa 1997 (google.stanford.edu)
Google (circa 1999)
Google Data Center (Circa 2000)
Google (new data center 2001)
Google Data Center (3 days later)
Current Design

• In-house rack design
• PC-class motherboards
• Low-end storage and
  networking hardware
• Linux
• + in-house software
Multicore Computing
The Joys of Real Hardware
•   Typical first year for a new cluster:

•   ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
•   ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
•   ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
•   ~1 network rewiring (rolling ~5% of machines down over 2-day span)
•   ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
•   ~5 racks go wonky (40-80 machines see 50% packet loss)
•   ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
•   ~12 router reloads (takes out DNS and external vips for a couple minutes)
•   ~3 router failures (have to immediately pull traffic for an hour)
•   ~dozens of minor 30-second blips for dns
•   ~1000 individual machine failures
•   ~thousands of hard drive failures


•   slow disks, bad memory, misconfigured machines, flaky machines, etc.
Implications of our Computing Environment

• Stuff Breaks
• If you have one server, it may stay up three years (1,000 days)
• If you have 10,000 servers, expect to lose ten a day
• “Ultra-reliable” hardware doesn’t really help
• At large scale, super-fancy reliable hardware still fails, albeit less often
   – software still needs to be fault-tolerant
   – commodity machines without fancy hardware give better perf/$


• Reliability has to come from the software

• How can we make it easy to write distributed programs?
Rest of Talk

• Overview of Infrastructure
  – GFS, MapReduce, BigTable

• A peek at query serving & machine translation
• Some fun with interesting data
• General software engineering style/philosophy
GFS Design

[Diagram: replicated GFS masters plus misc. servers; clients send metadata
 requests to a master, while data flows directly between clients and
 chunkservers 1..N, each holding a set of replicated chunks (C0, C1, C2, C3, C5, …)]
• Master manages metadata
• Data transfers happen directly between
  clients/chunkservers
• Files broken into chunks (typically 64 MB)
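
To make those bullets concrete, here is a toy, in-memory sketch of the read path, with invented names rather than the real GFS client API: the client asks the master only for chunk locations, then fetches the bytes directly from a chunkserver.

// Illustrative sketch only: a single-process toy, not the real GFS client library.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

constexpr uint64_t kChunkSize = 64;   // real GFS chunks are ~64 MB; tiny here for the demo

struct ChunkLocation { int chunkserver; uint64_t handle; };

// The master holds only metadata: (file, chunk index) -> replica locations.
std::map<std::pair<std::string, uint64_t>, std::vector<ChunkLocation>> master;
// Chunkservers hold the actual bytes, keyed by chunk handle.
std::map<int, std::map<uint64_t, std::string>> chunkservers;

std::string ReadAt(const std::string& file, uint64_t offset, std::size_t len) {
  uint64_t index = offset / kChunkSize;           // which chunk of the file?
  const auto& replicas = master[{file, index}];   // metadata-only request to the master
  const ChunkLocation& loc = replicas.front();    // pick any replica
  const std::string& chunk = chunkservers[loc.chunkserver][loc.handle];
  return chunk.substr(offset % kChunkSize, len);  // bulk data comes straight from the chunkserver
}

int main() {
  master[{"/logs/2007-11-01", 0}] = {{1, 42}, {2, 42}, {3, 42}};  // 3 replicas of chunk 0
  chunkservers[1][42] = "hello from chunk 0 ...";
  std::cout << ReadAt("/logs/2007-11-01", 6, 10) << "\n";         // prints "from chunk"
}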
GFS Usage @ Google
• 200+ clusters
• Many clusters of 1000s of machines
• Pools of 1000s of clients
• 4+ PB Filesystems
• 40 GB/s read/write load
• (in the presence of frequent HW failures)
MapReduce

• A simple programming model that applies to many large-scale
 computing problems


• Hide messy details in MapReduce runtime library:
  –   automatic parallelization
  –   load balancing
  –   network and disk transfer optimizations
  –   handling of machine failures
  –   robustness
  –   improvements to core library benefit all users of library!
Typical problem solved by MapReduce

• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results

                • Outline stays the same,
         map and reduce change to fit the problem
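
For concreteness, here is a toy single-process version of that outline (word counting), written as plain C++ rather than against the real MapReduce library, which would spread each phase over thousands of machines:

// Word count: the canonical fit for the read / map / shuffle+sort / reduce / write outline.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
  std::vector<std::string> records = {"the quick brown fox", "the lazy dog"};

  // Map: extract (key, value) pairs from each record; here (word, 1).
  std::vector<std::pair<std::string, int>> pairs;
  for (const auto& rec : records) {
    std::istringstream in(rec);
    std::string word;
    while (in >> word) pairs.push_back({word, 1});
  }

  // Shuffle and sort: group values by key.
  std::map<std::string, std::vector<int>> groups;
  for (const auto& [word, count] : pairs) groups[word].push_back(count);

  // Reduce: aggregate the values for each key, then write the results.
  for (const auto& [word, counts] : groups) {
    int total = 0;
    for (int c : counts) total += c;
    std::cout << word << "\t" << total << "\n";
  }
}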
Processing Large Datasets

[Images: geographic data, index files, a data center]
Transforming data

Input: Feature List                                  Output: Intersection List
1: <type=Road>, <intersections=(3)>, <geom>, …       3: <type=Intersection>, <roads=(
2: <type=Road>, <intersections=(3)>, <geom>, …            1: <type=Road>, <geom>, <name>, …
3: <type=Intersection>, <roads=(1,2,5)>, …                2: <type=Road>, <geom>, <name>, …
4: <type=Road>, <intersections=(6)>, <geom>, …            5: <type=Road>, <geom>, <name>, …)>, …
5: <type=Road>, <intersections=(3,6)>, <geom>, …      6: <type=Intersection>, <roads=(
6: <type=Intersection>, <roads=(5,6,7)>, …                4: <type=Road>, <geom>, <name>, …
7: <type=Road>, <intersections=(6)>, <geom>, …            5: <type=Road>, <geom>, <name>, …
8: <type=Border>, <name>, <geom>, …                       7: <type=Road>, <geom>, <name>, …)>, …

[Diagram: small road map in which intersections 3 and 6 join roads 1, 2, 4, 5, and 7]
Map-Reduce Programming Model

Input: list of items  →  Map: apply map() to each, emit (key, val) pairs  →
Shuffle: sort by key  →  Reduce: apply reduce() to the list of pairs with the same key  →
Output: new list of items

Example (roads and intersections):

  Map                           Shuffle (grouped by key)          Reduce
  1: Road  → (3, 1: Road)       3: (3, 1: Road), (3, 2: Road),    3: Intersection
  2: Road  → (3, 2: Road)          (3, 3: Intxn.), (3, 5: Road)      { 1: Road, 2: Road, 5: Road }
  3: Intxn → (3, 3: Intxn)
  4: Road  → (6, 4: Road)       6: (6, 4: Road), (6, 5: Road),    6: Intersection
  5: Road  → (3, 5: Road),         (6, 6: Intxn.), (6, 7: Road)      { 4: Road, 5: Road, 7: Road }
              (6, 5: Road)
  6: Intxn → (6, 6: Intxn)
  7: Road  → (6, 7: Road)

[Diagram: the same small road map, with intersections 3 and 6]
Code Example

class IntersectionAssemblerMapper : public Mapper {
  virtual void Map(MapInput* input) {
    GeoFeature feature;
    feature.FromMapInput(input);
    if (feature.type() == FEATURE_INTERSECTION) {
      Emit(feature.id(), input);
    } else if (feature.type() == FEATURE_ROAD) {
      Emit(feature.intersection_id(0), input);
      Emit(feature.intersection_id(1), input);
    }
  }
};
REGISTER_MAPPER(IntersectionAssemblerMapper);

class IntersectionAssemblerReducer : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    GeoFeature feature;
    GraphIntersection intersection;
    intersection.id = input->key();
    while (!input->done()) {
      feature.FromMapInput(input->value());
      if (feature.type() == FEATURE_INTERSECTION)
        intersection.SetIntersection(feature);
      else
        intersection.AddRoadFeature(feature);
      input->next();
    }
    Emit(intersection);
  }
};
REGISTER_REDUCER(IntersectionAssemblerReducer);
Rendering Map Tiles

Input: geographic feature list  →  Map: emit each feature to all overlapping
latitude-longitude rectangles  →  Shuffle: sort by key (key = rectangle id)  →
Reduce: render tile using data for all enclosed features  →  Output: rendered tiles

Example:

  Map                                         Shuffle (grouped by rectangle)
  I-5             → (N, I-5), (S, I-5)        N: (N, I-5), (N, Lake Wash.), (N, WA-520), …
  Lake Washington → (N, Lake Wash.),
                    (S, Lake Wash.)           S: (S, I-5), (S, Lake Wash.), (S, I-90), …
  WA-520          → (N, WA-520)
  I-90            → (S, I-90)
Widely applicable at Google

    – Implemented as a C++ library linked to user programs
    – Can read and write many different data types


•                Example uses:

         distributed grep         web access log stats
         distributed sort         web link-graph reversal
         term-vector per host     inverted index construction
         document clustering      statistical machine translation
         machine learning
                                  …
         ...
Parallel MapReduce

[Diagram: input data is split across many parallel Map tasks, coordinated by a
 Master; intermediate data flows through Shuffle stages to Reduce tasks, which
 write the partitioned output]
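
One detail the picture glosses over is how intermediate pairs are routed to reduce tasks; the MapReduce paper (OSDI 2004) gives hash(key) mod R as the default partitioning function. A minimal sketch:

// Route an intermediate key to one of R reduce tasks (default partitioning from the paper).
#include <functional>
#include <iostream>
#include <string>

int PartitionFor(const std::string& key, int num_reduce_tasks) {
  return static_cast<int>(std::hash<std::string>{}(key) % num_reduce_tasks);
}

int main() {
  const int R = 3;  // number of reduce tasks, i.e. output partitions
  for (std::string key : {"3", "6", "cnn.com", "I-5"})
    std::cout << "key " << key << " -> reduce task " << PartitionFor(key, R) << "\n";
}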
Example Status

[Screenshot: status page for a running MapReduce job, showing shards in the
 mapping/shuffling phase and shards in the reducing phase]
MapReduce Programs in Google’s Source Tree
New MapReduce Programs Per Month

                  Summer intern effect
Usage Statistics Over Time

                                 Aug, ’04   Mar, ’05 Mar, ’06   Sep, '07
Number of jobs                       29K       72K      171K     2,217K
Average completion time (secs)       634       934       874        395
Machine years used                   217        981    2,002     11,081
Input data read (TB)               3,288     12,571   52,254    403,152
Intermediate data (TB)               758      2,756    6,743     34,774
Output data written (TB)             193        941    2,970     14,018
Average worker machines              157        232      268        394
BigTable: Motivation

• Lots of (semi-)structured data at Google
   – URLs:
      • Contents, crawl metadata, links, anchors, pagerank, …
   – Per-user data:
      • User preference settings, recent queries/search results,
        …
   – Geographic locations:
      • Physical entities (shops, restaurants, etc.), roads,
        satellite image data, user annotations, …
• Scale is large
   – billions of URLs, many versions/page (~20K/version)
   – Hundreds of millions of users, thousands of q/sec
   – 100TB+ of satellite image data
Why not just use commercial DB?

• Scale is too large for most commercial databases

• Even if it weren’t, cost would be very high
   – Building internally means system can be applied
     across many projects for low incremental cost

• Low-level storage optimizations help performance
  significantly
   – Much harder to do when running on top of a
     database layer

   – Also fun and challenging to build large-scale
     systems :)
Basic Data Model
• Distributed multi-dimensional sparse map
   – (row, column, timestamp) → cell contents

[Diagram: row “www.cnn.com”, column “contents:”, with the cell value “<html>…”
 stored at timestamps t3, t11, t17]
• Rows are ordered lexicographically
• Good match for most of our applications
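
A toy model of that data model (a single std::map standing in for the distributed table, not the Bigtable API) also shows why lexicographic row ordering is convenient: all cells of one row are adjacent, so a row scan is a contiguous range.

// Toy: (row, column, timestamp) -> cell contents, sorted lexicographically by row.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

using Key = std::tuple<std::string, std::string, uint64_t>;  // (row, column, timestamp)
std::map<Key, std::string> table;

int main() {
  table[{"www.cnn.com", "contents:", 3}]  = "<html>v1…";
  table[{"www.cnn.com", "contents:", 11}] = "<html>v2…";
  table[{"www.cnn.com", "contents:", 17}] = "<html>v3…";

  // Scan the full "contents:" history of one row: a contiguous range of the sorted map.
  for (const auto& [key, value] : table)
    std::cout << std::get<0>(key) << " / " << std::get<1>(key)
              << " @ t" << std::get<2>(key) << " = " << value << "\n";
}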
Tablets & Splitting

                          “language:”   “contents:”

“aaa.com”

“cnn.com”                     EN         “<html>…”


“cnn.com/sports.html”
   Tablets
                                    …

“website.com”
        …
“yahoo.com/kids.html”
        …
“yahoo.com/kids.html0”
       …
“zuppa.com/menu.html”
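
Because each tablet covers a contiguous row range, locating the tablet for a row is a single ordered-map lookup over tablet end keys. The sketch below is a simplification with invented names (the real system goes through a METADATA table and caches locations):

// Toy tablet location lookup: the first tablet whose end row key is >= the row serves it.
#include <iostream>
#include <map>
#include <string>

std::map<std::string, std::string> tablet_by_end_key = {
    {"cnn.com/sports.html", "tabletserver-17"},
    {"yahoo.com/kids.html", "tabletserver-03"},
    {"\xff\xff", "tabletserver-42"},   // sentinel covering the tail of the table
};

std::string ServerFor(const std::string& row) {
  return tablet_by_end_key.lower_bound(row)->second;
}

int main() {
  std::cout << ServerFor("cnn.com") << "\n";               // tabletserver-17
  std::cout << ServerFor("website.com") << "\n";           // tabletserver-03
  std::cout << ServerFor("zuppa.com/menu.html") << "\n";   // tabletserver-42
}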
Tablet Representation

[Diagram: writes go to an append-only log on GFS and to a random-access write
 buffer in memory; reads consult the write buffer plus a set of SSTables on GFS
 (mmap); together these make up a tablet]

SSTable: immutable on-disk ordered map from string -> string
  string keys: <row, column, timestamp> triples
BigTable System Structure

[Diagram: Bigtable clients use a client library; Open() and other metadata ops go
 to the Bigtable master, which performs metadata ops and load balancing; reads and
 writes go directly to Bigtable tablet servers, each of which serves data for a set
 of tablets. Underneath sit a cluster scheduling system (handles failover and
 monitoring), GFS (holds tablet data and logs), and a lock service (holds metadata,
 handles master election)]
BigTable Status
• Design/initial implementation started beginning of 2004
• Currently ~500 BigTable cells
• Production use or active development for ~70 projects:
   –   Google Print
   –   My Search History
   –   Orkut
   –   Crawling/indexing pipeline
   –   Google Maps/Google Earth
   –   Blogger
   –   …
• Largest bigtable cell manages ~6000TB of data spread
  over several thousand machines (larger cells planned)
Future Infrastructure Directions

•    Existing systems mostly designed to work within cluster or
     datacenter

• Current work: building next-generation systems that span all our datacenters
   – single global namespace
   – stronger consistency across datacenters
      • tricky in presence of partitions
   – allow higher-level constraints:
      • “please keep this data on 2 disks in U.S., 2 in Europe and 1 in Asia”
   – design goal: much more automated operation
      • automatic growing and shrinking of capacity for all jobs
      • automatic movement of jobs between datacenters
      • automatic migration/replication of data across datacenter boundaries
Google Query Serving Infrastructure

[Diagram: a query arrives at a Google Web Server, which also consults misc. servers
 (spell checker, ad server) and many other corpora (News, Books, …); index servers
 are partitioned into index shards I0..IN and doc servers into doc shards D0..DM,
 with each shard replicated many times]

Elapsed time: 0.25s, machines involved: 1000s+
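
A toy sketch of the fan-out the diagram implies, with invented names and scoring (the real serving system does far more): send the query to every index shard, merge the per-shard hits, then ask doc servers for the chosen documents.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Hit { int doc_id; double score; };

// Each index shard covers a disjoint slice of the document collection.
std::vector<std::map<std::string, std::vector<Hit>>> index_shards(3);
std::map<int, std::string> doc_server;   // doc_id -> title/snippet

std::vector<Hit> Search(const std::string& query, std::size_t k) {
  std::vector<Hit> merged;
  for (const auto& shard : index_shards) {             // fan out (in parallel in reality)
    auto it = shard.find(query);
    if (it != shard.end())
      merged.insert(merged.end(), it->second.begin(), it->second.end());
  }
  std::sort(merged.begin(), merged.end(),
            [](const Hit& a, const Hit& b) { return a.score > b.score; });
  if (merged.size() > k) merged.resize(k);             // keep the top-k results
  return merged;
}

int main() {
  index_shards[0]["jaguar"] = {{101, 0.90}};
  index_shards[2]["jaguar"] = {{305, 0.70}, {311, 0.95}};
  doc_server[101] = "Jaguar cars";
  doc_server[305] = "Jacksonville Jaguars";
  doc_server[311] = "Jaguar - Wikipedia";
  for (const Hit& h : Search("jaguar", 2))             // then fetch snippets from doc servers
    std::cout << doc_server[h.doc_id] << " (" << h.score << ")\n";
}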
Machine Translation

Goal: High quality translation of natural language text

[Diagram: non-English source sentences fed through MT, producing English output
 such as “India and the US…”, “According to the…”, “Green potatoes fly”]
Statistical Approach

Viewpoint of statistical Machine Translation (MT):
• Build probabilistic model of translation process
• Explore the translation space to find the maximum-probability translation, given the
  source sentence
Main source of data for building statistical models:
• Parallel aligned corpora (text with sentence-by-sentence translations)
• Source and target language models (trained on huge amounts of text)
  – 5-gram target lang. model makes translations sound more natural

Try Chinese & Arabic translation systems at translate.google.com
Statistical Approach

Language models trained on > 1000 billion words are huge
• 45.6 billion 5-grams
• 66.5% are singletons, but filtering rare events hurts BLEU score
• 1.5 terabytes of count data

Fun system design problems:
• Each translated sentence needs 100,000 to 1M language model lookups
• Language model doesn’t fit on single machine: needs 100s of machines
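
One plausible way to attack both problems, shown as a sketch with invented names (see Brants et al., EMNLP 2007 for the actual design): shard the n-gram table across many machines and batch each sentence's lookups per shard, so a sentence costs a handful of bulk requests rather than ~100,000 individual ones.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

const int kNumShards = 100;   // the real table is spread over 100s of machines
std::vector<std::unordered_map<std::string, double>> shards(kNumShards);  // n-gram -> log-prob

int ShardFor(const std::string& ngram) {
  return static_cast<int>(std::hash<std::string>{}(ngram) % kNumShards);
}

// Group one sentence's n-gram lookups by shard: one batched request per shard touched.
std::map<int, std::vector<std::string>> BatchByShard(const std::vector<std::string>& ngrams) {
  std::map<int, std::vector<std::string>> batches;
  for (const auto& g : ngrams) batches[ShardFor(g)].push_back(g);
  return batches;
}

int main() {
  shards[ShardFor("need a few months to")]["need a few months to"] = -3.2;
  auto batches = BatchByShard({"need a few months to", "a few months to complete"});
  std::cout << "batched requests for this sentence: " << batches.size() << "\n";
}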
NIST 2005 Results (% BLEU)

[Chart: BLEU scores for the NIST 2005 machine translation evaluation]
BLEU score: roughly fraction of multi-word phrase overlap with set of human translations

Official results at:
http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html
Example translation: Arabic - English
(non-Google translation service available on the web)
The Bradi : The inspectors need to “a few months” for end
important their


Paris 13 - 1 ( aa so in in ) - the general manager for agency
announced international for energy atomic Mohammed the Bradi
today Monday that inspectors of international nze' the weapons
need to “a few months” for end important their in Iraq.
Journalistic conference in end of meeting with French External
Minister of Dominique de Villepin that the inspectors said during
“a few their need to important months for end”.
...
The Bradi that Security Council confirmed “understands” that
January 27 final term not.
…
Example translation: Arabic - English
(Google System; 2005)
El Baradei : Inspectors Need “a Few Months” to Complete Their
Mission


Paris 13 - 1 ( AFP ) - The Director General of the International
Atomic Energy Agency Mohamed El Baradei announced today,
Monday, that the international disarmament inspectors need “a
few months” to complete their mission in Iraq.
He said during a press conference at the conclusion of a meeting
with French Foreign Minister Dominique de Villepin that the
inspectors “need a few months to complete their mission.”
...
El Baradei stressed that the Security Council “understands” that
the 27 January deadline is not final.
…
More Data is Better

Each doubling of LM training corpus size: ~0.5% higher BLEU score

[Chart: BLEU score vs. total tokens processed to build the language model]
Locations Mentioned in Books Over Time

[Maps of locations mentioned in books, shown for 1700, 1760, 1820, 1880, 1940, and 2000]

Thanks to Matthew Gray
Source Code Philosophy
•   Google has one large shared source base
     – lots of lower-level libraries used by almost everything
     – higher-level app or domain-specific libraries
     – application-specific code

•   Many benefits:
     – improvements in core libraries benefit everyone
     – easy to reuse code that someone else has written in another context

•   Drawbacks:
     – reuse sometimes leads to tangled dependencies

•   Essential to be able to easily search whole source base
     – gsearch: internal tool for fast regexp searching of source code
     – huge productivity boost: easy to find uses, defs, examples, etc.
     – makes large-scale refactoring or renaming easier
Software Engineering Hygiene
•   Code reviews
•   Design reviews
•   Lots of testing
     – unittests for individual modules
     – larger tests for whole systems
     – continuous testing system


•   Most development done in C++, Java, & Python
     – C++: performance critical systems (e.g. everything for a web query)
     – Java: lower volume apps (advertising front end, parts of gmail, etc.)
     – Python: configuration tools, etc.
Multi-Site Software Engineering
•   Google has moved from one to a handful to 30+ engineering sites
    around the world in the last few years

•   Motivation:
     – hire best candidates, regardless of their geographic location


•   Issues:
     – more coordination needed
     – communication somewhat harder (no hallway conversations, time zone
       issues)
     – establishing trust between remote teams important


•   Techniques:
     – online documentation, e-mail, video conferencing, careful choice of
       interfaces/project decomposition
     – example: BigTable project is split across three sites
Fun Environment for Software Engineering
•   Very interesting problems
    – wide range of areas: low level hw/sw, dist. systems, storage systems,
      information retrieval, machine learning, user interfaces, auction theory,
      new product design, etc.
    – lots of interesting data and computational resources


•   Service-based model for software development is very nice
    – very fluid, easy to make changes, easy to test, small teams can
      accomplish a lot


•   Great colleagues/environment
    – expertise in wide range of areas, lots of interesting talks, etc.


•   Work has significant impact
    – hundreds of millions of users every month
Thanks! Questions...?

Further reading:
• Ghemawat, Gobioff, & Leung. Google File System, SOSP 2003.

• Barroso, Dean, & Hölzle . Web Search for a Planet: The Google Cluster Architecture, IEEE
 Micro, 2003.

• Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI
 2004.

• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A
 Distributed Storage System for Structured Data, OSDI 2006.

• Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation, EMNLP
 2007.


These and many more available at:
• http://labs.google.com/papers.html
• http://labs.google.com/people/jeff
