Lessons Learned While Building Infrastructure
Software at Google
Jeff Dean
jeff@google.com

“Google” Circa 1997 (google.stanford.edu)

“Corkboards” (1999)

Google Data Center (2000)

Google Data Center (2000)

Google Data Center (2000)

Google (new data center 2001)

Google Data Center (3 days later)

Google’s Computational Environment Today
• Many datacenters around the world

Zooming In...

Lots of machines...

Cool...

Low-Level Systems Software Desires

• If you have lots of machines, you want to:
• Store data persistently
– w/ high availability
– high read and write bandwidth

• Run large-scale computations reliably
– without having to deal with machine failures

• GFS, MapReduce, BigTable, Spanner, ...

Google File System (GFS) Design
[Figure: GFS architecture. Clients send metadata requests to the GFS master (replicated, alongside misc. servers); chunk data (C0, C1, C2, C3, C5, …) is read and written directly between clients and chunkservers 1 through N.]
Master manages metadata
Data transfers are directly between clients/chunkservers
Files broken into chunks (typically 64 MB)
Chunks replicated across multiple machines (usually 3)
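To make the division of labor concrete, here is a minimal, hypothetical sketch of the read path in the diagram: the master answers metadata lookups only, and chunk bytes flow directly between client and chunkserver. All names and signatures are illustrative, not the real GFS API.

CHUNK_SIZE = 64 * 1024 * 1024  # files are broken into chunks (typically 64 MB)

class Chunkserver:
    def __init__(self):
        self.chunks = {}  # (file_name, chunk_index) -> bytes

    def read_chunk(self, file_name, chunk_index, offset, length):
        return self.chunks[(file_name, chunk_index)][offset:offset + length]

class Master:
    """Holds only metadata: which chunkservers hold replicas of which chunk."""
    def __init__(self):
        self.locations = {}  # (file_name, chunk_index) -> [chunkserver ids]

    def locate(self, file_name, byte_offset):
        chunk_index = byte_offset // CHUNK_SIZE
        return chunk_index, self.locations[(file_name, chunk_index)]

def read(master, chunkservers, file_name, byte_offset, length):
    # 1. Metadata op: ask the master which chunk covers this offset
    #    and where its (usually 3) replicas live.
    chunk_index, replica_ids = master.locate(file_name, byte_offset)
    # 2. Data path: read directly from one replica; the bytes never
    #    flow through the master.
    replica = chunkservers[replica_ids[0]]
    return replica.read_chunk(file_name, chunk_index,
                              byte_offset % CHUNK_SIZE, length)
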
GFS Motivation and Lessons
• Indexing system clearly needed a large-scale
distributed file system
– wanted to treat whole cluster as single file system

• Developed by subset of same people working on
indexing system
• Identified minimal set of features needed
– e.g. Not POSIX compliant
– actual data was distributed, but kept metadata
centralized
• Colossus: follow-on system developed many years later
that distributed the metadata

• Lesson: Don’t solve everything all at once
MapReduce History
• 2003: Sanjay Ghemawat and I were working on
rewriting indexing system:
– starts with raw page contents on disk
– many phases:
• (near) duplicate elimination, anchor text extraction, language
identification, index shard generation, etc.

– end result is data structures for index and doc serving

• Each phase was a hand-written parallel computation:
– hand parallelized
– hand-written checkpointing code for fault-tolerance

MapReduce
• A simple programming model that applies to many
large-scale computing problems
– allowed us to express all phases of our indexing system
– since used across broad range of computer science areas, plus
other scientific fields
– Hadoop open-source implementation seeing significant usage

• Hide messy details in MapReduce runtime library:
– automatic parallelization
– load balancing
– network and disk transfer optimizations
– handling of machine failures
– robustness
– improvements to core library benefit all users of library!
Typical problem solved by MapReduce
• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results
Outline stays the same;
user writes Map and Reduce functions to fit the problem
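As a concrete instance of this outline, the sketch below runs word count as a single-process simulation of the model; the user-supplied pieces really are this small, while the real runtime library distributes the same steps across machines and handles failures. Names here are illustrative.

from collections import defaultdict

def word_count_map(record):
    # Map: extract (word, 1) for each word in the input record.
    for word in record.split():
        yield word, 1

def word_count_reduce(word, counts):
    # Reduce: aggregate the counts for one word.
    yield word, sum(counts)

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:             # Read a lot of data
        for key, value in map_fn(record):
            groups[key].append(value)  # Shuffle: group values by key
    results = []
    for key in sorted(groups):         # Sort, then Reduce each group
        results.extend(reduce_fn(key, groups[key]))
    return results                     # Write the results

print(map_reduce(["the cat", "the hat"], word_count_map, word_count_reduce))
# [('cat', 1), ('hat', 1), ('the', 2)]
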
MapReduce Motivation and Lessons
• Developed by two people who were also doing the
indexing system rewrite
– squinted at various phases with an eye towards coming up with
common abstraction

• Initial version developed quickly
– proved initial API utility with very simple implementation
– rewrote much of implementation 6 months later to add lots of
the performance wrinkles/tricks that appeared in original paper

• Lesson: Very close ties with initial users of system
make things happen faster
– in this case, we were both building MapReduce and
using it simultaneously

BigTable: Motivation
• Lots of (semi-)structured data at Google
– URLs: Contents, crawl metadata, links, anchors, pagerank, …
– Per-user data: User preferences, recent queries, …
– Geographic locations: Physical entities, roads, satellite image
data, user annotations, …

• Scale is large
• Want to be able to grow and shrink resources devoted
to system as needed

BigTable: Basic Data Model
• Distributed multi-dimensional sparse map
(row, column, timestamp) → cell contents
[Figure, built up over several slides: rows (e.g. “www.cnn.com”) by columns (e.g. “contents:”), where each cell holds timestamped versions of its contents (“<html>…” at t3, t11, t17).]
• Rows are ordered lexicographically
• Good match for most of our applications
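A minimal, in-memory sketch of this data model: a sorted map from (row, column, timestamp) to cell contents, with rows kept in lexicographic order and the newest version of a cell returned by default. This is illustrative only, not the Bigtable API.

import bisect

class Table:
    """(row, column, timestamp) -> contents, rows ordered lexicographically."""
    def __init__(self):
        self.keys = []   # sorted keys: (row, column, -timestamp), newest first
        self.cells = {}  # key -> cell contents

    def set_cell(self, row, column, timestamp, contents):
        key = (row, column, -timestamp)
        if key not in self.cells:
            bisect.insort(self.keys, key)
        self.cells[key] = contents

    def get_cell(self, row, column):
        # The first key >= (row, column, -inf) is the newest version.
        i = bisect.bisect_left(self.keys, (row, column, float("-inf")))
        if i < len(self.keys) and self.keys[i][:2] == (row, column):
            return self.cells[self.keys[i]]
        return None

t = Table()
t.set_cell("www.cnn.com", "contents:", 3, "<html>... old")
t.set_cell("www.cnn.com", "contents:", 17, "<html>... new")
print(t.get_cell("www.cnn.com", "contents:"))  # "<html>... new"
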
Tablets & Splitting
[Figure, built up over several slides: rows from “aaa.com” through “cnn.com”, “cnn.com/sports.html”, …, “website.com”, …, “yahoo.com/kids.html”, “yahoo.com/kids.html0”, …, “zuppa.com/menu.html”, with “language:” (EN) and “contents:” (“<html>…”) columns. The ordered rows are partitioned into contiguous ranges called tablets; a growing tablet splits at a row key such as “yahoo.com/kids.html0”.]
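A hypothetical sketch of the splitting idea shown above: a tablet is a contiguous range of ordered row keys, and once it grows past a threshold it splits at a key near its midpoint. The threshold and names are invented for illustration.

class Tablet:
    def __init__(self, rows):
        self.rows = sorted(rows)  # contiguous range of row keys, in order

    def maybe_split(self, max_rows):
        # Split at the midpoint row key once the tablet grows too large;
        # both halves remain contiguous, ordered row ranges.
        if len(self.rows) <= max_rows:
            return [self]
        mid = len(self.rows) // 2
        return [Tablet(self.rows[:mid]), Tablet(self.rows[mid:])]

t = Tablet(["aaa.com", "cnn.com", "cnn.com/sports.html",
            "website.com", "zuppa.com/menu.html"])
left, right = t.maybe_split(max_rows=4)
print(left.rows, right.rows)  # two smaller tablets covering adjacent ranges
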
BigTable System Structure
[Figure, built up over several slides: a Bigtable cell. A Bigtable client uses the Bigtable client library; Open() goes to the lock service, metadata ops go to the Bigtable master, and reads/writes go directly to the tablet servers.]
• Bigtable master: performs metadata ops + load balancing
• Bigtable tablet servers (many): each serves data
• Cluster scheduling system: schedules tasks onto machines
• Cluster file system: holds tablet data, logs
• Lock service: holds metadata, handles master-election
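To tie the roles in the figure together, here is a minimal, hypothetical simulation of the client path: opening a table is a metadata operation, while subsequent reads and writes go straight to tablet servers. The lock service, scheduler, and cluster file system are omitted, and none of this is the real Bigtable API.

class TabletServer:
    """Serves data for the tablets assigned to it."""
    def __init__(self):
        self.cells = {}

    def write(self, row, value):
        self.cells[row] = value

    def read(self, row):
        return self.cells.get(row)

class BigtableMaster:
    """Performs metadata ops + load balancing (only location lookup shown)."""
    def __init__(self, tablet_locations):
        # table -> sorted list of (start_row, tablet server index)
        self.tablet_locations = tablet_locations

    def locate(self, table):
        return self.tablet_locations[table]

class BigtableClient:
    def __init__(self, master, servers):
        self.master, self.servers, self.tablets = master, servers, None

    def open(self, table):
        # Metadata op: fetch and cache tablet locations once.
        self.tablets = self.master.locate(table)

    def _server_for(self, row):
        # The last tablet whose start key <= row owns it (rows are ordered).
        index = self.tablets[0][1]
        for start_row, server_index in self.tablets:
            if start_row <= row:
                index = server_index
        return self.servers[index]

    def read(self, row):
        # Data path: talk directly to the tablet server, not the master.
        return self._server_for(row).read(row)

servers = [TabletServer(), TabletServer()]
client = BigtableClient(BigtableMaster({"web": [("", 0), ("m", 1)]}), servers)
client.open("web")
servers[1].write("www.cnn.com", "<html>...")
print(client.read("www.cnn.com"))  # "<html>..."
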
BigTable Status
• Production use for 100s of projects:
– Crawling/indexing pipeline, Google Maps/Google Earth/Streetview,
Search History, Google Print, Google+, Blogger, ...

• Currently 500+ BigTable clusters
• Largest cluster:
– 100s PB data; sustained: 30M ops/sec; 100+ GB/s I/O

• Many asynchronous processes updating different
pieces of information
– no distributed transactions, no cross-row joins
– initial design was just in a single cluster
– follow-on work added eventual consistency across
many geographically distributed BigTable instances

Spanner
• Storage & computation system that runs across many datacenters
– single global namespace
• names are independent of location(s) of data
• fine-grained replication configurations

– support mix of strong and weak consistency across datacenters
• Strong consistency implemented with Paxos across tablet replicas
• Full support for distributed transactions across directories/machines

– much more automated operation
• automatically changes replication based on constraints and usage patterns
• automated allocation of resources across entire fleet of machines

Design Goals for Spanner
• Future scale: ~10^5 to 10^7 machines, ~10^13 directories,
~10^18 bytes of storage, spread across 100s to 1000s of
locations around the world

– zones of semi-autonomous control
– consistency after disconnected operation
– users specify high-level desires:
“99%ile latency for accessing this data should be <50ms”
“Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
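The “high-level desires” bullet is the key idea: placement becomes a declarative constraint the system enforces, rather than something operators arrange by hand. Below is a hypothetical sketch of checking a replica placement against such a policy; the types and names are invented for illustration, not Spanner’s API.

from collections import Counter
from dataclasses import dataclass

@dataclass
class PlacementConstraint:
    region: str
    min_replicas: int

def violations(replica_regions, constraints):
    """Return the constraints the current replica placement fails to meet."""
    counts = Counter(replica_regions)
    return [c for c in constraints if counts[c.region] < c.min_replicas]

# "Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia"
policy = [PlacementConstraint("EU", 2),
          PlacementConstraint("US", 2),
          PlacementConstraint("Asia", 1)]
print(violations(["EU", "EU", "US", "Asia"], policy))
# [PlacementConstraint(region='US', min_replicas=2)] -> add a US replica
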
Spanner Lessons
• Several variations of the eventual client API
• Started to develop with many possible customers in mind, but no
particular customer we were working closely with
• Eventually we worked closely with Google ads system as initial
customer
– first real customer was very demanding (real $$): good and bad
• Different API than BigTable
– Harder to move users with existing heavy BigTable usage

Designing & Building Infrastructure
Identify common problems, and build software systems to
address them in a general way
• Important to not try to be all things to all people
– Clients might be demanding 8 different things
– Doing 6 of them is easy
– …handling 7 of them requires real thought
– …dealing with all 8 usually results in a worse system
• more complex, compromises other clients in trying to satisfy
everyone

Designing & Building Infrastructure (cont)
Don't build infrastructure just for its own sake:
• Identify common needs and address them
• Don't imagine unlikely potential needs that aren't really there
Best approach: use your own infrastructure (especially at first!)
• (much more rapid feedback about what works, what doesn't)

If not possible, at least work very closely with initial client team
• ideally sit within 50 feet of each other
• keep other potential clients’ needs in mind, but get system
working via close collaboration with first client first

Thanks!
Further reading:
• Ghemawat, Gobioff, & Leung. Google File System, SOSP 2003.
• Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003.
• Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed
Storage System for Structured Data, OSDI 2006.

• Corbett et al. Spanner: Google’s Globally Distributed Database, OSDI 2012.
• Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems, OSDI 2006.
• Pinheiro, Weber, & Barroso. Failure Trends in a Large Disk Drive Population, FAST 2007.
• Barroso & Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines, Morgan & Claypool Synthesis Series on Computer Architecture, 2009.
• Malewicz et al. Pregel: A System for Large-Scale Graph Processing, PODC 2009.
• Schroeder, Pinheiro, & Weber. DRAM Errors in the Wild: A Large-Scale Field Study, SIGMETRICS 2009.
• Protocol Buffers. http://code.google.com/p/protobuf/

See: http://research.google.com/papers.html
http://research.google.com/people/jeff
