Google's architecture allows it to scale to support millions of users through:
1. Caching content at the perimeter to reduce load on backend systems.
2. Distributing content and computations across dozens of data centers worldwide, each built from modular cells of server racks.
3. Custom Linux kernels and software stacks optimized for Google's unique needs, like the Google File System for storage and MapReduce for computations.
Open Talk Series at Aditi illuminates new ideas
1. Learning and Development presents: Open Talk Series
Be part of the learning experience at Aditi.
Join the talks. It's free.
Free as in freedom at work, not free beer.
It's not training. It's a mind-opener.
Speak at these events. Or bring an expert/friend to talk.
Mail OpenTalk@aditi.com with topic and availability.
Usually at 4.30PM Wednesdays.
Open Talk Series: a series of illuminating talks and interactions that open our minds to new ideas and concepts; that make us look for newer or better ways of doing what we did; or point us to exciting things we have never done before. A range of topics on Technology, Business, Fun and Life.
2. HOW TO ENJOY A TALK
• Bring coffee & friends
• Switch OFF mobile
• Switch ON mind
• Sign attendance sheet
• SHARE your wisdom
• QUESTION notions
• THANK the Talker
• SPREAD the good word
3. New Champion
Sahil Sagar
Aditi Technologies | Partnering Innovation
4. Agenda
• We are not talking about crawler
• No discussion on PageRank… maybe?
5. The art of scale
10-50 users 100-500 users 500-10000
6. Scale ????
800,000 Machines
Largest Linux base
7. What gives us this scale?
• Good code?
• More servers?
• Powerful servers?
8. Let's see what gives Google the scale
Architecture (layers, top to bottom)
The apps on top of it:
• GOOGLE APPS: SEARCH (INDEX, CRAWL), GMAIL...
• GOOGLE APP ENGINE
• Languages: Python, Java, C++, Sawzall, other
The Secret Sauce:
• GWQ
• Mapreduce
• BigTable
• Chubby Lock
• GFS / GFS II
Infrastructure:
• INTERIOR NETWORK IPv6
• RHEL 2.6.X PAE
• SERVER HARDWARE
• RACK
• DC
• Exterior Network
9. Scale in Google
The architecture layers are covered in this order:
1. The first touch
2. Size does matter
3. The Safe
4. Operating System Implementation
5. Interior Network Architecture
10. The first touch to the services
11. The first touch to the service
Components shown on the path from the client browser to the interior network:
• Client Browser
• Firewall (ports 80/443)
• DMZ
• Perimeter Firewall (ports 80/443)
• Squid Reverse Proxy
• NetScaler (HTTP multiplexing)
• GWS Web Server Farm
• Interior Network: Cell (GFS II etc.)
12. The touch is not always real
• Uses Squid Reverse Proxy
• Perimeter cache hit rates 30-60% = Huge!
• Dependent on search complexity/user preferences/traffic type
• All image thumbnails cached, much multimedia cached
• Expensive common queries cached (common words like 'Obama') as they require significant back-end processing.
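The perimeter caching idea on slide 12 can be sketched as a small in-memory LRU cache sitting in front of an expensive backend call. This is only an illustration under stated assumptions: `PerimeterCache`, `backend_search` and the capacity of 2 are hypothetical names and values, and a real Squid reverse proxy is configured rather than coded like this.

```python
from collections import OrderedDict


class PerimeterCache:
    """Tiny LRU cache standing in for a reverse proxy cache (hypothetical)."""

    def __init__(self, backend, capacity=2):
        self.backend = backend        # expensive function called on a miss
        self.capacity = capacity      # maximum number of cached queries
        self.store = OrderedDict()    # query -> result, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, query):
        if query in self.store:
            self.hits += 1
            self.store.move_to_end(query)   # mark as most recently used
            return self.store[query]
        self.misses += 1
        result = self.backend(query)        # only misses reach the backend
        self.store[query] = result
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
        return result

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def backend_search(query):
    # Placeholder for significant back-end processing (assumption).
    return f"results for {query}"


cache = PerimeterCache(backend_search, capacity=2)
for q in ["obama", "weather", "obama", "obama", "news"]:
    cache.get(q)
print(f"hit rate: {cache.hit_rate():.0%}")
```

In this toy run the repeated query is served from the cache, so only three of the five requests reach the backend.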
14. Worldwide Data Centres
The last estimate was 36 data centers, 300+ GFS II clusters and upwards of 800K machines.
15. The Modular Data Centre
• A standard Google Modular DC (Cell) holds 1160 servers in 30 racks (40U) and consumes 250KW of power.
• This is the "atomic" data centre building block of Google.
• A data centre would consist of 100s of Modular Cells.
16. THE Safe
How is a server stored in the Data Centre?
17. Google Rack (GOOG rack)
EVERYTHING custom!
• Optimized motherboards
• Have their own HW builds
• Build redundancy on top of failure
• Motherboard directly mounted into rack
• Servers have no casing - just bare boards
• Assist with heat dispersal issues
18. THE OPERATING SYSTEM
The Core Software on each of those servers
19. OPERATING SYSTEM
- 100% Redhat Linux based since 1998 inception
- RHEL
- 2.6.X Kernel
- PAE
- Custom glibc.. rpc... ipvs...
- Custom FS (GFS II)
- Custom Kerberos
- Custom NFS
- Custom CUPS
- Custom gPXE bootloader
- Custom EVERYTHING.....
Kernel/Subsystem Modifications
- tcmalloc: replaces glibc 2.3 malloc; much faster, and works very well with threads
- rpc: the RPC layer was extensively modified to increase performance and reduce latency (52%/40%)
- Significantly modified kernel and subsystems, all IPv6 enabled
21. Section II – Google's Major Glue
1. Google File System Architecture - GFS II
2. Google Database - Bigtable
3. Google Computation - Mapreduce
22. GOOGLE FILE SYSTEM
Manages the underlying Data on behalf of the upper layers
and ultimately the applications
23. GFS versus NFS
Network File System (NFS)
• Single machine makes part of its file system available to other machines
• Sequential or random access
• PRO: Simplicity, generality, transparency
• CON: Storage capacity and throughput limited by single server
Google File System (GFS)
• Single virtual file system spread over many machines
• Optimized for sequential read and local accesses
• PRO: High throughput, high capacity
• "CON": Specialized for particular types of applications
University of Pennsylvania
24. FILE SYSTEM I – GFS II
• Elegant master failover
• Chunk size is now 1MB
• Only ever lost one 64MB chunk (in GFS I) during its entire production deployment, so it is assumed to be extremely reliable
25. CAP Theorem
(Brewer's theorem)
• Consistency: All nodes see the same data at the same
time
• Availability: Node failures do not prevent survivors
from continuing to operate
• Partition tolerance: The system continues to operate
despite arbitrary message loss
26. GOOGLE DATABASE
Accesses the underlying Data on behalf of the upper layers
and ultimately the applications
27. Why not commercial DB?
• Scale is too large for most commercial databases
• Cost would be very high
– Building internally means system can be applied
across many projects for low incremental cost
• Low-level storage optimizations help
performance significantly
– Much harder to do when running on top of a database
layer
“Also fun and challenging to build large-scale
systems”
28. BigTable
• A distributed storage system for managing structured data.
• Scalable
– Thousands of servers
– Terabytes of in-memory data
– Petabyte of disk-based data
– Millions of reads/writes per second, efficient scans
• Self-managing
– Servers can be added/removed dynamically
– Servers adjust to load imbalance
• Used for many Google projects
– Web indexing, Personalized Search, Google Earth, Google Analytics,
Google Finance, …
29. BigTable
• Physically sorted on row-key – like a row-store
• Column families - like column-stores
• Variable (record-by-record) columns within a column family
• Column-values versioned; stored in reverse chronological order
• Designed to store hyperlink structure of web
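The data model in the bullets above can be sketched in a few lines of Python. This is a toy, single-process illustration and not Bigtable's actual API: the class `TinyTable`, its method names and the example row keys are all assumptions made for the sketch. It keeps rows sorted by row key, groups columns into declared families with free-form qualifiers, and stores each cell's versions newest first.

```python
import bisect


class TinyTable:
    """Toy model of a row-sorted, column-family, versioned table (hypothetical)."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.row_keys = []             # kept sorted, like a row-store
        self.rows = {}                 # row key -> {(family, qualifier): versions}

    def put(self, row_key, family, qualifier, timestamp, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self.rows:
            bisect.insort(self.row_keys, row_key)
            self.rows[row_key] = {}
        # Qualifiers may differ from row to row within a family.
        versions = self.rows[row_key].setdefault((family, qualifier), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row_key, family, qualifier):
        """Return the most recent value of a cell."""
        return self.rows[row_key][(family, qualifier)][0][1]

    def scan(self, start, end):
        """Return row keys in [start, end), relying on the sorted order."""
        lo = bisect.bisect_left(self.row_keys, start)
        hi = bisect.bisect_left(self.row_keys, end)
        return self.row_keys[lo:hi]


table = TinyTable(["contents", "anchor"])
table.put("com.cnn.www", "contents", "html", 1, "<html>v1")
table.put("com.cnn.www", "contents", "html", 2, "<html>v2")
table.put("com.cnn.www", "anchor", "cnnsi.com", 1, "CNN")
table.put("com.bbc.www", "contents", "html", 1, "<html>bbc")
table.put("org.example", "contents", "html", 1, "<html>example")
print(table.get("com.cnn.www", "contents", "html"))
print(table.scan("com.", "com.z"))
```

Because rows are sorted, a range scan over a key prefix touches only neighbouring rows, which is why the choice of row key matters for storing the web's link structure.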
30. GOOGLE MAPREDUCE
Computes the underlying Data on behalf of the applications
31. Mapreduce I
Map Reduction can be seen as a way to exploit massive parallelism by breaking a task down into constituent parts and executing on multiple processors.
The major functions are MAP and REDUCE (with a number of intermediary steps):
• MAP: Break task down into parallel steps
• REDUCE: Combine results into final output
Shown is a 2-pipeline Map Reduction (there are 24 Map Reductions in the indexing pipeline).
Mappers and Reducers usually run on separate processors (with a 90% loss of reducers, the job still completed!).
32. Word-Count using MapReduce
Problem: determine the frequency of each word in a large
document collection
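A minimal single-process sketch of the word-count problem follows. It only mimics the map, shuffle and reduce steps in plain Python; the function names and the sample documents are assumptions for illustration, and nothing here is distributed or tied to Google's MapReduce library.

```python
from collections import defaultdict


def map_phase(document):
    """MAP: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Intermediate step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """REDUCE: combine the values for one word into a final count."""
    return word, sum(counts)


def word_count(documents):
    # Each document could be mapped independently, which is where the
    # parallelism would come from on a real cluster.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

Since each reduce call only sees the values for a single word, the reducers could also run independently of one another.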
33. What runs on top of all this
34. PageRank: Intuition
[Figure: link graph of pages A to J, annotated "Shouldn't E's vote be worth more than F's?" and "How many levels should we consider?"]
• Imagine a contest for The Web's Best Page
– Initially, each page has one vote
– Each page votes for all the pages it has a link to
– To ensure fairness, pages voting for more than one page must
split their vote equally between them
– Voting proceeds in rounds; in each round, each page has the
number of votes it received in the previous round
– In practice, it's a little more complicated - but not much!
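The voting contest can be written out directly. The sketch below is an illustration of the simplified rules above only (no damping, no special handling beyond skipping pages without links); the three-page graph and the function name `vote_round` are assumptions made for the example.

```python
def vote_round(links, votes):
    """Run one round: each page splits its votes equally among its links."""
    new_votes = {page: 0.0 for page in links}
    for page, targets in links.items():
        if not targets:
            continue  # a page without links casts no votes in this sketch
        share = votes[page] / len(targets)
        for target in targets:
            new_votes[target] += share
    return new_votes


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
votes = {page: 1.0 for page in links}  # initially, each page has one vote
for _ in range(100):
    votes = vote_round(links, votes)
print({page: round(v, 3) for page, v in votes.items()})
```

On this small graph the votes settle after repeated rounds, with B ending up with half as many votes as A or C because it only ever receives half of A's vote.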
35. Random Surfer Model
• PageRank has an intuitive basis in random walks
on graphs
• Imagine a random surfer, who starts on a random
page and, in each step,
– with probability d, clicks on a random link on the page
– with probability 1-d, jumps to a random page (bored?)
• The PageRank of a page can be interpreted as the
fraction of steps the surfer spends on the
corresponding page
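The random surfer description translates into a short iterative computation. This is a sketch under stated assumptions rather than Google's implementation: the value d = 0.85 is a commonly quoted choice that the slide does not give, the handling of pages without links (spread evenly) is one common convention, and the graph and function name are made up for the example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Estimate the fraction of steps a random surfer spends on each page."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}  # surfer starts on a random page
    for _ in range(iterations):
        # With probability 1-d the surfer jumps to a random page.
        new_rank = {page: (1.0 - d) / n for page in links}
        for page, targets in links.items():
            if targets:
                # With probability d the surfer clicks a random link.
                share = d * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links sends the surfer anywhere.
                for target in links:
                    new_rank[target] += d * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({page: round(r, 3) for page, r in ranks.items()})
```

The ranks sum to 1, so each value can be read as the share of the surfer's time spent on that page.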
36. BUILD YOUR OWN GOOGLE
The Basic Open Source Tools
37. The Google Stack (vs Yahoo'ish/Open Source)
Conceptual Overview: Google vs. Open Source
• Applications. Google: GOOGLE APPS (SEARCH, INDEX, CRAWL, GMAIL...), APP ENGINE. Open Source: CLIENT APPLICATION.
• Languages. Google: Python, Java, C++, Sawzall, other. Open Source: Pig Latin, Python, PHP, Java .... anything.
• Task Queue. Google: GWQ. Open Source: Job Tracker.
• Computation. Google: Google's Mapreduce. Open Source: Hadoop Framework (Hadoop Mapreduce).
• Database. Google: BigTable. Open Source: Hbase (Bigtable equiv.).
• Lock service. Google: Chubby Lock.
• File system. Google: GFS / GFS II. Open Source: HDFS (hadoop).
• Network. Both: INTERIOR NETWORK IPv6.
• Operating system. Google: RHEL 2.6.X PAE. Open Source: CentOS 2.6.X PAE.
• Hardware. Both: SERVER HARDWARE, RACK, DC, Exterior Network.
The Google middle layers are labelled "Secret Sauce"; their counterparts are labelled "Open Source".
(Other tools such as crawlers, indexers readily available)
38. END
(Thank you)
39. Pre Presentation
The Google Philosophy (according to ed)
• Jedis build their own lightsabres (the MS Eat your own Dog Food)
• Parallelize Everything
• Distribute Everything (to atomic level if possible)
• Compress Everything (CPU cheaper than bandwidth)
• Secure Everything (you can never be too paranoid)
• Cache (almost) Everything
• Redundantize Everything (in triplicate usually)
• Latency is VERY evil
40. Special Thanks to ….
The Anatomy of the Google Architecture
“The unofficial Version“
V1.0 November 2009
• Ed Austin
• {ed, edik} @i-dot.com
42. Keep Learning
For any suggestions on topics, feedback etc.,
Contact OpenTalk@aditi.com