Chapter Three: Google Technology




     “Apart from the problems of scaling traditional search techniques to data of this
     magnitude, there are new technical challenges involved with using the additional
     information present in hypertext to produce better search results.... Fast crawling
     technology is needed to gather the Web documents and keep them up to date.
     Storage space must be used efficiently to store indices and, optionally, the
     documents themselves. The indexing system must process hundreds of gigabytes of
     data efficiently. Queries must be handled quickly, at the rate of hundreds to
     thousands per second.” – Sergey Brin and Lawrence Page, 1997.1
In the beginning, there was BackRub, the service that became Google. Today, Google is most closely associated with its PageRank algorithm. PageRank is, at its core, a voting algorithm weighted for importance: the primary indicator of a Web page’s importance is the number of pages that link to it.
Messrs. Brin and Page soon added another factor to the vote: the number of people who click on a Web page. The more clicks a page receives, the more weight that page is given. Over time, still other factors have been added to the PageRank algorithm; for example, the frequency with which the content of a page changes.
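
To make the voting idea concrete, the sketch below shows a bare-bones, PageRank-style power iteration in Python. It is an illustration only: the toy link graph, damping factor and iteration count are assumptions made for the example, not Google’s production parameters, and the click and freshness signals mentioned above are omitted.

    # Simplified PageRank: each page's score is a weighted "vote" from the
    # pages linking to it. Illustrative only; not Google's production code.

    def pagerank(links, damping=0.85, iterations=20):
        """links maps each page to the list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        rank = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, targets in links.items():
                if not targets:
                    continue
                share = damping * rank[page] / len(targets)  # split this page's vote
                for target in targets:
                    new_rank[target] += share
            rank = new_rank
        return rank

    # Page "c" is linked to by both "a" and "b", so it earns the highest score.
    print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))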
Google’s PageRank technology is closely allied with Internet search. Voting algorithms are less effective in enterprise search, for instance. The attention given to Google and its search technology dominates popular thinking about the company. Google search is like a nova: the luminescence makes it difficult for the observer to see other aspects of the phenomenon clearly or easily.

   1. From “The Anatomy of a Large-Scale Hypertextual Web Search Engine,”
   www-db.stanford.edu/~backrub/google.html


Radiance aside, Google is a technology company.2 Some of that technology, as described in technical papers such as the earliest one, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” is demanding. Later papers such as “MapReduce: Simplified Data Processing on Large Clusters” can be a slow read.3 Because Google is technology, explaining what Google does in an easily digestible way is difficult. The diagram below provides an unauthorized snapshot of Google’s computing framework.




   [Diagram: Google’s computing framework, with components labeled a through d as described
   below.]
   Important Google technologies that underlie this diagram of the Googleplex
   include: [a] modifications to Linux to permit large file sizes and other functions so
   as to accelerate the overall system; [b] a distributed architecture that allows
   applications and scaling to be “plugged in” without the type of hands-on set-up
   other operating systems require; [c] a technical architecture that is similar at every
   level of scale; [d] a Web-centric architecture that allows new types of applications
   to be built without a programming language limitation.



     2. The annex to this monograph contains a listing of more than 60 Google patents. The list is
     not all-inclusive; however, it does provide the patent number and a brief description for some of
     Google’s most important patents. The PageRank patent belongs to the trustees of Stanford
     University. Google’s patent efforts have focused on systems and methods for relevance,
     advertising, and other core foci of the company. Google is creating a patent fence to protect its
     interests.
     3. Jeff Dean, former Alta Vista researcher and a Google senior engineer, has been an
     advocate of MapReduce. His most recent papers are available on his Web page at http://
     labs.google.com/people/jeff/.



Google’s technology has emerged from a series of continuous improvements, or what Japanese management consultants call kaizen. Each individual technical change may be inconsequential to the average user of Google. Taken as a whole, however, Google’s “technological advantage” comes from incremental innovations, clever adaptations of research-computing concepts, and Byzantine tweaks to Linux. Some day, a historian of technology will be able to identify, from the hundreds of improvements that Google has engineered in the last nine years, one or two that stand with PageRank in importance. Critics of Google will note that the company has grafted processes from many different sources onto its core technology.
To illustrate, the structure of Google’s data centers and the messages passed to and from those data centers are in many ways a variant of grid computing.4 Google’s ability to read data from many computers simultaneously is reminiscent of BitTorrent’s technology.5 Google’s use of commodity or “white box” hardware in its data centers is an indication of Google’s hacker ethos. The use of memory and discs to store multiple copies of data comes from the frontiers of computing.
Google’s approach to technology, then, is eclectic and in many ways represents a building
block approach to large-scale systems. Google benefits from that eclecticism in several ways.
First, Google’s computational framework delivers sizzling performance from low-cost
hardware. Second, Google worked around the bottlenecks of such operating systems as
Solaris, Windows Advanced Server, and off-the-shelf Linux. Third, Google took good
programming ideas from other languages, implementing new functions and libraries to
eliminate most of the manual coding required to parallelise an application across Google’s
servers.6
According to Jeff Dean, one of Google’s senior engineers, “Google engineering is sort of
chaotic.”7 This is neither surprising nor necessarily a negative. The Googleplex is a toy box
for engineers and programmers. The tools are sophisticated. The challenging problems and the calibre of one’s peers make Google “the place to be” for the best and brightest technical talent in the world. The nature of creativity, combined with Google’s approach to innovation, makes it difficult to predict the next big thing from Google.
Before reviewing selected parts of Google’s technology in somewhat more detail, the diagram
“Google’s Computing Framework” provides an overview of the Googleplex and some of its
technologies. These will be touched upon in this section.



   4. Grid computing is applying resources from many computers in a network to a single problem
   or application. Google uses grid-like technology in its distributed computing system.
   5. BitTorrent is a peer-to-peer file distribution tool written by programmer Bram Cohen in
   2001. The reference implementation is written in Python and is released under the MIT License.
   6. Google has anywhere from 100,000 to 165,000 or more servers. Servers are organized into
   clusters. Clusters may reside within one rack or across multiple racks of servers. Some Google
   functions are distributed across data centers.
   7. From Dr Dean’s speech at the University of Washington in October 2003. See http://
   www.uwtv.org/programs/displayevent.asp?rid=2459.


PageRank requires a great deal of computing horsepower to work. When Google got underway in 1996, Messrs. Brin and Page had limited computing horsepower at their disposal. To make PageRank work, they had to figure out how to run the algorithm on the garden-variety computers available to them.
From the beginning – and this is an important issue with regard to Google’s almost-certain
collision course with Microsoft – Google had to solve both software engineering and
hardware engineering issues to make Google Search viable. In fact, when discussing Google
technology, it is important to keep in mind that PageRank is important only because it can run
quickly in the real world, not in a sterile computer lab illuminated with the blue glow of
supercomputers.
The figure Google’s Fusion: Hardware and Software Innovations shows that Google’s technology framework has two areas of activity. There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub-one-second response times that Google is able to maintain despite its surging growth in usage, applications and data processing.
 Google’s Fusion: Hardware and Software Innovations

   [Figure: the Google phenomenon comes from the fusion that occurs when PageRank’s software
   engineering and hardware engineering interact. Google’s technology delivers supercomputer
   applications for mass markets.]
The other effort focuses on hardware. Google has refined server racks, cable placement,
cooling devices, and data center layout. The payoff is lower operating costs and the ability to
scale as demand for computing resources increases. With faster turnaround and the
elimination of such troublesome jobs as backing up data, Google’s hardware innovations give
it a competitive advantage few of its rivals can equal as of mid-2005.
PageRank, with its layering of additional computations added over the years, is a software problem of considerable difficulty. The Google system must find Web pages and perform dozens, if not hundreds, of analyses of those Web pages. Consider the links pointing to a Web page. Google must keep track of them for more than eight billion Web pages. For a single Web page with one link pointing to it, the problem is trivial. One link equals one pointer. But what happens when a site has 10,000 links pointing to it? The problem becomes many times larger and more computationally demanding. Some of these links are likely to come from sites that have more traffic than others. Some of the links may come from sites that have spoofed Google for fun or profit. The calculations to sort out the “value” of each of these links add to the computational work associated with PageRank. Keeping track of these factors is a big job. Sizing up different factors against one another for a single page can be hard without a calculator to help. Take the same task and apply it to a couple of billion Web pages, and the computing task becomes one for a supercomputer.
Yet this task is everyday stuff for Google and its PageRank process. Users do not give much thought to the technology that underpins a routine query or to the roughly 300 million queries Google handles each day, which works out to several thousand queries every second, in dozens of languages, from users worldwide.
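
The arithmetic behind that per-second load is straightforward, assuming the 300 million daily queries are spread evenly around the clock:

    # Rough arithmetic only: 300 million queries spread over one day.
    queries_per_day = 300_000_000
    seconds_per_day = 24 * 60 * 60                   # 86,400
    print(round(queries_per_day / seconds_per_day))  # ~3,472 queries per second on average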
Google’s technology cannot be separated from search. Search was the prime mover in the
Google universe. Once Messrs. Brin and Page were able to fiddle with a limited number of
commodity computers and make their PageRank algorithm work, Google was headed down a
road that it still follows.
The software requires a suitable hardware and network infrastructure in which to operate.
Without Google’s hardware and software, there would be no Google. Hardware and software
are inextricably linked at Google. With each new advance in software, Google’s engineers
must make correspondingly significant advances in hardware. And when the hardware engineers come up with an advance, the software engineers greedily use it to extend the functionality of their software.
What Google owns is its own snappy, turbocharged supercomputer, interesting software tools,
and several thousand people trying to figure out what else the Googleplex can do. Some of the
tinkerers come at the problem from bits and bytes, writing code, and weaving applications out
of the available functions. The result is a brilliant product.
Others come at the problem from the soldering iron and screwdriver angle. These engineers
look for ways to build hardware and physical systems that can perform the calculations needed
to make PageRank work. Google’s approach to data centers, the racks in the data centers, and
the devices in the racks in the data centers is as clever as the company’s search system. The
hardware has to be more than clever. The hardware has to work 24x7, under continuous load,
and in locations from Switzerland to Beijing. The synergy between software and hardware is
perhaps one of Google’s major accomplishments.



How Google Is Different from MSN and Yahoo
Google’s technology is simultaneously just like other online companies’ technology, and very
different. A data center is usually a facility owned and operated by a third party where
customers place their servers. The staff of the data center manage the power, air conditioning
and routine maintenance. The customer specifies the computers and components. When a data
center must expand, the staff of the facility may handle virtually all routine chores and may
work with the customer’s engineers for certain more specialized tasks.
Before looking at some significant engineering differences between Google and two of its
major competitors, review this list of characteristics for a Google data center.
     1   Google data centers now number about two dozen, although no one outside Google knows
         the exact number or their locations. They come online and, under the direction of the
         Google File System, automatically start getting work from other data centers. These
         facilities, sometimes filled with 10,000 or more Google computers, find one another and
         configure themselves with minimal human intervention.
     2   The hardware in a Google data center can be bought at a local computer store. Google
         uses the same types of memory, disc drives, fans and power supplies as those in a
         standard desktop PC.
     3   Each Google server comes in a standard case called a pizza box with one important
         change: the plugs and ports are at the front of the box to make access faster and easier.
     4   Google’s racks are assembled to hold servers on both their front and back sides.
         This effectively allows a standard rack, normally holding 40 pizza box servers, to hold
         80.
     5   A Google data center can go from a stack of parts to online operation in as little as 72
         hours, unlike more typical data centers that can require a week or even a month to get
         additional resources online.
     6   Each server, rack and data center works in a way that is similar to what is called “plug
         and play.” Like a mouse plugged into the USB port on a laptop, Google’s network of data
         centers knows when more resources have been connected. These resources, for the most
         part, go into operation without human intervention.
Several of these factors are dependent on software. This overlap between the hardware and
software competencies at Google, as previously noted, illustrates the symbiotic relationship
between these two different engineering approaches. From its inception, Google’s software and hardware have been tightly coupled. Google is neither a software company nor a hardware company. Google is, like IBM, a company that owes its existence to both hardware and software. Unlike IBM, Google has a business model that is advertiser supported. Technically, Google is conceptually closer to IBM (at one time a hardware and software company) than it is to Microsoft (primarily a software company) or Yahoo! (an integrator of multiple software technologies).




Software and hardware engineering cannot be easily segregated at Google. At MSN and Yahoo, hardware and software are more loosely coupled. Two examples will illustrate these
differences.
Microsoft – with some minor excursions into the Xbox game machine and peripherals –
develops operating systems and traditional applications. Microsoft has multiple operating
systems, and its engineers are hard at work on the company’s next-generation of operating
systems. Microsoft does not design or make its own hardware. Its operating systems are coded,
for example, for processors that evolved from the Intel chips for personal computers. Recently
Microsoft embarked on a new path with its game machine, the Xbox 360. The new Xbox uses a processor from IBM’s PowerPC family, variants of which also power the Macintosh computer, the Sony PS3, and Nintendo’s next-generation game machine. Microsoft’s applications run on
Microsoft operating systems, although a version of Microsoft Office and Internet Explorer run
on Apple’s Macintosh.
In addition, Microsoft buys hardware from various suppliers to run its online systems. Most of
these suppliers, not surprisingly, are certified by Microsoft. Examples include Microsoft’s use
of Dell Computers. Microsoft’s engineers use these machines in configurations required by the
Microsoft operating systems and applications. For example, Microsoft servers often require a
load balancing feature. Microsoft implements its load balancing via software. When more
performance is required, Microsoft upgrades the hardware, adds memory, or shifts to higher-
speed hard drive technology instead of recoding the operating system itself to deliver higher
performance as Google does. Once a function is released to customers, Microsoft’s engineers
focus on stamping out bugs. Re-engineering a software application for higher performance is
not typically a priority.
Several observations are warranted:
   1   Unlike Google, Microsoft does not focus on performance as an end in itself. As a result,
       Microsoft gets performance the way most computer users do. Microsoft buys or
       upgrades machines. Microsoft does not fiddle with its operating systems and their
       subfunctions to get that extra time slice or two out of the hardware.
   2   Unlike Google, Microsoft has to support many operating systems and invest time and
       energy in making certain that important legacy applications such as Microsoft Office or
       SQL Server can run on these new operating systems. Microsoft has a boat anchor tied to
       its engineers’ ankles. The boat anchor is the need to ensure that legacy code works in
       Microsoft’s latest and greatest operating systems.
   3   Unlike Google, Microsoft has no significant track record in designing and building
       hardware for distributed, massively parallelised computing. The mice and keyboards
       were a success. Microsoft has continued to lose money on the Xbox, and the sudden
       demise of Microsoft’s entry into the home network hardware market provides more
       evidence that Microsoft does not have a hardware competency equal to Google’s.




In terms of technology, Google has the hardware and software engineering expertise to build
applications rapidly, perform computationally-intensive applications quickly, and deliver
high-reliability services from low-cost, commodity hardware.
Yahoo! operates differently from both Google and Microsoft. As of mid-2005, Yahoo! is a direct competitor to Google for advertising dollars. Yahoo! has grown through acquisitions. In
search, for example, Yahoo acquired 3721.com to handle Chinese language search and
retrieval. Yahoo bought Inktomi to provide Web search. Yahoo bought Stata Labs in order to
provide users with search and retrieval of their Yahoo! mail. Yahoo! also owns
AllTheWeb.com, a Web search site created by FAST Search & Transfer. Yahoo! owns the
Overture search technology used by advertisers to locate key words to bid on. Yahoo! owns
Alta Vista, the Web search system developed by Digital Equipment Corp. Yahoo! licenses
InQuira search for customer support functions. Yahoo has a jumble of search technology;
Google has one search technology.
Historically Yahoo has acquired technology companies and allowed each company to operate
its technology in a silo. Integration of these different technologies is a time-consuming,
expensive activity for Yahoo. Each of these software applications requires servers and systems
particular to each technology. The result is that Yahoo has a mosaic of operating systems,
hardware and systems. Yahoo!’s problem is different from Microsoft’s legacy boat-anchor
problem. Yahoo! faces a Balkan-states problem.
There are many voices, many needs, and many opposing interests. Yahoo! must invest in
management resources to keep the peace. Yahoo! does not have a core competency in
hardware engineering for performance and consistency. Yahoo! may well have considerable
competency in supporting a crazy-quilt of hardware and operating systems, however. Yahoo!
is not a software engineering company. Its engineers make functions from disparate systems
available via a portal.
Google also acquires technology. A good example is Picasa, photo management software that runs on the user’s Windows PC. The program has been integrated with several of Google’s network-centric applications:
     1   Gmail. The user’s images can be uploaded and sent via email to friends, colleagues and
         family. A Picasa user without a Gmail account is able to register and receive a user
         name and password. The Gmail account can also be used, if the user wishes, for other
         Google services, including Fusion, which is Google’s personalized portal, and the
         search history function, which saves a registered user’s Google queries for later
         reference.
     2   Blog Publishing. The user can post pictures to a Google property, Blogger.com. The
         image publishing function is simplified to one or two clicks. Posting images on some
         Web log systems is beyond the expertise of many computer users.
     3   Image Printing. The user can send images to online photo processing services.




   [Screen shot of Picasa: one-click access to functions performed on the user’s local computer;
   a strip of recently-viewed images; and one-click access to network services available as part
   of the user’s virtual application.]
In sharp contrast to Yahoo’s approach, Google integrated the Picasa application into the
Googleplex. The “hooks” are painless to the user.8 Google has bundled into one free
application point-and-click solutions to make management of digital still images intuitive and
fluid. Yahoo!’s acquisitions, in general, are not woven into a seamless experience with other
Yahoo! services. Consider the 3721.com search system. That service remains a separate
Chinese language operation available from mostly non-English Yahoo pages. Google
constructs an application using some code on the user’s PC and other software running on the
Googleplex somewhere on the Internet.
These three companies, different in structure and technical focus, are on a collision course. Like vessels in an America’s Cup race, each is heading toward the same goal, but each is subject to forces difficult for its helmsman to control. Even though there is market space between the three, collisions are inevitable.


   8. Picasa requires a download. The installation process is smooth. Indexing speed was about
   five times faster than ACDSee’s image management program, a competitive product. With
   Picasa, Google’s technologists demonstrate a rapid, trouble-free installation and an intuitive
   interface.


The figure below provides an overview of the mid-2005 technical orientation of Google, Microsoft and Yahoo.




MSN, and by extension Microsoft Corporation, has a core competency in software. The company has grown from its operating system roots to provide a range of products for mobile devices, desktop and notebook computers, and enterprise-class servers. Looking forward, the company’s Dot Net technology is Microsoft’s framework for virtual applications. In some ways, Dot Net is a less-open version of the AJAX technology that Google uses in its Google Maps and Gmail products. Microsoft has expended great effort to push Windows downward to mobile devices and outward to network-centric computers in an effort to increase revenue. For Microsoft to continue to be the dominant force in software, the company must be able to capture a commanding share of the market for network-centric applications. However, Microsoft’s weakness (whether real or perceived) is its products’ vulnerability to security breaches. Patch after patch, problem after problem, and promise after promise have done little to bolster the firm’s credibility for delivering secure systems and software. Looking forward over the next 12 to 18 months, Microsoft’s prospects hinge on security, cost and its developer community. The growth of open source alternatives is hard proof that die-hard Microsoft users are willing to shift for security, cost savings and functionality. Microsoft has weaknesses that can be attacked by Google and other competitors.
Yahoo’s situation is typical of many American organizations. Most large US corporations are a hotch-potch of different systems, incompatible architectures and a Tower of Babel of data formats. For Yahoo to deliver specific markets to its advertisers, Yahoo must integrate information from disparate systems and be able to segment and deliver ads to those users efficiently. Yahoo is now spending money to break down the walls of its data silos and integrate its user data. If Yahoo cannot deliver narrowly segmented markets, advertisers may abandon Yahoo for services that offer more targeted marketing opportunities. After years of flirting with becoming a New Age America Online, Yahoo is beginning to behave like a traditional media company.


MSN and Yahoo! are becoming ad-supported, general-interest portals in the mould of America Online and Tiscali. In contrast, Google is focusing on applications that tie users to its Googleplex. The company’s focus on hardware and software engineering gives it a cost and performance advantage over MSN and Yahoo, among others competing in Web search. Google’s high-performance, homogeneous Googleplex means that the company does not struggle with some of the integration, performance and cost issues that bedevil Microsoft and Yahoo. Google may not be doing everything right from a computer science point of view, but compared with MSN and Yahoo, it is doing less wrong.

The Technology Precepts
Google’s technology uses concepts and techniques from the leading edge of computer science.
Most of these innovations are difficult to explain to engineers steeped in traditional approaches
to massively distributed, highly parallelised computing. The eclectic footnotes and references
in the earlier BackRub paper have been sharpened in Google’s later technical presentations.
Readers without a first-hand understanding of NOW-Sort, River, and BAD-FS are unlikely to
craft dinner conversation from Google’s explanations of the influence of these research
computing demonstrations.9
For the purposes of this monograph and understanding the nature of Google’s technology, five
precepts thread through Google’s technical papers and presentations. The following snapshots are extreme simplifications of complex, yet fundamental, aspects of the Googleplex.

Cheap Hardware and Smart Software
Google’s use of commodity hardware for high-demand, 24x7 systems has existed as a core
precept since 1996. Most of its competitors’ online systems combine branded hardware from
IBM, Sun Microsystems, Hewlett-Packard, and Dell Computers with specialized peripherals.
The operating systems in use are a combination of Unix and Microsoft operating systems with
some Linux and open source components.
Google approaches the problem of reducing the costs of hardware, set-up, burn-in and maintenance pragmatically. A large number of cheap devices using off-the-shelf commodity controllers, cables and memory reduces costs. But cheap hardware fails.
To minimize the “cost” of failure, Google conceived of smart software that performs whatever tasks are needed when hardware devices fail. A single device or an entire rack of devices can crash, and the overall system will not fail. More important, when such a crash occurs, no full-time systems engineering team has to perform technical triage at 3 a.m.
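
A minimal sketch of the smart-software idea, in Python, appears below. It assumes a hypothetical coordinator that simply retries a unit of work on the next replica when a cheap machine dies; the class and function names are invented for illustration and do not correspond to Google’s internal systems.

    # Illustrative sketch: retry a unit of work on another machine when the
    # first one fails, so nobody has to perform triage at 3 a.m.

    class Worker:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

        def run(self, task):
            if not self.healthy:                    # commodity hardware fails
                raise RuntimeError(f"{self.name} is down")
            return f"{task} completed on {self.name}"

    def run_with_failover(task, workers):
        """Try the task on each replica until one succeeds."""
        for worker in workers:
            try:
                return worker.run(task)
            except RuntimeError:
                continue                            # the software shrugs and moves on
        raise RuntimeError("all replicas failed")

    # The first two boxes have died; the third quietly picks up the work.
    workers = [Worker("pizza-box-1", healthy=False),
               Worker("pizza-box-2", healthy=False),
               Worker("pizza-box-3")]
    print(run_with_failover("index-shard-42", workers))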

   9. See, for example, Andrea C. Arpaci-Dusseau et al., “High-Performance Sorting on Networks
   of Workstations,” in Proceedings of the 1997 ACM SIGMOD International Conference on
   Management of Data, Tucson, Arizona, May 1997; and John Bent et al., “Explicit Control in a
   Batch-Aware Distributed File System,” in Proceedings of the 1st USENIX Symposium on
   Networked Systems Design and Implementation, March 2004.


The focus on low-cost, commodity hardware and smart software is part of the Google culture.
In one presentation at a December 2004 technical conference, a Google spokesman joked that
anyone in the room could buy the same hardware that Google uses at Fry’s Electronics, a
retail chain with stores in Palo Alto and other cities in California.

Logical Architecture
Google’s technical papers do not describe the architecture of the Googleplex as self-similar, but they provide tantalizing glimpses of an approach to online systems that makes a single server share the features and functions of a cluster of servers, a complete data center, and a group of Google’s data centers.
The diagram below shows a representation of the Googleplex’s tightly organized, highly regular organization of files, servers, clusters, and more than two dozen data centers in a stable organizational pattern.10
   [Figure: a Sierpinski triangle illustrating the Googleplex’s self-similar design. The
   Googleplex is a larger instance of the organization of a single pizza box server; a data
   centre uses the same design and is composed of racks; a single Google cluster embodies the
   same organizing principle as a single pizza box server; a single replicated Google file
   reflects the controlling organizing principle.]
The diagram illustrates that Google’s technical infrastructure is similar at every level of the Googleplex. The collection of servers running Google applications on the Google version of Linux is, in effect, a supercomputer. The Googleplex can perform mundane computing chores like taking a user’s query and matching it to documents Google has indexed. Furthermore, the Googleplex can perform the side calculations needed to embed ads in the results pages shown to users, execute parallelised, high-speed data transfers like computers running state-of-the-art storage devices, and handle the necessary housekeeping chores for usage tracking and billing.


      10.The illustration is a Sierpinski Triangle, chosen because it conveys how each component
     in Google’s infrastructure replicates other larger combinations of servers and data centers. The
     overall structure – in this illustration an equilateral triangle – expresses the stability of the
     Google approach to its system. This famous fractal connotes how Google scales without
     altering the micro or macro structure of the Googleplex.



What is of interest is that Google does this with low-cost commodity hardware running Google’s version of Linux. Google has infused the Googleplex with logic that allows software to handle data recovery, to streamline messages passed from server to server, and to grab additional computing resources in order to complete a job quickly. When Google needs to add processing capacity or additional storage, Google’s engineers plug in the needed resources. Due to self-similarity, the Googleplex can recognize, configure and use the new resources. Google has almost unlimited flexibility with regard to scaling and accessing the capabilities of the Googleplex. Unlike a collection of different building materials, Google’s approach delivers a homogeneous computing system.
A good example is bringing a new rack of 40 or more pizza box servers online and creating one of the many types of servers Google uses.11 Servers, according to the fractal architecture,
consist of two or more clusters of pizza boxes. A cluster allows data to be replicated and work
shared among pizza boxes with spare capacity. A rack is assembled and then Google’s pizza
box servers are “plugged in.” Cables are attached among the pizza boxes and the rack is then
plugged into a network hub. An engineer turns on the power, and the other devices become
aware of the new rack’s resources. Master servers – Google’s term for the pizza box that is in
charge of one or more clusters – instruct other servers to copy data to the new cluster and begin
using the clusters to do work.
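
The sketch below illustrates the general bring-up flow just described: a new cluster reports in, and a master schedules copies of under-replicated data onto it. The class names and the three-copy target are illustrative assumptions, not Google’s actual protocol.

    # Illustrative sketch: a master notices a new cluster and copies
    # under-replicated chunks of data onto the fresh capacity.

    class Master:
        def __init__(self):
            self.clusters = {}                      # cluster name -> set of chunk ids

        def register(self, cluster_name):
            """A freshly racked cluster powers on and reports in."""
            self.clusters[cluster_name] = set()
            self.rebalance(cluster_name)

        def rebalance(self, new_cluster, copies_wanted=3):
            """Schedule copies of chunks that have too few replicas."""
            for name, chunks in self.clusters.items():
                if name == new_cluster:
                    continue
                for chunk in chunks:
                    holders = sum(chunk in held for held in self.clusters.values())
                    if holders < copies_wanted:
                        self.clusters[new_cluster].add(chunk)   # "copy" the data

    master = Master()
    master.clusters["cluster-a"] = {"chunk-1", "chunk-2"}
    master.register("cluster-b")               # new rack plugs in, work flows to it
    print(master.clusters["cluster-b"])        # both chunks now replicated there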
In Google’s self-similar architecture, the loss of an individual device is irrelevant. In fact, a
rack or a data center can fail without data loss or taking the Googleplex down. The Google
operating system ensures that each file is written three to six times to different storage devices.
When a copy of that file is not available, the Googleplex consults a log for the location of the
copies of the needed file. The application then uses that replica of the needed file and
continues with the job’s processing. Redundancy and other engineering tweaks to Linux give the Googleplex ways to eliminate or reduce the bottlenecks associated with traditional online computer systems. The Google technical recipe includes distributed computing, optimized file handling, and embedded logic to make the servers working on tasks smarter.
This architecture allows Google to expand its computational capacity, its storage and its
supported applications with an ease and price point rivals cannot easily match. According to
Jeff Dean, one of Google’s senior engineers, “At Google, everything is about scale.”12

Speed and Then More Speed
Google Search is fast, with most results coming back to the user in less than one second. In commercial data centers, speed has traditionally been achieved by buying high-end, high-performance hardware from manufacturers such as Sun Microsystems and by using advanced storage devices connected to the servers by exotic fibre optics.


   11.Data centers use computer cases that are shaped like the boxes used to hold pizzas. The
   term pizza boxes has been appropriated by engineers to describe one of the standard form
   factors for servers housed in rack mounts in data centers.
   12.Statement made at the University of Washington, October 2004


Not Google. Google uses commodity pizza box servers organized into clusters. A cluster is a group of computers joined together to create a more robust system. Instead of using exotic servers with eight or more processors, Google generally uses servers that have two processors similar to those found in a typical home computer.
Through proprietary changes to Linux and other engineering innovations, Google is able to
achieve supercomputer performance from components that are cheap and widely available.
The table below provides some data from 2002 about the speed with which Google can read
data from hard drives:13




              [Table not reproduced: 2002 read rates for two Google clusters, from “The Google
              File System.”] These data show the results of the two clusters’ performance.
              Google’s read throughput has gone up since 2002. Based on increases in commodity
              drive throughput, Google’s read rate may now be close to 2,000 megabytes per
              second, a figure that may reflect Google watchers’ enthusiasm boosting
              already-robust numbers.

To put these data in the context of 2002 technology, consider that an IBM EXP3 storage device available in 2002 could read data in burst mode at a rate of about 58 megabytes per second. Google’s read rate in 2002 averaged ten times the read rate of the IBM EXP3. The write rate is comparable. The cost of a single IBM EXP3 in 2002 was about $18,000 for 360 gigabytes of storage, excluding controller and cables. Google’s cost for comparable storage and the higher performance was about $1,000. For greater speed, Google spends less. In a world of ever-increasing demands for speed and storage, Google has a strong one-two punch.14 Advances in commodity storage devices translate to even faster performance for Google. Google has not updated its read rate data, but engineers familiar with Google believe that read rates may in some clusters approach 2,000 megabytes a second. When commodity hardware gets better, Google runs faster without paying a premium for the performance gain.
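
Taking the 2002 figures above at face value, a rough cost-per-throughput comparison looks like this:

    # Back-of-the-envelope comparison using the 2002 figures cited above.
    ibm_cost, ibm_read_mb_s = 18_000, 58             # IBM EXP3: ~$18,000, ~58 MB/sec burst
    google_cost = 1_000                              # comparable commodity storage
    google_read_mb_s = ibm_read_mb_s * 10            # "ten times the read rate"

    print(round(ibm_cost / ibm_read_mb_s))           # ~$310 per MB/sec of read throughput
    print(round(google_cost / google_read_mb_s, 2))  # ~$1.72 per MB/sec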
Google engineers for computational speed. Google’s approach has been to focus on making its software engineering produce the turbocharged performance. Speed is crucial to Google’s PageRank and other analytic processes. If Google’s computational throughput were slow, Google could not perform the work needed to know that, for a particular query, a particular set of indexed Web pages is the best match.


     13.From “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak
     Leung (Google) ACM SOSP 2003 Conference Proceedings 1-58113-757-5/03/0010, page 12.
     14.With Google’s advanced programming tools, Google is able to increase the productivity of
     its engineers. Combined with hardware speed and performance, Google squeezes out more
     productivity by applying its engineering talents to application development. This is a one-two-
     three punch to which Google’s competitors have to respond.



Without fast response to a query, users would not be willing to run multiple queries and interact fluidly with Google’s applications.
Google does not mindlessly match the keywords in a user’s query to the terms in the Google index. Google’s approach is more subtle and computationally involved, although term matching is an important part of the process. Google reviews various scores or values produced by certain algorithms, then uses those values in other algorithms to find search results, identify the best match (Google’s “I’m Feeling Lucky” link), extract matching ads from its advertising server, and continuously update the values as Google users click on links. Once these query- and ad-matching processes are complete, Google displays the results page to the user, typically in less than one second across a public network.
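
The toy example below suggests the flavour of blending several precomputed signals into a single ranking rather than relying on term matching alone. The signal names and weights are invented for illustration; Google’s actual formula is not public.

    # Toy ranking: blend several precomputed signals instead of raw term matching.
    def score(doc, weights):
        return sum(weights[signal] * value for signal, value in doc["signals"].items())

    weights = {"term_match": 0.5, "pagerank": 0.3, "click_rate": 0.2}
    docs = [
        {"url": "a.example", "signals": {"term_match": 0.9, "pagerank": 0.2, "click_rate": 0.1}},
        {"url": "b.example", "signals": {"term_match": 0.6, "pagerank": 0.8, "click_rate": 0.7}},
    ]

    ranked = sorted(docs, key=lambda d: score(d, weights), reverse=True)
    print([d["url"] for d in ranked])   # b.example outranks a.example despite weaker term match
    print(ranked[0]["url"])             # the single best match, akin to "I'm Feeling Lucky"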
Google is a hot rod computer that can perform the basic mathematics needed to deliver most
search results in less than a half second, display maps with the speed of a dedicated desktop
application like Encarta, and look at a Web page matching a user’s query and, in some
applications, insert additional hyperlinks to related content before displaying the results page
to the user. The Googleplex does experience slowdowns. When these occur, the Googleplex allocates additional resources to eliminate the brownout.
Speed has many meanings at Google. Speed means that users can interact with the Google
products and services as if the Google application were running on a dedicated PC in front of
the user. Speed also means that Google must be able to expand its computational and storage
capacity quickly. Speed also means rapid development and deployment of new products.
Speed, like Google’s ability to scale, is a core functionality of the Googleplex.
Google applies its high-speed technology to search and to other types of servers. Among the
servers using Google’s go-fast technology are those shown below:

         Type                                        Function

 Advertising server   Delivers text and other paid advertisements for AdWords and AdSense.

 Chunkserver          Schedules and delivers blocks of data for further processing.

 Image servers        Serves images for Google Image, Print and Video services.

 Index server         The workhorse of search. Server handles search-and-retrieval.

 Mail server          Delivers the Gmail service.

 News server          Gathers, analyses and displays news.

 Web server           Orders results and makes them available to users.

What does the combination of go-fast technology plus multiple types of Google data allow the
company to do? Google can engage in fast new product development. One example is Google
Maps. Google developed a basic mapping product over the course of 2004. In late 2004,
Google purchased Keyhole. By June 30, 2005, Google had:
    1   Released a basic mapping product.


     2   Integrated information from Google Local in early 2005.
     3   Hooked Keyhole satellite imagery into Google Maps in early May 2005.
     4   Announced Google Earth in May 2005.
     5   Upgraded the system to integrate two dimensional point-to-point routes on top of
         satellite imagery.
     6   Demonstrated a function that accepts a query in another language, translates the results
         to the user’s language, and displays the data in a three-dimensional mode.
The image below shows that Google’s Map and Earth service pushes the functions of online
map and data integration to another level. In the span of several days, Google integrated
Keyhole technology, launched, upgraded and redefined online mapping services.15




 [Screen shot: the results of a Japanese-language Google Maps-Earth query for the location of
 Wendy’s restaurants in New York City. The addition of Japanese-language support, the
 three-dimensional view of the section of Manhattan where the user wants directions, and the
 integration of hot links, the two-dimensional map, and information about the restaurants were
 part of Google’s fast-cycle launch and enhancement program designed to beat Microsoft to market.]

Another key notion of speed at Google concerns writing the computer programs deployed to Google users. Google has developed shortcuts to programming. An example is the library of canned functions Google created to make it easy for a programmer to optimize a program to run on the Googleplex computer. At Microsoft or Yahoo, a programmer must write or fiddle with code to get different pieces of a program to execute simultaneously across multiple processors.
     15.The source for this image was http://blog.eee-craft.com/archives/23345086.html.



70                                                                                       The Google Legacy
Chapter Three: Google Technology




Not at Google. A programmer writes a program, uses a function from a
Google bundle of canned routines, and lets the Googleplex handle the details. Google’s
programmers are freed from much of the tedium associated with writing software for a
distributed, parallel computer. What does increased programmer productivity mean? In terms
of money, Google makes each engineering dollar go farther. If a single programmer can reduce
by 10 percent the time required to code a program, the savings could be several thousand
dollars. If a programmer can slash coding time in half, Google gets twice the potential
productivity out of each of its 3,000-plus programmers.16
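
The sketch below hints at what such a canned routine buys the programmer, using Python’s standard multiprocessing pool as a stand-in for Google’s internal libraries, which this monograph does not reproduce: the programmer supplies the per-record logic, and the library spreads the work across processors.

    # Stand-in for a "canned" parallel routine: the library, not the programmer,
    # handles the plumbing that spreads work across processors.
    from multiprocessing import Pool

    def count_words(line):
        return len(line.split())

    lines = ["the googleplex handles the details",
             "the programmer writes ordinary code"]

    if __name__ == "__main__":
        with Pool() as pool:
            counts = pool.map(count_words, lines)   # parallel map, no hand-written threading
        print(sum(counts))                          # 10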

Eliminate or Reduce Certain System Expenses
Some lucky investors jumped on the Google bandwagon early. Nevertheless, Google was
frugal, partly by necessity and partly by design. The focus on frugality influenced many
hardware and software engineering decisions at the company. Spending money wisely does
not mean cheaply. Examples of how Google eliminates or reduces certain system expenses
include:
   • Google eliminates the costs associated with backing up and restoring data when a
      hardware failure occurs. The fractal principle requires that Google replicate data three to
     six times elsewhere in the Googleplex. When a device fails, the “master server” for a
     task looks at a file that tells where the other copies of the data or the programs are. The
     “master server” then uses those data or those processes to complete a task. No tape, no
     human intervention, and no downtime; Google does not have these costs due to its
     engineering acumen.
   • Google does not have to certify new hardware. When additional storage or
     computational capacity is required, Google technicians assemble one or more racks of
     Google “pizza boxes.” Once in the rack, the Googleplex recognizes the new resources in
     a way that is similar to how a laptop knows when a user plugs in a USB mouse. The
     expensive certification processes otherwise required for some high-end hardware are
     eliminated. Google engineers plug in resources and let the Googleplex handle the other
     tasks.
   • Google innovation uses open source code as a starting point. Many of Google’s most
     striking technical advances are based on modifying open source software to benefit from
     insights gained from experimental results in supercomputing. Google does not have to
     work around known bottlenecks in some commercial operating systems. Unlike
     Microsoft, Google did not write a complete operating system for its Googleplex. Google
     made key changes to Linux, adding necessary services and functions to meet the specific
      requirements of Google applications. Google’s approach is pragmatic and less time-consuming


   16.Some Google programmers have complained about the peer pressure to perform. Google
   management faces a challenge in managing its programming talent. Staff burn out or defections
   could impair Google’s technical resources.


        than Microsoft’s “death march” to get Longhorn shipped by late 2006.
       Compared with Yahoo, Google’s approach is more cohesive. Yahoo faces integration
       drudgery as a result of its multiple systems and heterogeneous hardware and data.
       Google has used Linux, standards, and open source software for virtually all of its core
       services and thus spends less time pounding disparate systems and data into a standard
       type.17
     • Google does not spend money for high-performance devices to make its system perform
       faster.
To illustrate the financial payoff from the use of commodity hardware, Google engineers
revealed a back-of-the-envelope calculation. Although dated, it underscores the economies of
the Google approach:18
       The cost advantages of using inexpensive, PC-based clusters over high-end
       multiprocessor servers can be quite substantial, at least for a highly parallelisable
       application like ours. For example, a $278,000 rack contains 176 2-GHz Xeon CPUs,
       176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based
       server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk
       space; it costs about $758,000. In other words, the multi-processor server is
       about three times more expensive but has 22 times fewer CPUs, three times less
       RAM, and slightly more disk space. Much of the cost difference derives from the
       much higher interconnect bandwidth and reliability of a high-end server, but again,
       Google’s highly redundant architecture does not rely on either of these attributes.
       [Emphasis added]
This means that when Microsoft or Yahoo! spends US$3.00 for better performance, Google spends less than US$1.00.19 Over time, competitors such as IBM, Microsoft or Yahoo may implement similar features in their network-centric services. Until then, Google has a cost advantage, at least with regard to scaling online operations. If these 2002 data can be accepted, Google spends one-third as much for more computing horsepower and disc space than companies using a traditional server architecture.
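
Restating the figures from the quotation above:

    # Ratios implied by the rack-versus-server figures quoted above.
    rack = {"cost": 278_000, "cpus": 176, "ram_gb": 176, "disk_tb": 7}
    server = {"cost": 758_000, "cpus": 8, "ram_gb": 64, "disk_tb": 8}

    print(round(server["cost"] / rack["cost"], 1))     # ~2.7x more expensive
    print(rack["cpus"] // server["cpus"])              # 22x the CPUs
    print(round(rack["ram_gb"] / server["ram_gb"], 2)) # 2.75x the RAM
    print(round(rack["cost"] / rack["cpus"]),          # ~$1,580 per CPU in the rack
          round(server["cost"] / server["cpus"]))      # ~$94,750 per CPU in the server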

Snapshots of Google Technology
Google engineers generate a large volume of technical information. Some of the data are in
the form of patents, often written in a style that communicates little of the patent’s substance
to a lay reader. The link for Google’s publications can shift unexpectedly.20


     17.Google does not explicitly state that it has embraced a services oriented architecture or SOA.
     However, many of Google’s practices illustrate an informed use of certain features of SOA.
      18.Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, “Web Search for a Planet: The Google
      Cluster Architecture”, IEEE Computer Society 0272-1732/03, March-April 2003.
     19.A review of Google’s cost estimates for this monograph revealed that Google is understating
     its cost advantage by one or two orders of magnitude. As the performance of commodity
     hardware goes up, the cost of that hardware goes down. Bulk purchasing chops as much as 50
     percent off the cost of some hardware. Google can replicate its data and give away free
     gigabytes of email storage. The cost to Google can be as low as a few cents a gigabyte.
     20.See http://labs.google.com/papers.html#compilers on June 1, 2005.



Exploring biographies of Google executives and Google Web logs can yield some useful technical
information. For example, one Google biography linked to more than 36 personal projects,
including one by Google’s CEO.21 Surprisingly, Google’s search engine does a hit-and-miss
job of indexing Google’s own technical information.
Useful engineering information appears on the Google Web site. The topics covered in various
monographs, white papers and technical notes concern a wide range of subjects. For example,
in mid-2005, papers were available on such topics as algorithms, compiler optimization,
information retrieval, artificial intelligence, file system design, data mining, genetic
algorithms, software engineering and design, and operating systems and distributed systems,
among others. Google explains its use of very large files as well as how the Google-modified
version of Linux automatically allocates work and avoids the file system bottlenecks that can
plague Solaris and Windows Advanced Server 2003, among others.
Google’s technical papers and Google patents provide some insight into areas of interest at
Google. For example, Google is posting more information about operating systems and
applications. The thrust of Google’s innovation is to build out the search platform and expand
the functionality of its backoffice programs such as those used for advertising services.
The annex to this monograph provides information about more than 60 patents for which
Google is believed to be the assignee. To provide a more fine-grained look at Google
technology, the table below identifies selected examples of innovations documented by
Google engineers or researchers close to the company. Most of these papers appeared prior to
Google’s receiving a patent for the technology referenced in these reports:

      Technology                          Purpose                              To Learn More

 Google Suggest             Helps users find needed information     Services Computing, 2004 IEEE
                            by analysing queries and suggesting     International Conference on (SCC'04) by
                            other queries.                          Stephen Davies, Serdar Badem,
                                                                    Michael D. Williams, Roger King
                                                                    September 2004.

 Video Object Search        User types an object name and Google    Ninth IEEE International Conference on
                            finds that object in a video.           Computer Vision Volume 2 Josef Sivic,
                                                                    Andrew Zisserman Publication Date:
                                                                    October 2003.

 MapReduce                  New functions in Google Linux to        OSDI Proceedings, December 2004.
                            speed programming and other
                            processes involving large data sets.

 Google File System         Extension to Google Linux to allow      ACM Publication 1-58113-757-5/03/
                            high-speed data reads and writes from   0010.
                            commodity drives.




   21.This is the lex project that “helps write programs whose control flow is directed by instances
   of regular expressions in the input stream. It is well suited for editor-script type transformations
   and for segmenting input in preparation for a parsing routine.”


  Identify Authoritative or   Uses pattern mining in order to         Seventh International Database
  High-Value Sources in       generate a numeric value to indicate    Engineering and Applications
  Web Content                 an authoritative source as an           Symposium (IDEAS'03) Haofeng Zhou,
                              indication of content quality.          Yubo Lou, Qingqing Yuan, Wilfred Ng,
                                                                      Wei Wang, Baile Shi July 2003.

  MetaCrystal                 Metasearch technology to allow a        Second International Conference on
                              single query to retrieve and organize   Coordinated & Multiple Views in
                              results in a visual display.            Exploratory Visualization (CMV'04)
                                                                      Anselm Spoerri July 2004.


Drawbacks of the Googleplex
The coaching mantra “no pain, no gain” applies to Google. Google does make mistakes, and some big ones. The example fresh in news headlines is Web Accelerator. The product was introduced in May 2005 and withdrawn less than six weeks later. Speed and nimbleness aside, Web Accelerator was technology that ran head-on into “issues.” Of greater consequence are the periodic slowdowns for Gmail. The Googleplex is scalable, but until more servers are online, users may face annoying delays.

Going Too Fast: The Google Web Accelerator
The Web Accelerator software was supposed to use Google servers to store Web pages a user viewed. Web Accelerator parsed a page in the user’s browser. The Web Accelerator function then followed each link on that specific page, and each linked page was stored in a Google cache. When the user clicked on a link, the user would see the page from the Google cache, thus reducing the time required to display the page to the user.
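To make the mechanism concrete, the toy sketch below shows what a link-following prefetcher of this kind does. It is not Google's code: the fetch_html callable is a hypothetical stand-in for an HTTP GET, and only Python's standard html.parser is used to pull anchors out of a page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def prefetch(page_url, page_html, cache, fetch_html):
    """Naive prefetch: follow every link on the page and cache the response.

    `fetch_html` is a hypothetical callable that performs an HTTP GET and
    returns the body. Nothing here distinguishes a harmless navigation link
    from a state-changing link such as '/item/42/delete', which is the kind
    of link that caused trouble for Web Accelerator.
    """
    parser = LinkExtractor()
    parser.feed(page_html)
    for href in parser.links:
        target = urljoin(page_url, href)
        if target not in cache:
            cache[target] = fetch_html(target)  # any side effects happen here
    return cache
```

A more cautious prefetcher would skip links that look like state-changing actions and honour hints such as rel="nofollow"; the sketch deliberately omits those checks to mirror the behaviour described next.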
Web Accelerator worked fine on sites such as www.whitehouse.gov, which makes minimal use of advanced Web services. Unfortunately, the Web Accelerator function followed links that transmitted instructions to Web applications. For example, Web Accelerator would click on “delete” links, causing some Web applications such as Backpack to remove the user’s preferences or content.22 Web Accelerator blithely ignored confirmations generated by JavaScript, so unintentional instructions were transmitted. Some Google watchers raised questions about caching data as well as privacy and copyright issues. Before these concerns reached a crescendo, Google reported that Web Accelerator had reached its capacity. Google blocked downloads for the product.

   22. Backpack is a Web application that sends a user the contents of any page as email. See
   www.backpackit.com.

The Laws of Physics: Heat and Power 101
Google does not reveal the number of servers it uses, but the number is believed to be in the
150,000 to 170,000 range as of June 30, 2005. Conflicting information surfaces in Web logs
and in talks at conferences. In reality, no one knows. Google has a rapidly expanding number
of data centers. The data center near Atlanta, Georgia, is one of the newest deployed. This
state-of-the-art facility reflects what Google engineers have learned about heat and power
issues in its other data centers. Within the last 12 months, Google has shifted from
concentrating its servers at about a dozen data centers, each with 10,000 or more servers, to
about 60 data centers, each with fewer machines.23 The change is a response to the heat and
power issues associated with larger concentrations of Google servers.
The most failure-prone components are:
   • Fans.
   • IDE drives, which fail at a rate of one per 1,000 drives per day.
   • Power supplies, which fail at a lower rate.
Repairs are batch operations; a rough sense of the arithmetic appears in the sketch below. Scheduling the fixes is a major job, and work is underway to improve the Google-developed scheduling capability. Google has to locate hosting facilities that can meet the company’s heat and power requirements.
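To see why repairs become batch operations, put the drive-failure rate quoted above next to the server counts estimated earlier in this section. The snippet below assumes, purely for illustration, roughly 150,000 commodity drives, one per server; the real figure is not public.

```python
def expected_daily_failures(num_drives, failures_per_thousand_per_day=1.0):
    """Expected drive failures per day at the rate quoted in the text."""
    return num_drives * failures_per_thousand_per_day / 1000.0


# Illustrative assumption only: about 150,000 servers with one IDE drive each.
fleet_drives = 150_000
print(expected_daily_failures(fleet_drives))  # -> 150.0 failed drives per day
```

At that scale the quoted rate implies on the order of 150 dead drives every day, which makes per-incident service calls impractical and scheduled sweeps through a facility the economical choice.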

Other Data Center Issues
Google data centers have access to multiple high-speed lines and normal data center functions such as redundant power, traffic routing and strict rules governing access to the physical boxes.
PRWeaver’s Web log contained a posting of a photograph allegedly taken inside a Google data center. If the photograph is genuine, the physical layout of the racks, holding an estimated 2,000 or more servers, squeezes a large amount of hardware into a tightly packed space. This type of dense configuration helps explain the comments about Google’s heat and power concerns. Most data centers were not designed to handle dense concentrations of thousands of servers. Heat contributes to hard drive failures. On the plus side, the dense configuration makes setup and maintenance somewhat easier. Google packs servers on two sides of a rack.
A unique property of the data centers is that replicated content can be written from one data
center to another. Google data within the data center are replicated on other servers and other
clusters running in the racks.
The Google “plug and play” engineering philosophy appears to be used in and across data centers. If a data center, such as the one in the photograph described above, needs additional index server capacity, the technicians in that center can build a Google rack of 40 pizza box servers. These servers are connected to the network. When the rack is powered up, it becomes available to the master servers for that data center. These master servers mark the rack’s resources as available and begin sending work to the new devices. The information about data centers indicates that this “plug and play” concept and automatic discovery of new resources applies to new data centers, not just the racks within them.

   23. These data appear at www.mcdar.net/SEOTools.htm
It may be an exaggeration to say that a Google rack, or the data center in which the rack resides, works like a USB mouse, but that general concept seems to be what Google engineers have tried to achieve. By eliminating such tasks as certifying and configuring Small Computer System Interface RAID storage devices, Google is content to let the auto-discovery functionality alert a “master server” to a new resource, master servers to alert other master servers, masters to notify clients of tasks, and data centers to pass along word that racks, clusters or a new data center are available for use.
As a Google engineer said, “Wherever we put a cluster, we have heat, cooling and power
issues. When we put in a data center, that data center operator faces new challenges. We use
each day four megawatts of electric power.”
The problems include:
     1   Heat. Special racks with fans that cool the core of the rack are used.
     2   Power. The power demand at load is greater than data centers typically sustain. “Our
         cages are custom built and there’s a lot of work done by us and the data center people
         before we can flip the switch,” said Jeff Dean, a senior Google engineer.
     3   Network management tools. Google has had to create network management tools to
         manage its self-healing, automatic failover operating system.

What’s Up, Sergey?
The Google data centers are concentrated in North America with other data centers located in
Switzerland, the Pacific Rim, and Beijing.24
Because the GOS is self-healing, the operating system and the various “master computers” in a
cluster know what device is online and what device is dead. Off-the-shelf network
management tools are not tailored to Google’s requirements. Therefore, Google is developing
network management and monitoring tools so that the information in the Google operating
system log files can be displayed in a meaningful way to Google network engineers.
The overall Googleplex works and continues working even if a device, rack or data center
goes dark or dies. Network management tools have to provide a broad range of monitoring
and support functions for the global network, devices, data flows, workloads and potential problem areas. Google is developing the needed network management tools specifically for the Googleplex.
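One generic way for a master computer to know which device is online and which is dead is to track heartbeats and treat prolonged silence as failure. The snippet below illustrates only that idea; it is not a description of Google's tooling, and the device names and timeout are invented.

```python
import time


class HeartbeatMonitor:
    """Marks a device dead if it has not reported within `timeout` seconds."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, device_id, now=None):
        self.last_seen[device_id] = time.time() if now is None else now

    def status(self, now=None):
        now = time.time() if now is None else now
        return {
            device: ("online" if now - seen <= self.timeout else "dead")
            for device, seen in self.last_seen.items()
        }


monitor = HeartbeatMonitor(timeout=30.0)
monitor.heartbeat("rack-17/server-03", now=0.0)
monitor.heartbeat("rack-17/server-04", now=0.0)
monitor.heartbeat("rack-17/server-03", now=60.0)  # only one server keeps reporting
print(monitor.status(now=61.0))  # server-03 is online, server-04 is dead
```

A monitoring display of the kind described above is then largely a matter of aggregating such status maps across racks, clusters and data centers.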




     24.The Beijing data center was purpose built to conform to the ruling body’s requirements for
     online access, monitoring and related issues. Google complied in order to do business in China.
     Yahoo! bought 3721.com in order to accelerate its effort in China.







Unanticipated Faults Could Derail Google’s Juggernaut
Google’s network uses a number of concepts from the fringes of computer innovation as well as hands-on knowledge gained from the Googleplex itself. The result is a highly resilient network that may breed problems not previously encountered. Although Google has operated for more than five years without downtime from system failure, the possibility – however remote – does exist that something unanticipated could occur. A sufficiently large problem could deal Google a severe blow. The advanced technology of Google’s MapReduce tool and its 400-module library could pose as yet unforeseen technical problems.




  The diagram shows how Google’s approach eliminates the bottleneck in parallelized systems
  produced by excessive message traffic flowing through a server coordinating work among different
  computers. This is a diagram produced by Google engineers.
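For readers who have not encountered the MapReduce model mentioned above, the toy example below conveys its shape: a map function emits key/value pairs, the framework groups them by key, and a reduce function folds each group into a result. This is a single-machine sketch of the programming model only, not Google's library; the real system distributes the phases across many machines and handles the failures discussed in this chapter.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_phase(word, counts):
    """Reduce: fold all counts for one word into a total."""
    return word, sum(counts)


def mapreduce(documents):
    # Shuffle step: group intermediate values by key, as the framework would.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


docs = {
    "d1": "cheap hardware smart software",
    "d2": "smart software scales on cheap hardware",
}
print(mapreduce(docs))  # {'cheap': 2, 'hardware': 2, 'smart': 2, 'software': 2, ...}
```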



Summary of Google’s Drawbacks
Critics of Google can point to three “problems” with Google’s approach to performance.
First, Google is a one-trick pony. The changes to Linux and the other technical modifications
are little more than hackers’ attempts to squeeze a small performance gain.
Second, Google’s use of commodity hardware and cheap storage is a risky solution. Unknown
problems may lurk when cheap components are used in a mission-critical system. Increasing
the potential risk are the changes Google makes to speed up program execution.








Finally, other operating systems – including those from computer research laboratories and
even Microsoft – do the same things and have for years.

Leveraging the Googleplex
Google has demonstrated that search is just one application that can run in the Google
environment. There are many other applications that can benefit from Google’s approach to
online services.
   1   Applications that require a high performance payoff at low cost, such as electronic mail.
   2   Applications that can run in Google’s redundant environment, which lacks the private-state replication found in IBM’s AS/400 operating environment and others.
   3   Computationally-intensive, stateless applications.
   4   Applications that exhibit request-level parallelism, a characteristic exploitable by running individual requests on separate servers, as Google Earth does (a minimal sketch follows this list).
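Request-level parallelism, named in item 4 above, is easy to demonstrate in miniature: because each request is independent, requests can be fanned out to separate workers with no coordination among them. The sketch below uses a local thread pool in place of separate servers and an invented handle_request function; it illustrates the principle, not Google's architecture.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(query):
    """Stand-in for the work one front-end server does for one request."""
    return f"results for {query!r}"


queries = ["pizza near palo alto", "sierpinski triangle", "commodity servers"]

# Each request is independent, so the requests can run on separate workers
# (threads here, separate servers in a real deployment) with no shared state.
with ThreadPoolExecutor(max_workers=3) as pool:
    for answer in pool.map(handle_request, queries):
        print(answer)
```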
There is little to be gained by trotting out war-horses to trample Google. The user experience
speaks for itself. Google’s approach to massively-parallel distributed computing works, even
on dial-up networks.
Google fused the type of thinking associated with small, cash-strapped companies with
techniques from advanced computer systems. Commodity products keep costs down. A
modified Linux delivers fast performance at a bargain basement cost. Google is taking a
strategic risk with commodity hardware and a souped up version of Linux. Each day Google
bets that its technologists can keep the system humming.
Another reason why Google’s approach to technology is paying off is that Google employs the same pragmatism and cleverness in application development. Google uses standard engineering practices, proprietary knowledge, and off-the-shelf techniques such as Web services. Google uses the same Web programming techniques that millions of Web
developers use. The payoff is that it is easy for Google to hire people who can code for the
Googleplex. Google so far has not had to spend money for developer marketing programs or
train new hires to work in the Googleplex.
The biggest boost to Google’s technical approach is that its competitors are following
different, more expensive approaches. Yahoo is a fruit cake of hardware, operating systems,
and applications coded at different times in different languages by different people. Microsoft
uses its own operating systems but relies on other operating systems as well, including Solaris.
Microsoft must invest in hardware to squeeze performance out of its platforms. Yahoo
wrestles with its many different platforms. Microsoft seems powerless to enhance the speed of
its operating system. Both are digital ostriches burying their heads in their own marketing
material.







Google’s technology is one major challenge to Microsoft and Yahoo. So to conclude this
cursory and vastly simplified look at Google technology, consider these items:
   1   Google is fast anywhere in the world.
   2   Google learns. When the heat and power problems at dense data centers surfaced, Google introduced cooling and power conservation innovations across its data centers.
   3   Programmers want to work at Google. “Google has cachet,” said one recent University
       of Washington graduate.
   4   Google’s operating and scaling costs are lower than those of most other firms offering similar services.
   5   Google squeezes more work out of programmers and engineers by design.
   6   Google does not break down, or at least it has not gone offline since 2000.
   7   Google’s Googleplex can deliver desktop-server applications now.
   8   Google’s applications install and update without burdening the user with gory details
       and messy crashes.
   9   Google’s patents provide basic technology insight pertinent to Google’s core
       functionality.
A young programmer in Osaka or Beijing is very likely to have been influenced by Google.
Skilled programmers want to work at Google, develop for the Googleplex, and, if possible,
create their own Google killer. The mantra is, “Be like Sergey and Larry”.
Google has a next-generation computing platform. That platform is optimised to deliver
virtual applications to its users worldwide. Google uses standard Web technologies in clever
ways. Although the technical challenges facing Google are formidable, the company has
advanced the art of online computing.





Contenu connexe

Similaire à Google technology

Google Cloud Platform - Introduction & Certification Path 2018
Google Cloud Platform - Introduction & Certification Path 2018Google Cloud Platform - Introduction & Certification Path 2018
Google Cloud Platform - Introduction & Certification Path 2018Pavan Dikondkar
 
Deep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big DataDeep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big DataTu Le Dinh
 
Big data on google cloud
Big data on google cloudBig data on google cloud
Big data on google cloudTu Pham
 
GPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsSuleiman Shehu
 
Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...
Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...
Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...Google Cloud Platform - Japan
 
Designing Distributed Systems: Google Cas Study
Designing Distributed Systems: Google Cas StudyDesigning Distributed Systems: Google Cas Study
Designing Distributed Systems: Google Cas StudyMeysam Javadi
 
GCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online CourseGCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online CourseJayanthvisualpath
 
Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26clive boulton
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)bigdata trunk
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediAnimesh Chaturvedi
 
Seminar Report on Google File System
Seminar Report on Google File SystemSeminar Report on Google File System
Seminar Report on Google File SystemVishal Polley
 
MongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the EnterpriseMongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the EnterpriseMongoDB
 
App Engine Application for Detecting Similar Files in Google Drive
App Engine Application for Detecting Similar Files in Google DriveApp Engine Application for Detecting Similar Files in Google Drive
App Engine Application for Detecting Similar Files in Google DriveIRJET Journal
 
Google Final Draft With Kts
Google Final Draft With KtsGoogle Final Draft With Kts
Google Final Draft With KtsJoseph Teye-Kofi
 
BigData Meets the Federal Data Center
BigData Meets the Federal Data CenterBigData Meets the Federal Data Center
BigData Meets the Federal Data CenterAbe Usher
 
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016Chris Jang
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 

Similaire à Google technology (20)

Google Cloud Platform - Introduction & Certification Path 2018
Google Cloud Platform - Introduction & Certification Path 2018Google Cloud Platform - Introduction & Certification Path 2018
Google Cloud Platform - Introduction & Certification Path 2018
 
Deep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big DataDeep dive into Google Cloud for Big Data
Deep dive into Google Cloud for Big Data
 
Big data on google cloud
Big data on google cloudBig data on google cloud
Big data on google cloud
 
GPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale Analytics
 
Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...
Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...
Google Developers Summit Tokyo - Google Cloud Platform で知る Google クラウドの「Googl...
 
Designing Distributed Systems: Google Cas Study
Designing Distributed Systems: Google Cas StudyDesigning Distributed Systems: Google Cas Study
Designing Distributed Systems: Google Cas Study
 
GCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online CourseGCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online Course
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
 
Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26
 
JAM23-24_ppt.pptx
JAM23-24_ppt.pptxJAM23-24_ppt.pptx
JAM23-24_ppt.pptx
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
 
Seminar Report on Google File System
Seminar Report on Google File SystemSeminar Report on Google File System
Seminar Report on Google File System
 
MongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the EnterpriseMongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
 
App Engine Application for Detecting Similar Files in Google Drive
App Engine Application for Detecting Similar Files in Google DriveApp Engine Application for Detecting Similar Files in Google Drive
App Engine Application for Detecting Similar Files in Google Drive
 
Google Final Draft With Kts
Google Final Draft With KtsGoogle Final Draft With Kts
Google Final Draft With Kts
 
BigData Meets the Federal Data Center
BigData Meets the Federal Data CenterBigData Meets the Federal Data Center
BigData Meets the Federal Data Center
 
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 

Dernier

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 

Dernier (20)

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 

Google technology

  • 1. Chapter Three: Google Technology Chapter Three: Google Technology “Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to product better search results.... Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.” – Sergey Brin and Lawrence Page, 19971 In the beginning, there was BackRub, the service that became Google. Today, Google is most closely associated with its PageRank algorithm. PageRank is a voting algorithm weighted for importance. The indicators of a Web page’s importance is the number of pages that link to a particular page. Messrs. Brin and Page soon added another factor which voted for the importance of a Web page. This idea was the number of people who click on a Web page. The more clicks on a Web page, the more weight that Web page was given. Over time, still other factors have been added to the PageRank algorithm; for example, the frequency with which content on a page is changed. Google’s PageRank technology is closely allied with Internet search. Voting algorithms are less effective in enterprise search, for instance. The attention given to Google and its search technology dominate popular thinking about the company. Google search is like a nova. The 1. From “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” www.- db.standord.edu/~backrub/google.html The Google Legacy 55
  • 2. Chapter Three: Google Technology luminescence makes it difficult for the observer to see other aspects of the phenomenon clearly or easily. Radiance aside, Google is a technology company.2 Some of that technology when described in technical papers such as the earliest one “The Anatomy of a Large-Scale Hypertextual Web Search Engine” is demanding. The later papers such as “MapReduce: Simplified Data Processing on Large Clusters” can be a slow read.3 Since Google is technology, explaining what Google does in an easily-digestible meal is difficult. The diagram below provides unauthorized snapshot of Google’s computing framework. b a d c Important Google technologies that underlie this diagram of the Googleplex include: [a] modifications to Linux to permit large file sizes and other functions so as to accelerate the overall system; [b] a distributed architecture that allows applications and scaling to be “plugged in” without the type of hands-on set-up other operating systems require; [c] a technical architecture that is similar at every level of scale; [d] a Web-centric architecture that allows new types of applications to be built without a programming language limitation. 2. The annex to this monograph contains a listing of more than 60 Google patents. The list is not all-inclusive; however, it does provide the patent number and a brief description for some of Google’s most important patents. The PageRank patent belongs to the trustees of Stanford University. Google’s patent efforts have focused on systems and methods for relevance, advertising, and other core foci of the company. Google is creating a patent fence to protect its interests. 3. Jeff Dean, former Alta Vista researcher and a Google senior engineer, has been an advocate of MapReduce. His most recent papers are available on his Web page at http:// labs.google.com/people/jeff/. 56 The Google Legacy
  • 3. Chapter Three: Google Technology Google’s technology has emerged from a series of continuous improvements or what Japanese management consultants call kaizan. Each Google technical change may be inconsequential to the average user of Google. But when taken as a whole, Google’s “technological advantage” comes from Google’s incremental innovations, clever adaptations of research-computing concepts, and Byzantine tweaks to Linux. Some day, a historian of technology will be able to identify, from the hundreds of improvements that Google has engineered in the last nine years, one or two that stand with PageRank as of major importance. Critics of Google will see that the company has grafted to its core technology processes from many different sources. To illustrate, the structure of Google’s data centers and the messages passed to and from these data centers is in many ways a variant of grid computing.4 Google’s ability to read data from many computers simultaneously is reminiscent of BitTorrent’s technology.5 Google’s use of commodity or “white box” hardware in its data centers is an indication of Google’s hacker ethos. The use of memory and discs to store multiple copies of data comes from the frontiers of computing. Google’s approach to technology, then, is eclectic and in many ways represents a building block approach to large-scale systems. Google benefits from that eclecticism in several ways. First, Google’s computational framework delivers sizzling performance from low-cost hardware. Second, Google worked around the bottlenecks of such operating systems as Solaris, Windows Advanced Server, and off-the-shelf Linux. Third, Google took good programming ideas from other languages, implementing new functions and libraries to eliminate most of the manual coding required to parallelise an application across Google’s servers.6 According to Jeff Dean, one of Google’s senior engineers, “Google engineering is sort of chaotic.”7 This is neither surprising nor necessarily a negative. The Googleplex is a toy box for engineers and programmers. The tools are sophisticated. The challenges of the problems and peers make Google “the place to be” for the best and brightest technical talent in the world. The nature of creativity combined with Google’s approach to innovation make it difficult to predict the next big thing from Google. Before reviewing selected parts of Google’s technology in somewhat more detail, the diagram “Google’s Computing Framework” provides an overview of the Googleplex and some of its technologies. These will be touched upon in this section. 4. Grid computing is applying resources from many computers in a network to a single problem or application. Google uses grid-like technology in its distributed computing system. 5. BitTorrent is a peer-to-peer file distribution tool written by programmer Bram Cohen in 2001.The reference implementation is written in Python and is released under the MIT License. 6. Google has anywhere from 100,000 to 165,000 or more servers. Servers are organized into clusters. Clusters may reside within one rack or across multiple racks of servers. Some Google functions are distributed across data centers. 7. From Dr Dean’s speech at the University of Washington in October 2003. See http:// www.uwtv.org/programs/displayevent.asp?rid=2459. The Google Legacy 57
  • 4. Chapter Three: Google Technology PageRank requires a lot of computing horsepower cycles to work. When Google got underway in 1996, Messrs. Brin and Page had limited computing horsepower. In order to make PageRank work, they had to figure out how to get the PageRank algorithm to run on garden-variety computers available to them. From the beginning – and this is an important issue with regards to Google’s almost-certain collision course with Microsoft – Google had to solve both software engineering and hardware engineering issues to make Google Search viable. In fact, when discussing Google technology, it is important to keep in mind that PageRank is important only because it can run quickly in the real world, not in a sterile computer lab illuminated with the blue glow of supercomputers. The figure Google’s Fusion: Hardware and Software Engineering shows that Google’s technology framework has two areas of activity. There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub one-second response times that Google is able to maintain despite its surging growth in usage, applications and data processing. Google’s Fusion: Hardware and Software Innovations The Google phenomenon comes from the fission occurring when PageRank’s software and hardware engineering interact. Google’s technology delivers super computer applications for mass markets. The other effort focuses on hardware. Google has refined server racks, cable placement, cooling devices, and data center layout. The payoff is lower operating costs and the ability to scale as demand for computing resources increases. With faster turnaround and the 58 The Google Legacy
  • 5. Chapter Three: Google Technology elimination of such troublesome jobs as backing up data, Google’s hardware innovations give it a competitive advantage few of its rivals can equal as of mid-2005. PageRank with its layering of additional computations added over the years is a software problem of considerable difficulty. The Google system must find Web pages and perform dozens, if not hundreds of analyses of those Web pages. Consider the links pointing to a Web page. Google must keep track of them for more than eight billion Web pages. For a single Web page with one link pointing to it, the problem is trivial. One link equals one pointer. But what happens when a site has 10,000 links pointing to it? The problem becomes many times larger and more computationally demanding. Some of these links are likely to come from sites that have more traffic than others. Some of the links may come from sites that have spoofed Google for fun or profit. The calculations to sort out the “value” of each of these links adds to computational work associated with PageRank. Keeping track of these factors is a big job. Sizing up different factors against one another for a single page can be hard without a calculator to help. Take the same task and apply it by a couple of billion Web pages, and the computing task becomes one for a supercomputer. Yet this task is everyday stuff for Google and its PageRank process. Users do not give much thought to what technology underpins a routine query or the 300 million queries Google handles each day. In a single second, Google’s technology handles around 340 queries in dozens of languages from users worldwide. Google’s technology cannot be separated from search. Search was the prime mover in the Google universe. Once Messrs. Brin and Page were able to fiddle with a limited number of commodity computers and make their PageRank algorithm work, Google was headed down a road that it still follows. The software requires a suitable hardware and network infrastructure in which to operate. Without Google’s hardware and software, there would be no Google. Hardware and software are inextricably linked at Google. With each new advance in software, Google’s engineers must make correspondingly significant advances in hardware. And when hardware engineers come up with an advance, the software engineers greedily use that advance to up the functionality of their software. What Google owns is its own snappy, turbocharged supercomputer, interesting software tools, and several thousand people trying to figure out what else the Googleplex can do. Some of the tinkerers come at the problem from bits and bytes, writing code, and weaving applications out of the available functions. The result is a brilliant product. Others come at the problem from the soldering iron and screwdriver angle. These engineers look for ways to build hardware and physical systems that can perform the calculations needed to make PageRank work. Google’s approach to data centers, the racks in the data centers, and the devices in the racks in the data centers is as clever as the company’s search system. The hardware has to be more than clever. The hardware has to work 24x7, under continuous load, and in locations from Switzerland to Beijing. The synergy between software and hardware is perhaps one of Google’s major accomplishments. The Google Legacy 59
  • 6. Chapter Three: Google Technology How Google Is Different from MSN and Yahoo Google’s technology is simultaneously just like other online companies’ technology, and very different. A data center is usually a facility owned and operated by a third party where customers place their servers. The staff of the data center manage the power, air conditioning and routine maintenance. The customer specifies the computers and components. When a data center must expand, the staff of the facility may handle virtually all routine chores and may work with the customer’s engineers for certain more specialized tasks. Before looking at some significant engineering differences between Google and two of its major competitors, review this list of characteristics for a Google data center. 1 Google data centers – now numbering about two dozen, although no one outside Google knows the exact number or their locations. They come online and automatically, under the direction of the Google File System, start getting work from other data centers. These facilities, sometimes filled with 10,000 or more Google computers, find one another and configure themselves with minimal human intervention. 2 The hardware in a Google data center can be bought at a local computer store. Google uses the same types of memory, disc drives, fans and power supplies as those in a standard desktop PC. 3 Each Google server comes in a standard case called a pizza box with one important change: the plugs and ports are at the front of the box to make access faster and easier. 4 Google racks are assembled for Google to hold servers on their front and back sides. This effectively allows a standard rack, normally holding 40 pizza box servers, to hold 80. 5 A Google data center can go from a stack of parts to online operation in as little as 72 hours, unlike more typical data centers that can require a week or even a month to get additional resources online. 6 Each server, rack and data center works in a way that is similar to what is called “plug and play.” Like a mouse plugged into the USB port on a laptop, Google’s network of data centers knows when more resources have been connected. These resources, for the most part, go into operation without human intervention. Several of these factors are dependent on software. This overlap between the hardware and software competencies at Google, as previously noted, illustrates the symbiotic relationship between these two different engineering approaches. At Google, from its inception, Google software and Google hardware have been tightly coupled. Google is not a software company nor is it a hardware company. Google is, like IBM, a company that owes its existence to both hardware and software. Unlike IBM, Google has a business model that is advertiser supported. Technically, Google is conceptually closer to IBM (at one time a hardware and software company) than it is to Microsoft (primarily a software company) or Yahoo! (an integrator of multiple softwares). 60 The Google Legacy
  • 7. Chapter Three: Google Technology Software and hardware engineering cannot be easily segregated at Google. At MSN and Yahoo hardware and software are more loosely-coupled. Two examples will illustrate these differences. Microsoft – with some minor excursions into the Xbox game machine and peripherals – develops operating systems and traditional applications. Microsoft has multiple operating systems, and its engineers are hard at work on the company’s next-generation of operating systems. Microsoft does not design or make its own hardware. Its operating systems are coded, for example, for processors that evolved from the Intel chips for personal computers. Recently Microsoft embarked on a new path with its game machine, the Xbox 360. The new Xbox uses a processor from IBM’s family of PowerPC chips also used in the Macintosh computer, the Sony PS/3, and Nintendo next-generation game machines. Microsoft’s applications run on Microsoft operating systems, although a version of Microsoft Office and Internet Explorer run on Apple’s Macintosh. In addition, Microsoft buys hardware from various suppliers to run its online systems. Most of these suppliers, not surprisingly, are certified by Microsoft. Examples include Microsoft’s use of Dell Computers. Microsoft’s engineers use these machines in configurations required by the Microsoft operating systems and applications. For example, Microsoft servers often require a load balancing feature. Microsoft implements its load balancing via software. When more performance is required, Microsoft upgrades the hardware, adds memory, or shifts to higher- speed hard drive technology instead of recoding the operating system itself to deliver higher performance as Google does. Once a function is released to customers, Microsoft’s engineers focus on stamping out bugs. Re-engineering a software application for higher performance is not typically a priority. Several observations are warranted: 1 Unlike Google, Microsoft does not focus on performance as an end in itself. As a result, Microsoft gets performance the way most computer users do. Microsoft buys or upgrades machines. Microsoft does not fiddle with its operating systems and their subfunctions to get that extra time slice or two out of the hardware. 2 Unlike Google, Microsoft has to support many operating systems and invest time and energy in making certain that important legacy applications such as Microsoft Office or SQLServer can run on these new operating systems. Microsoft has a boat anchor tied to its engineer’s ankles. The boat anchor is the need to ensure that legacy code works in Microsoft’s latest and greatest operating systems. 3 Unlike Google, Microsoft has no significant track record in designing and building hardware for distributed, massively parallelised computing. The mice and keyboards were a success. Microsoft has continued to lose money on the Xbox, and the sudden demise of Microsoft’s entry into the home network hardware market provides more evidence that Microsoft does not have a hardware competency equal to Google’s. The Google Legacy 61
  • 8. Chapter Three: Google Technology In terms of technology, Google has the hardware and software engineering expertise to build applications rapidly, perform computationally-intensive applications quickly, and deliver high-reliability services from low-cost, commodity hardware. Yahoo! operates differently from both Google and Microsoft. Yahoo! is in mid-2005 a direct competitor to Google for advertising dollars. Yahoo! has grown through acquisitions. In search, for example, Yahoo acquired 3721.com to handle Chinese language search and retrieval. Yahoo bought Inktomi to provide Web search. Yahoo bought Stata Labs in order to provide users with search and retrieval of their Yahoo! mail. Yahoo! also owns AllTheWeb.com, a Web search site created by FAST Search & Transfer. Yahoo! owns the Overture search technology used by advertisers to locate key words to bid on. Yahoo! owns Alta Vista, the Web search system developed by Digital Equipment Corp. Yahoo! licenses InQuira search for customer support functions. Yahoo has a jumble of search technology; Google has one search technology. Historically Yahoo has acquired technology companies and allowed each company to operate its technology in a silo. Integration of these different technologies is a time-consuming, expensive activity for Yahoo. Each of these software applications requires servers and systems particular to each technology. The result is that Yahoo has a mosaic of operating systems, hardware and systems. Yahoo!’s problem is different from Microsoft’s legacy boat-anchor problem. Yahoo! faces a Balkan-states problem. There are many voices, many needs, and many opposing interests. Yahoo! must invest in management resources to keep the peace. Yahoo! does not have a core competency in hardware engineering for performance and consistency. Yahoo! may well have considerable competency in supporting a crazy-quilt of hardware and operating systems, however. Yahoo! is not a software engineering company. Its engineers make functions from disparate systems available via a portal. Google also acquires technology. A good example is Picasa. The photo management software runs on the user’s Windows PC. The program has been integrated with several of Google’s network-centric applications: 1 Gmail. The user’s images can be uploaded and sent via email to friends, colleagues and family. A Picasa user without a Gmail account is able to register and receive a user name and password. The Gmail account can also be used, if the user wishes, for other Google services, including Fusion, which is Google’s personalized portal, and the search history function, which saves a registered user’s Google queries for later reference. 2 Blog Publishing. The user can post pictures to a Google property, Blogger.com. The image publishing function is simplified to one or two clicks. Posting images on some Web log systems is beyond the expertise of many computer users. 3 Image Printing. The user can send images to online photo processing services. 62 The Google Legacy
  • 9. Chapter Three: Google Technology One-click access to functions performed on the user’s local computer. Recently-viewed images One-click access to network services available as part of the user’s virtual application. In sharp contrast to Yahoo’s approach, Google integrated the Picasa application into the Googleplex. The “hooks” are painless to the user.8 Google has bundled into one free application point-and-click solutions to make management of digital still images intuitive and fluid. Yahoo!’s acquisitions, in general, are not woven into a seamless experience with other Yahoo! services. Consider the 3721.com search system. That service remains a separate Chinese language operation available from mostly non-English Yahoo pages. Google constructs an application using some code on the user’s PC and other software running on the Googleplex somewhere on the Internet. These three companies, different in structure and technical focus, are on a collision course. Like vessels in America’s Cup, each is going toward the same goal, but subject to forces difficult for their helmsman to control. Even though there is market space between the three, 8. Picasa requires a download. The installation process is smooth. Indexing speed was about five times faster than ACDSee’s image management program, a competitive product. With Picasa, Google’s technologists demonstrate a rapid, trouble-free installation and an intuitive interface. The Google Legacy 63
  • 10. Chapter Three: Google Technology collisions are inevitable. The figure below provides an overview of the mid-2005 technical orientation of Google, Microsoft and Yahoo. MSN, and by extension Microsoft Corporation, has a core competency in software. The company has grown from its operating system roots to provide a range of products for mobile devices, desktop and notebook computers, and enterprise-class servers. Looking forward, the company’s Dot Net technology is Microsoft’s framework for virtual applications. In some ways, Dot Net is a less-open version of the AJAX technology that Google uses in the Google Maps and Gmail products. Microsoft has expended great effort to push Windows downward to mobile devices and outward to network-centric computers in an effort to increase revenue. For Microsoft to continue to be the dominant force in software in the future, the company must be able to capture a commanding share of the market for network-centric applications. However, Microsoft’s position (whether real or perceived) is its products’ vulnerability to security breaches. Patch after patch, problem after problem, then promise after promise have done little to bolster the firm’s credibility for delivering secure systems and software. Looking forward over the next 12 to 18 months, Microsoft’s prospects hinge on security, cost and its developer community. The growth of open source alternatives are hard proof that die-hard Microsoft users are willing to shift for security, cost savings and functionality. Microsoft has weaknesses that can be attacked by Google and other competitors. Yahoo’s situation is typical to many American organizations. Most large US corporations are a hotch-potch of different systems, incompatible architectures and a Tower of Babel of data formats. For Yahoo to deliver specific markets to its advertisers, Yahoo must integrate information from disparate systems and be able to segment and deliver ads to those users efficiently. Yahoo is now spending money to break down the walls of its data silos and integrating its user data. If Yahoo cannot deliver narrowly segmented markets, advertisers may abandon Yahoo for services that offer more targeted marketing opportunities. After years of flirting with becoming a New Age America Online, Yahoo is beginning to behave like a traditional media company. 64 The Google Legacy
MSN and Yahoo! are becoming ad-supported general-interest portals in the mould of America Online and Tiscali. In contrast, Google is focusing on applications that tie users to its Googleplex. The company's focus on hardware and software engineering gives it a cost and performance advantage over MSN, Yahoo and others competing in Web search. Google's high-performance, homogeneous Googleplex means that the company does not struggle with some of the integration, performance and cost issues that bedevil Microsoft and MSN. Google may not be doing everything right from a computer science point of view, but compared with MSN and Yahoo, it is doing less wrong than these two aggressive competitors.

The Technology Precepts

Google's technology uses concepts and techniques from the leading edge of computer science. Most of these innovations are difficult to explain to engineers steeped in traditional approaches to massively distributed, highly parallelized computing. The eclectic footnotes and references in the earlier BackRub paper have been sharpened in Google's later technical presentations. Readers without a first-hand understanding of NOW-Sort, River, and BAD-FS are unlikely to craft dinner conversation from Google's explanations of the influence of these research computing demonstrations.9

9. See, for example, Andrea C. Arpaci-Dusseau et al., "High-Performance Sorting on Networks of Workstations", in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997, or John Bent et al., "Explicit Control in a Batch-Aware Distributed File System", in Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation, March 2004.

For the purposes of this monograph and understanding the nature of Google's technology, five precepts thread through Google's technical papers and presentations. The following snapshots are extreme simplifications of complex yet fundamental aspects of the Googleplex.

Cheap Hardware and Smart Software

Google's use of commodity hardware for high-demand, 24x7 systems has been a core precept since 1996. Most of its competitors' online systems combine branded hardware from IBM, Sun Microsystems, Hewlett-Packard and Dell with specialized peripherals. The operating systems in use are a combination of Unix and Microsoft operating systems with some Linux and open source components.

Google approaches the problem of reducing the costs of hardware, set-up, burn-in and maintenance pragmatically. A large number of cheap devices using off-the-shelf commodity controllers, cables and memory reduces costs. But cheap hardware fails. In order to minimize the "cost" of failure, Google conceived of smart software that would perform whatever tasks were needed when hardware devices fail. A single device or an entire rack of devices could crash, and the overall system would not fail. More important, when such a crash occurs, no full-time systems engineering team has to perform technical triage at 3 a.m.
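The smart-software precept can be shown with a toy example. The sketch below is not Google's code; it is a minimal illustration, written for this monograph, of a coordinator that marks a failed commodity box as dead and reroutes its task to a surviving one with no human intervention. The class and function names are assumptions made for the sketch.

```python
# Minimal sketch (not Google's code): reroute work around failed commodity hardware.
import random

class Worker:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def run(self, task):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{task} done on {self.name}"

def run_with_failover(task, workers):
    """Try healthy workers until one completes the task; no 3 a.m. triage required."""
    for worker in sorted(workers, key=lambda w: random.random()):
        try:
            return worker.run(task)
        except ConnectionError:
            worker.alive = False   # mark the cheap box dead; repairs happen later, in batches
    raise RuntimeError("no workers available")

if __name__ == "__main__":
    rack = [Worker(f"pizza-box-{i}") for i in range(4)]
    rack[0].alive = False          # a commodity box has failed
    print(run_with_failover("index shard 17", rack))
```

The point of the sketch is the division of labour: the hardware is allowed to fail, and the software absorbs the failure.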
The focus on low-cost, commodity hardware and smart software is part of the Google culture. In one presentation at a December 2004 technical conference, a Google spokesman joked that anyone in the room could buy the same hardware that Google uses at Fry's Electronics, a retail chain with stores in Palo Alto and other cities in California.

Logical Architecture

Google's technical papers do not describe the architecture of the Googleplex as self-similar, but they provide tantalizing glimpses of an approach to online systems that makes a single server share the features and functions of a cluster of servers, a complete data center, and a group of Google's data centers. The diagram below shows a representation of the Googleplex's tightly organized, highly regular organization of files, servers, clusters, and more than two dozen data centers in a stable organizational pattern.10

10. The illustration is a Sierpinski Triangle, chosen because it conveys how each component in Google's infrastructure replicates other, larger combinations of servers and data centers. The overall structure, in this illustration an equilateral triangle, expresses the stability of the Google approach to its system. This famous fractal connotes how Google scales without altering the micro or macro structure of the Googleplex.

Figure labels (reconstructed): a single Google pizza box server reflects the controlling organizing principle; a single Google cluster embodies the same organizing principle as a replicated Google file server; a data centre is a larger instance of the design, composed of the same organization of racks; the Googleplex uses the same design.

The diagram illustrates that Google's technical infrastructure is similar at every level in the Googleplex. The collection of servers running Google applications on the Google version of Linux is, in effect, a supercomputer. The Googleplex can perform mundane computing chores like taking a user's query and matching it to documents Google has indexed. Furthermore, the Googleplex can perform the side calculations needed to embed ads in the results pages shown to users, execute parallelized, high-speed data transfers like computers running state-of-the-art storage devices, and handle necessary housekeeping chores for usage tracking and billing.
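The self-similar idea can be expressed compactly in code. The sketch below assumes nothing about Google's actual software; it simply shows, in the style of a composite structure, how a cluster, a data centre, and the whole Googleplex could present the same interface as a single pizza box server, so that adding capacity is the same operation at every scale.

```python
# Illustrative sketch only: self-similarity expressed as a composite structure.
class PizzaBox:
    def capacity(self):
        return 1

    def nodes(self):
        return [self]

class Group:
    """A cluster, a data centre, or the Googleplex: a group of smaller units
    that presents the same interface as a single pizza box server."""
    def __init__(self, children):
        self.children = list(children)

    def capacity(self):
        return sum(child.capacity() for child in self.children)

    def nodes(self):
        return [node for child in self.children for node in child.nodes()]

    def plug_in(self, unit):
        # adding a rack, a cluster, or a data centre is one operation at any scale
        self.children.append(unit)

cluster = Group(PizzaBox() for _ in range(40))              # one rack of pizza boxes
data_centre = Group([cluster])                              # a data centre of clusters
googleplex = Group([data_centre])                           # data centres worldwide
data_centre.plug_in(Group(PizzaBox() for _ in range(40)))   # a new rack comes online
print(googleplex.capacity())                                # prints 80
```

Because every level answers the same questions (how much capacity, which nodes), the larger structure does not need to know whether a new resource is a single box, a rack, or an entire facility.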
What is of interest is that Google does this with low-cost commodity hardware running on Google's version of Linux. Google has infused the Googleplex with logic that allows software to handle data recovery, to streamline messages passed from server to server, and to grab additional computing resources in order to complete a job quickly. When Google needs to add processing capacity or additional storage, Google's engineers plug in the needed resources. Due to self-similarity, the Googleplex can recognize, configure and use the new resource. Google has almost unlimited flexibility with regard to scaling and accessing the capabilities of the Googleplex. Unlike a collection of different building materials, Google's approach delivers a homogeneous computing system.

A good example is bringing a new rack of 40 or more pizza box servers online and creating one of the many types of servers Google uses.11 Servers, according to the fractal architecture, consist of two or more clusters of pizza boxes. A cluster allows data to be replicated and work shared among pizza boxes with spare capacity. A rack is assembled and then Google's pizza box servers are "plugged in." Cables are attached among the pizza boxes and the rack is then plugged into a network hub. An engineer turns on the power, and the other devices become aware of the new rack's resources. Master servers, Google's term for the pizza box that is in charge of one or more clusters, instruct other servers to copy data to the new cluster and begin using the clusters to do work.

11. Data centers use computer cases that are shaped like the boxes used to hold pizzas. The term "pizza boxes" has been appropriated by engineers to describe one of the standard form factors for servers housed in rack mounts in data centers.

In Google's self-similar architecture, the loss of an individual device is irrelevant. In fact, a rack or a data center can fail without data loss or taking the Googleplex down. The Google operating system ensures that each file is written three to six times to different storage devices. When a copy of that file is not available, the Googleplex consults a log for the location of the copies of the needed file. The application then uses that replica of the needed file and continues with the job's processing. Redundancy and other engineering tweaks to Linux give the Googleplex ways to eliminate or reduce the bottlenecks associated with the operation of traditional online computer systems.

The Google technical recipe includes distributed computing, optimized file handling, and embedded logic to make the servers working on tasks smarter. This architecture allows Google to expand its computational capacity, its storage and its supported applications with an ease and at a price point rivals cannot easily match. According to Jeff Dean, one of Google's senior engineers, "At Google, everything is about scale."12

12. Statement made at the University of Washington, October 2004.

Speed and Then More Speed

Google Search is fast, with most results coming back to the user in less than one second. In commercial data centers, speed has traditionally been achieved by buying high-end, high-performance hardware from manufacturers such as Sun Microsystems and using advanced storage devices connected to the servers by exotic fibre optics.
Not Google. Google uses commodity pizza box servers organized in a cluster. A cluster is a group of computers joined together to create a more robust system. Instead of using exotic servers with eight or more processors, Google generally uses servers that have two processors similar to those found in a typical home computer. Through proprietary changes to Linux and other engineering innovations, Google is able to achieve supercomputer performance from components that are cheap and widely available.

The table below provides some data from 2002 about the speed with which Google can read data from hard drives.13 These data show the results of two clusters' performance. Google's read throughput has gone up since 2002. Based on increases in commodity drive throughput, Google's read rate may be close to 2,000 megabytes per second, although that figure may reflect Google watchers' enthusiasm boosting already-robust numbers.

13. From "The Google File System" by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google), ACM SOSP 2003 Conference Proceedings, 1-58113-757-5/03/0010, page 12.

To put these data in the context of 2002 technology, consider that an IBM EXP3 storage device available in 2002 could read data in burst mode at the rate of about 58 MB/second. Google's read rate in 2002 averaged ten times the read rate of the IBM EXP3. The write rate is comparable. The cost of a single IBM EXP3 in 2002 was about $18,000 for 360 gigabytes of storage, excluding controller and cables. Google's cost for comparable storage and the higher performance was about $1,000. For greater speed, Google spends less. In a world of ever-increasing demands for speed and storage, Google has a strong one-two punch.14

14. With Google's advanced programming tools, Google is able to increase the productivity of its engineers. Combined with hardware speed and performance, Google squeezes out more productivity by applying its engineering talents to application development. This is a one-two-three punch to which Google's competitors have to respond.

Advances in commodity storage devices translate to even faster performance for Google. Google has not updated its read rate data, but engineers familiar with Google believe that read rates in some clusters may approach 2,000 megabytes a second. When commodity hardware gets better, Google runs faster without paying a premium for that performance gain.

Google engineers its systems for computational speed. Google's approach has been to focus on making its software engineering produce turbocharged performance. Speed is crucial to Google's PageRank and other analytic processes. If Google's computational throughput were slow, Google could not perform the work needed to know that, for a particular query, a particular set
of indexed Web pages is the best match. Without fast response to a query, users would not be willing to run multiple queries and interact fluidly with the Google applications.

Google does not mindlessly match key words in a user's query to the terms in the Google index. Google's approach is more subtle and computationally involved, although term matching is an important part of the Google process. Google reviews data: various scores or values produced by certain algorithms. Google then uses these different values in other algorithms to find search results, identify the best match (Google's "I'm Feeling Lucky" link), extract matching ads from its advertising server, and continuously update values as Google users click on links. Once these various query and ad matching processes are complete, Google displays the results page to the user, typically in less than one second across a public network.

Google is a hot rod computer that can perform the basic mathematics needed to deliver most search results in less than half a second, display maps with the speed of a dedicated desktop application like Encarta, and look at a Web page matching a user's query and, in some applications, insert additional hyperlinks to related content before displaying the results page to the user. The Googleplex does experience slowdowns. When these occur, the Googleplex allocates additional resources to eliminate the brownout.

Speed has many meanings at Google. Speed means that users can interact with the Google products and services as if the Google application were running on a dedicated PC in front of the user. Speed also means that Google must be able to expand its computational and storage capacity quickly. Speed also means rapid development and deployment of new products. Speed, like Google's ability to scale, is a core functionality of the Googleplex.

Google applies its high-speed technology to search and to other types of servers. Among the servers using Google's go-fast technology are those shown below, listed by type and function:

Advertising server: delivers text and other paid advertisements for AdWords and AdSense.
Chunkserver: schedules and delivers blocks of data for further processing.
Image servers: serve images for Google Image, Print and Video services.
Index server: the workhorse of search; handles search-and-retrieval.
Mail server: delivers the Gmail service.
News server: gathers, analyses and displays news.
Web server: orders results and makes them available to users.

What does the combination of go-fast technology plus multiple types of Google data allow the company to do? Google can engage in fast new product development. One example is Google Maps. Google developed a basic mapping product over the course of 2004. In late 2004, Google purchased Keyhole. By June 30, 2005, Google had:

1. Released a basic mapping product.
2. Integrated information from Google Local in early 2005.
3. Hooked Keyhole satellite imagery into Google Maps in early May 2005.
4. Announced Google Earth in May 2005.
5. Upgraded the system to integrate two-dimensional point-to-point routes on top of satellite imagery.
6. Demonstrated a function that accepts a query in another language, translates the results to the user's language, and displays the data in a three-dimensional mode.

The image below shows that Google's Map and Earth service pushes the functions of online map and data integration to another level. In the span of several days, Google integrated Keyhole technology, launched, upgraded and redefined online mapping services.15

15. The source for this image was http://blog.eee-craft.com/archives/23345086.html.

These are the results of a Japanese-language Google Maps-Earth query for the location of Wendy's restaurants in New York City. The addition of Japanese-language support, the three-dimensional view of the section of Manhattan where the user wants directions, and the integration of hot links, the two-dimensional map, and information about the restaurants was part of Google's fast-cycle launch and enhancement program designed to beat Microsoft to the market.

Another key notion of speed at Google concerns writing computer programs to deploy to Google users. Google has developed shortcuts to programming. An example is Google's library of canned functions, which makes it easy for a programmer to optimize a program to run on the Googleplex computer. At Microsoft or Yahoo, a programmer must write some
code or fiddle with code to get different pieces of a program to execute simultaneously using multiple processors. Not at Google. A programmer writes a program, uses a function from a Google bundle of canned routines, and lets the Googleplex handle the details (a minimal sketch of this style of programming appears at the end of this chapter). Google's programmers are freed from much of the tedium associated with writing software for a distributed, parallel computer.

What does increased programmer productivity mean? In terms of money, Google makes each engineering dollar go farther. If a single programmer can reduce by 10 percent the time required to code a program, the savings could be several thousand dollars. If a programmer can slash coding time in half, Google gets twice the potential productivity out of each of its 3,000-plus programmers.16

16. Some Google programmers have complained about the peer pressure to perform. Google management faces a challenge in managing its programming talent. Staff burnout or defections could impair Google's technical resources.

Eliminate or Reduce Certain System Expenses

Some lucky investors jumped on the Google bandwagon early. Nevertheless, Google was frugal, partly by necessity and partly by design. The focus on frugality influenced many hardware and software engineering decisions at the company. Spending money wisely does not mean spending it cheaply. Examples of how Google eliminates or reduces certain system expenses include:

• Google eliminates the costs associated with backing up and restoring data when a hardware failure occurs. The fractal principle requires that Google replicate data three to six times elsewhere in the Googleplex. When a device fails, the "master server" for a task looks at a file that tells where the other copies of the data or the programs are. The "master server" then uses those data or those processes to complete the task. No tape, no human intervention, and no downtime: Google does not have these costs, due to its engineering acumen.

• Google does not have to certify new hardware. When additional storage or computational capacity is required, Google technicians assemble one or more racks of Google "pizza boxes." Once in the rack, the Googleplex recognizes the new resources in a way that is similar to how a laptop knows when a user plugs in a USB mouse. The expensive certification processes otherwise required for some high-end hardware are eliminated. Google engineers plug in resources and let the Googleplex handle the other tasks.

• Google innovation uses open source code as a starting point. Many of Google's most striking technical advances are based on modifying open source software to benefit from insights gained from experimental results in supercomputing. Google does not have to work around known bottlenecks in some commercial operating systems. Unlike Microsoft, Google did not write a complete operating system for its Googleplex. Google made key changes to Linux, adding necessary services and functions to meet the specific requirements of Google applications. Google's approach is pragmatic and less time-consuming
than Microsoft's "death march" to get Longhorn shipped by late 2006. Compared with Yahoo, Google's approach is more cohesive. Yahoo faces integration drudgery as a result of its multiple systems and heterogeneous hardware and data. Google has used Linux, standards, and open source software for virtually all of its core services and thus spends less time pounding disparate systems and data into a standard type.17

17. Google does not explicitly state that it has embraced a services-oriented architecture, or SOA. However, many of Google's practices illustrate an informed use of certain features of SOA.

• Google does not spend money on high-performance devices to make its system perform faster. To illustrate the financial payoff from the use of commodity hardware, Google engineers revealed a back-of-the-envelope calculation. Although dated, it underscores the economies of the Google approach:18

The cost advantages of using inexpensive, PC-based clusters over high-end multiprocessor servers can be quite substantial, at least for a highly parallelisable application like ours. For example, a $278,000 rack contains 176 2-GHz Xeon CPUs, 176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk space; it costs about $758,000. In other words, the multi-processor server is about three times more expensive but has 22 times fewer CPUs, three times less RAM, and slightly more disk space. Much of the cost difference derives from the much higher interconnect bandwidth and reliability of a high-end server, but again, Google's highly redundant architecture does not rely on either of these attributes. [Emphasis added]

18. Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Computer Society, 0272-1732/03, March/April 2003.

This means that when Microsoft or Yahoo! spends US$3.00 for better performance, Google spends less than US$1.00.19 Over time, competitors such as IBM, Microsoft or Yahoo may build similar features into their network-centric services. Until then, Google has a cost advantage, at least with regard to scaling online operations. If these 2002 data can be accepted, Google spends one-third as much for more computing horsepower and disk space than companies using a traditional server architecture.

19. A review of Google's cost estimates for this monograph revealed that Google is understating its cost advantage by one or two orders of magnitude. As the performance of commodity hardware goes up, the cost of that hardware goes down. Bulk purchasing chops as much as 50 percent off the cost of some hardware. Google can replicate its data and give away free gigabytes of email storage. The cost to Google can be as low as a few cents a gigabyte.

Snapshots of Google Technology

Google engineers generate a large volume of technical information. Some of the data are in the form of patents, often written in a style that communicates little of the patent's substance to a lay reader. The link for Google's publications can shift unexpectedly.20 Exploring
biographies of Google executives and Google Web logs can yield some useful technical information. For example, one Google biography linked to more than 36 personal projects, including one by Google's CEO.21 Surprisingly, Google's search engine does a hit-and-miss job of indexing Google's own technical information.

20. See http://labs.google.com/papers.html#compilers on June 1, 2005.

21. This is the lex project that "helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine."

Useful engineering information appears on the Google Web site. The topics covered in various monographs, white papers and technical notes concern a wide range of subjects. For example, in mid-2005, papers were available on such topics as algorithms, compiler optimization, information retrieval, artificial intelligence, file system design, data mining, genetic algorithms, software engineering and design, and operating systems and distributed systems, among others. Google explains its use of very large files as well as how the Google-modified version of Linux automatically allocates work and avoids the file system bottlenecks that can plague Solaris and Windows Advanced Server 2003, among others.

Google's technical papers and Google patents provide some insight into areas of interest at Google. For example, Google is posting more information about operating systems and applications. The thrust of Google's innovation is to build out the search platform and expand the functionality of its back-office programs such as those used for advertising services. The annex to this monograph provides information about more than 60 patents for which Google is believed to be the assignee. To provide a more fine-grained look at Google technology, the table below identifies selected examples of innovations documented by Google engineers or researchers close to the company. Most of these papers appeared prior to Google's receiving a patent for the technology referenced in these reports:

Google Suggest. Purpose: helps users find needed information by analysing queries and suggesting other queries. To learn more: 2004 IEEE International Conference on Services Computing (SCC'04), Stephen Davies, Serdar Badem, Michael D. Williams, Roger King, September 2004.

Video Object Search. Purpose: the user types an object name and Google finds that object in a video. To learn more: Ninth IEEE International Conference on Computer Vision, Volume 2, Josef Sivic, Andrew Zisserman, October 2003.

MapReduce. Purpose: new functions in Google Linux to speed programming and other processes involving large data sets. To learn more: OSDI Proceedings, December 2004.

Google File System. Purpose: extension to Google Linux to allow high-speed data reads and writes from commodity drives. To learn more: ACM Publication 1-58113-757-5/03/0010.
Identify Authoritative or High-Value Sources in Web Content. Purpose: uses pattern mining to generate a numeric value indicating an authoritative source as an indication of content quality. To learn more: Seventh International Database Engineering and Applications Symposium (IDEAS'03), Haofeng Zhou, Yubo Lou, Qingqing Yuan, Wilfred Ng, Wei Wang, Baile Shi, July 2003.

MetaCrystal. Purpose: metasearch technology that allows a single query to retrieve and organize results in a visual display. To learn more: Second International Conference on Coordinated & Multiple Views in Exploratory Visualization (CMV'04), Anselm Spoerri, July 2004.

Drawbacks of the Googleplex

The coaching mantra "No pain without gain" is true for Google. Google does make mistakes, and some big ones. The example fresh in news headlines is Web Accelerator. The product was introduced in May 2005 and withdrawn less than six weeks later. Speed and nimbleness aside, Web Accelerator was technology that ran head-on into "issues." Of greater consequence are the periodic slowdowns for Gmail. The Googleplex is scalable, but until more servers are online, users may face annoying delays.

Going Too Fast: The Google Web Accelerator

The Web Accelerator software was supposed to use Google servers to store Web pages a user viewed. Web Accelerator parsed a page in the user's browser. The Web Accelerator function then followed each link on that specific page. The page was then stored in a Google cache. When the user clicked on a link, the user would see the page from the Google cache, thus reducing the time required to display the page to the user.

Web Accelerator worked fine on such sites as www.whitehouse.gov, which makes minimal use of advanced Web services. Unfortunately, the Web Accelerator function followed links that transmitted instructions to Web applications. For example, Web Accelerator would click on "delete" links, causing some Web applications such as Backpack to remove the user's preferences or content.22 Web Accelerator blithely ignored confirmations generated by JavaScript, so unintentional instructions were transmitted.

22. Backpack is a Web application that sends a user the contents of any page as email. See www.backpackit.com.

Some Google watchers raised questions about caching data as well as privacy and copyright issues. Before these concerns reached a crescendo, Google reported that Web Accelerator had reached its capacity. Google blocked downloads of the product.

The Laws of Physics: Heat and Power 101

Google does not reveal the number of servers it uses, but the number is believed to be in the 150,000 to 170,000 range as of June 30, 2005. Conflicting information surfaces in Web logs and in talks at conferences. In reality, no one knows. Google has a rapidly expanding number of data centers. The data center near Atlanta, Georgia, is one of the newest deployed. This
state-of-the-art facility reflects what Google engineers have learned about heat and power issues in its other data centers. Within the last 12 months, Google has shifted from concentrating its servers at about a dozen data centers, each with 10,000 or more servers, to about 60 data centers, each with fewer machines.23 The change is a response to the heat and power issues associated with larger concentrations of Google servers. The most failure-prone components are:

• Fans.
• IDE drives, which fail at the rate of one per 1,000 drives per day.
• Power supplies, which fail at a lower rate.

Repairs are batch operations. Scheduling the fixes is a major job, and work is underway to improve the Google-developed scheduling capability. Google has to locate hosting facilities that can meet the company's heat and power requirements.

23. These data appear at www.mcdar.net/SEOTools.htm

Other Data Center Issues

Google data centers have access to multiple high-speed lines and normal data center functions such as redundant power, traffic routing and strict rules governing access to the physical boxes. PRWeaver's Web log contained a posting of a photograph allegedly taken inside a Google data center. If genuine, the physical layout of the racks, holding an estimated 2,000 or more servers, squeezes a large amount of hardware into a tightly packed space. This type of dense configuration helps explain the comments about Google's heat and power concerns. Most data centers were not designed to handle dense concentrations of thousands of servers. Heat contributes to hard drive failures. On the plus side, the dense configuration makes set-up and maintenance somewhat easier. Google packs servers on two sides of a rack.

A unique property of the data centers is that replicated content can be written from one data center to another. Google data within a data center are replicated on other servers and other clusters running in the racks. The Google "plug and play" engineering philosophy appears to be used in and across data centers.

If a data center, such as the one described above, needs additional index server capacity, the technicians in that center can build a Google rack of 40 pizza box servers. These servers are connected to the network. When the rack is powered up, it becomes available to the master servers for that data center. These master servers then mark the rack's resources as available. Master servers then begin sending work to the new devices. The information about data
centers indicates that this "plug and play" concept and automatic discovery of new resources apply to new data centers, not just the racks within them. It may be an exaggeration to say that a Google rack and the data center in which the rack resides work like a USB mouse, but the general concept seems to be what Google engineers have tried to achieve. By eliminating such tasks as certifying and configuring Small Computer System Interface (SCSI) RAID storage devices, Google is content to let the auto-discovery functionality alert a "master server" to a new resource, master servers to alert other master servers, masters to notify clients of tasks, and data centers to pass information that racks, clusters or a new data center are available for use.

As a Google engineer said, "Wherever we put a cluster, we have heat, cooling and power issues. When we put in a data center, that data center operator faces new challenges. We use each day four megawatts of electric power." The problems include:

1. Heat. Special racks with fans that cool the core of the rack are used.
2. Power. The power demand at load is greater than data centers typically sustain. "Our cages are custom built and there's a lot of work done by us and the data center people before we can flip the switch," said Jeff Dean, a senior Google engineer.
3. Network management tools. Google has had to create network management tools to manage its self-healing, automatic-failover operating system.

What's Up, Sergey?

The Google data centers are concentrated in North America, with other data centers located in Switzerland, the Pacific Rim, and Beijing.24

24. The Beijing data center was purpose-built to conform to the ruling body's requirements for online access, monitoring and related issues. Google complied in order to do business in China. Yahoo! bought 3721.com in order to accelerate its effort in China.

Because the GOS is self-healing, the operating system and the various "master computers" in a cluster know which device is online and which device is dead. Off-the-shelf network management tools are not tailored to Google's requirements. Therefore, Google is developing network management and monitoring tools so that the information in the Google operating system log files can be displayed in a meaningful way to Google network engineers. The overall Googleplex works and continues working even if a device, rack or data center goes dark or dies. Network management tools have to provide a broad range of monitoring and support functions for the global network, devices, data flows, workloads and potential problem areas. Google is developing the needed network management tools specifically for the Googleplex.
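The auto-discovery and self-healing behaviour described above can be sketched in a few lines. The example below is illustrative only; the names (Master, register, heartbeat) are assumptions made for this sketch and are not Google's tooling. It shows the general pattern of a master that learns about newly racked servers when they announce themselves and quietly routes work away from devices that stop reporting.

```python
# Hedged sketch: a master that discovers new devices and skips dead ones.
import time

class Master:
    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds
        self.last_seen = {}              # device name -> time of last heartbeat

    def register(self, device):
        """A newly racked server announces itself and becomes schedulable."""
        self.last_seen[device] = time.time()

    def heartbeat(self, device):
        self.last_seen[device] = time.time()

    def live_devices(self):
        """Devices that missed their heartbeat are skipped; work routes around them."""
        now = time.time()
        return [d for d, seen in self.last_seen.items() if now - seen < self.timeout]

master = Master()
master.register("rack42/pizza-box-07")   # rack powered up, devices announce themselves
master.register("rack42/pizza-box-08")
master.heartbeat("rack42/pizza-box-07")
print(master.live_devices())
```

The monitoring tools Google is said to be building would sit on top of exactly this kind of state: which devices have been seen recently, which have gone dark, and where the work is flowing.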
Unanticipated Faults Could Derail Google's Juggernaut

Google's network uses a number of concepts from the fringes of computer innovation as well as the hands-on knowledge gained from operating the Googleplex itself. The result is a highly resilient network that may breed problems not previously encountered. Although Google has operated for more than five years without downtime from system failure, the possibility, however remote, does exist that something unanticipated could occur. A sufficiently large problem could deal Google a severe blow. The advanced technology of Google's MapReduce tool and its 400-module library could pose as yet unforeseen technical problems.

The diagram, produced by Google engineers, shows how Google's approach eliminates the bottleneck in parallelized systems produced by excessive message traffic flowing through a server coordinating work among different computers.

Summary of Google's Drawbacks

Critics of Google can point to three "problems" with Google's approach to performance. First, Google is a one-trick pony. The changes to Linux and the other technical modifications are little more than hackers' attempts to squeeze out a small performance gain. Second, Google's use of commodity hardware and cheap storage is a risky solution. Unknown problems may lurk when cheap components are used in a mission-critical system. Increasing the potential risk are the changes Google makes to speed up program execution.
Finally, other operating systems, including those from computer research laboratories and even Microsoft, do the same things and have for years.

Leveraging the Googleplex

Google has demonstrated that search is just one application that can run in the Google environment. There are many other applications that can benefit from Google's approach to online services:

1. Applications that require a high performance payoff for a low cost, such as electronic mail.
2. Applications that can run in Google's redundant environment where there is no private-state replication of the sort found in IBM's AS/400 operating environment and others.
3. Computationally intensive, stateless applications.
4. Applications that require request-level parallelism, a characteristic exploitable by running individual requests on separate servers, such as Google Earth (a simple sketch of request-level parallelism appears at the end of this discussion).

There is little to be gained by trotting out war-horses to trample Google. The user experience speaks for itself. Google's approach to massively parallel distributed computing works, even on dial-up networks. Google fused the type of thinking associated with small, cash-strapped companies with techniques from advanced computer systems. Commodity products keep costs down. A modified Linux delivers fast performance at a bargain-basement cost. Google is taking a strategic risk with commodity hardware and a souped-up version of Linux. Each day Google bets that its technologists can keep the system humming.

Another reason why Google's approach to technology is paying off is that Google employs the same pragmatism and cleverness in application development. Google uses standard engineering practices, proprietary knowledge, and off-the-shelf techniques such as Web services. Google uses the same Web programming techniques that millions of Web developers use. The payoff is that it is easy for Google to hire people who can code for the Googleplex. Google so far has not had to spend money on developer marketing programs or on training new hires to work in the Googleplex.

The biggest boost to Google's technical approach is that its competitors are following different, more expensive approaches. Yahoo is a fruitcake of hardware, operating systems, and applications coded at different times in different languages by different people. Microsoft uses its own operating systems but relies on other operating systems as well, including Solaris. Microsoft must invest in hardware to squeeze performance out of its platforms. Yahoo wrestles with its many different platforms. Microsoft seems powerless to enhance the speed of its operating system. Both are digital ostriches burying their heads in their own marketing material.
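To make the fourth item in the list above concrete, here is a hedged, single-machine sketch of request-level parallelism. Each query is independent and stateless, so any available worker can handle it; threads stand in for separate servers, and none of the names below come from Google's systems.

```python
# Illustrative sketch of request-level parallelism: independent, stateless requests
# can be handled on any available server with no shared state between them.
from concurrent.futures import ThreadPoolExecutor

def handle_request(query):
    # stateless work: nothing here depends on any other request
    return f"results for '{query}'"

queries = ["pizza near seattle", "sierpinski triangle", "satellite view manhattan"]

# each request can run on a separate worker; threads stand in for separate servers
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(handle_request, queries):
        print(result)
```

Because no request waits on another, capacity scales by adding workers, which is precisely the property the Googleplex exploits.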
Google's technology is one major challenge to Microsoft and Yahoo. To conclude this cursory and vastly simplified look at Google technology, consider these items:

1. Google is fast anywhere in the world.
2. Google learns. When the heat and power problems at dense data centers surfaced, Google introduced cooling and power-conservation innovations to its two dozen data centers.
3. Programmers want to work at Google. "Google has cachet," said one recent University of Washington graduate.
4. Google's operating and scaling costs are lower than those of most other firms offering similar services.
5. Google squeezes more work out of programmers and engineers by design.
6. Google does not break down, or at least it has not gone offline since 2000.
7. Google's Googleplex can deliver desktop-server applications now.
8. Google's applications install and update without burdening the user with gory details and messy crashes.
9. Google's patents provide basic technology insight pertinent to Google's core functionality.

A young programmer in Osaka or Beijing is very likely to have been influenced by Google. Skilled programmers want to work at Google, develop for the Googleplex, and, if possible, create their own Google killer. The mantra is, "Be like Sergey and Larry."

Google has a next-generation computing platform. That platform is optimised to deliver virtual applications to its users worldwide. Google uses standard Web technologies in clever ways. Although the technical challenges facing Google are formidable, the company has advanced the art of online computing.
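As a closing illustration of the canned-function programming model referred to throughout this chapter (MapReduce, in Google's terminology), the sketch below shows the shape of what a programmer writes: a map function and a reduce function, with the framework left to distribute the work across thousands of servers. This single-machine version is a simplification written for this monograph, not Google's library; the word-count task is the canonical textbook example, not a Google application.

```python
# Hedged sketch of the MapReduce programming model: the programmer supplies map and
# reduce functions; a real framework would parallelise the loops across many servers.
from collections import defaultdict

def map_phase(document):
    # emit (word, 1) pairs, as in the canonical word-count example
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

def run_mapreduce(documents):
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):      # the framework would run these in parallel
            grouped[word].append(count)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(run_mapreduce(["google legacy", "google technology", "the google legacy"]))
```

The appeal for a programmer is that nothing in the two supplied functions mentions servers, clusters or failures; those concerns belong entirely to the framework.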