1. A Peek Behind the Curtain
A Basic Explanation of How Google Works
Caryn Brown – 22 May 2014 – BNI
2. GOOGLE’S MISSION STATEMENT
To organize the world's information and make it universally
accessible and useful.
22 May 2014 www.DigitalMediaButterfly.com 2
3. GOOGLE FACTS
1. At the end of 2013, almost 70% of all web searches were done on Google.
2. In 2013 there were over 2 trillion searches on Google, averaging almost 6
billion searches per day.
3. Google was started in 1998 and has grown steadily every year.
4. Google is known for asking very interesting (hard) interview questions.
5. Google generates income of $20 billion a year.
6. 16% of the searches Google sees are new searches.
7. Google is VERY secretive about its algorithms and its data centers.
8. Google’s most recent major algorithm change (to its Panda system) was
made on 20 May 2014.
9. Google has over 19 data centers in the US and at least 17 more around the
world.
10. 56% of internet users Google themselves.
4. GOOGLE’S USER INTERFACE AND DESIGN
• Google’s approach is to keep the user interface clean and
simple.
• All changes are put through user studies, analysis, and
testing.
• Google is concerned with both simplicity and server stability.
• The User Interface design is the responsibility of cross-functional
teams, including psychologists, business
analysts, and blue-sky researchers.
6. CRAWL THE WEB
• "Crawling" is the process of following links to
locate pages, and then reading those pages to
make the information on them searchable.
• The more links you have from higher authority
pages, the greater your own pages’ authority
grows.
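The crawl step above can be sketched as a breadth-first traversal over links. This is a minimal illustration using a hypothetical in-memory “web” in place of real HTTP fetches; actual crawlers run in parallel, rate-limit per site, and revisit pages on a schedule:

```python
from collections import deque

# Toy "web": URL -> (page text, outgoing links). Hypothetical stand-ins
# for pages a real crawler would fetch over HTTP.
FAKE_WEB = {
    "a.com": ("google search engine", ["b.com", "c.com"]),
    "b.com": ("crawling the web", ["c.com"]),
    "c.com": ("building an index", []),
}

def crawl(seed):
    """Breadth-first crawl: follow links to discover pages, reading each
    page once so its text can later be made searchable."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        text, links = FAKE_WEB.get(url, ("", []))
        pages[url] = text      # store content for the indexing step
        queue.extend(links)    # schedule newly discovered URLs
    return pages
```

Starting from `a.com`, the crawler discovers and reads all three pages even though only one was given as a seed — which is why, as the notes mention, unlinked sites tend not to get crawled.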
7. BUILD THE INDEX
• Google separates the content into 2 indexes:
– Page Titles and Link Data
– Page Content
• So, when you search the web you are not searching
the web, you are actually searching Google’s
databases.
• Google’s index is well over 100 million gigabytes.
– Google has spent over 1 million computing hours to build
it.
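A toy version of such an index can make the “you are searching Google’s databases, not the web” point concrete. This sketch builds a single inverted index (words mapped to pages) rather than Google’s split title/link and content indexes:

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: each word maps to the set of pages
    containing it. Searches consult this index, not the live web."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return pages containing every query word (intersection of the
    per-word posting sets)."""
    postings = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

A query never touches the original pages; it only intersects precomputed word lists, which is what makes answering out of a 100-million-gigabyte index fast.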
8. RANK THE CONTENT
• Before the content is added to Google’s database:
– Google estimates the domain and page’s overall
authority
– All pages are checked against Google’s editorial policies
– Penalties are calculated
• Each page now has data in Google’s database and
will be displayed in relevant searches
9. USER QUERY
• Someone goes to Google.com and enters a
search.
• Google suggests searches based on what has
been typed so far.
• Google uses synonyms to look for similar words
to include in the search query.
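Synonym expansion can be sketched as follows. The synonym table here is a hypothetical stand-in; Google mines similar-word relationships from its query logs at a vastly larger scale:

```python
# Hypothetical synonym table; Google derives these from query-log analysis.
SYNONYMS = {
    "car": {"auto", "automobile"},
    "photo": {"picture", "image"},
}

def expand_query(query):
    """Add synonyms of each query word so pages that use similar
    wording are also retrieved."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms |= SYNONYMS.get(word, set())
    return terms
```

A search for “car repair” would then also match pages that only say “automobile repair”.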
10. CREATE RESULT
• Google may find millions of results, but only the top
1,000 or fewer are displayed. Google’s goal is to give
you the answer, not lots of webpages.
• Local websites are promoted in the results.
• Duplicate results are removed.
• Relevant ads are found and placed in order to
appear with the search results.
11. SERVE SEARCH RESULTS
• Google often promotes past websites the user
has visited.
• Google places additional weight on results
featuring trending topics.
• If there are multiple pages from the same
domain with high rank, they may all be clustered
together.
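The domain-clustering behavior described above can be sketched like this. It is a simplification under an assumed rule (pull later results from a domain up next to that domain’s highest-ranked result); Google’s actual grouping logic is not public:

```python
from urllib.parse import urlparse

def cluster_by_domain(ranked_urls):
    """Keep the overall ranking order of domains, but group every
    result from the same domain next to that domain's best result."""
    order, groups = [], {}
    for url in ranked_urls:
        domain = urlparse(url).netloc
        if domain not in groups:
            groups[domain] = []
            order.append(domain)   # domain ranked by its first (best) hit
        groups[domain].append(url)
    return [url for d in order for url in groups[d]]
```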
12. ANATOMY OF A GOOGLE RESULTS PAGE
13. Additional Resources
• How does Google search work?
with Matt Cutts
http://youtu.be/KyCYyoGusqs
• Google’s Inside Search
http://www.google.com/insidesearch/
Editor’s notes
You notice that this says nothing about the web, crawling, PageRank or any of the other details -- these guys think big
Google had 9,800 searches per day (about 3.6 million total) in its first year
You need to check that your friend, Bob, has your correct phone number, but you cannot ask him directly. You must write the question on a card and give it to Eve, who will take the card to Bob and return his answer to you. What must you write on the card, besides the question, to ensure Bob can encode the message so that Eve cannot read your phone number?
How many golf balls fit in a school bus?
Google refines its ranking algorithms with over 500 improvements per year.
The May 2014 update was a “softer,” kinder update for legitimate websites, cracking down on sites trying to “game” (cheat) the system; it will affect about 8% of the searches on Google.
Google has been rolling out updates monthly
Google is working on more changes to help small businesses by pushing spammers and content mills into far less prominent search results
Google aims to be carbon neutral with their data centers.
Google servers are housed in standard shipping containers that hold 1,160 servers each.
In March 2013 there were 30 trillion webpages
Google uses Universal Search: it does not just search webpages, but also images, maps, books, video, and social media.
"Crawling” is sometimes known as robot spidering, gathering or harvesting.
If there are no links to your site, typically it will not get crawled deeply or regularly
The Google crawler, known as GoogleBot, crawls all the URLs it knows about every few weeks. It checks that the page is still available, gets any updated information, and follows links to pages it hasn't seen before. Some sites, such as news sites, get crawled more frequently, so that the Google index has the most recent data -- they could be indexed daily or even hourly.
These robots have to be sensitive to webmasters, so they limit the number of times they hit each site per minute. The software is very fast, so they can crawl many sites in parallel.
The GoogleBot, like all the other major search engine crawlers, obeys the "robots.txt" directives, avoiding pages which the webmaster has designated as off-limits.
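Python’s standard library ships a robots.txt parser, which illustrates how a crawler checks these directives before fetching a page. The robots.txt body and URLs below are hypothetical examples (normally the parser fetches the file itself via `set_url(...)` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly instead of fetching it over HTTP.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

rp.can_fetch("Googlebot", "http://example.com/page.html")  # allowed
rp.can_fetch("Googlebot", "http://example.com/private/x")  # off-limits
```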
Links to all my clients’ sites from my website
Links to my website from all my clients’ sites
The page titles and link data index is used for broad and competitive searches
The page content index is used for obscure and long tail searches
Google’s database is constantly updated
Google knows about 3 billion web documents, including images, PDF and other file formats, Usenet newsgroups and news.
See handout for search term operators
As you type your query, you will start seeing predictions of searches you might be looking for and results showing up, without you having to hit enter. It saves you time and gets you to your answer as quickly as possible. This is called Google Instant.
The search query travels on average 1,500 miles to get the answer back to you (and may hit different data centers around the world along the way), at a speed that’s close to the speed of light, hundreds of millions of miles per hour.
The Google algorithm looks at the query and uses over 200 signals to decide which of the millions of pages and content are the most relevant answers for that query. What are these signals?
The freshness of the content
The number of other websites linking to a particular site and the authority of those links
Words on the webpage (are they spelled correctly?)
Synonyms of the search keywords
Quality of the content on the site
URL and Title of Webpage
Whether the best result is a webpage, image, video, news article, etc.
Personalization
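One simple way to picture combining such signals is a weighted sum per page. The signal names and weights below are purely hypothetical; Google’s 200+ signals and their weights are secret:

```python
# Hypothetical weights for a handful of the signals listed above.
WEIGHTS = {"freshness": 0.2, "links": 0.3, "text_match": 0.4, "personal": 0.1}

def score(page_signals):
    """Combine per-page signal values (each scaled to 0-1) into a single
    relevance score as a weighted sum; higher scores rank higher."""
    return sum(WEIGHTS[name] * page_signals.get(name, 0.0)
               for name in WEIGHTS)
```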
Once Google has matched a word in the index, it wants to put the best document first. It chooses the best document using a number of techniques:
- Text analysis: evaluating the documents based on matching words, font size, proximity, and over 100 other factors
- Links & link text: external links are a somewhat independent guide to what's on a page.
- PageRank: a query-independent measure of the quality of both pages and sites. This is an equation that tries to indicate how often a truly random searcher, following links without any thought, would end up at a particular page.
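That “truly random searcher” idea can be written as a short power-iteration sketch. This is the textbook simplification; Google’s production formula includes many refinements that are not public:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration for PageRank: the long-run probability that a
    random surfer, who follows a link with probability `damping` and
    otherwise jumps to a random page, is on each page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] = new.get(q, 0.0) + share
            else:  # dangling page: spread its rank over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

On a tiny graph where two pages both link to a third, that third page ends up with the highest rank — the random searcher lands there most often, which is exactly the intuition behind links acting as votes of authority.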
Google has a "wall" between the search relevance ranking and advertising: no one can pay to be the top listing in the results, but the sponsored link spots are available.