The document discusses the inner workings of the Google search engine. It begins with facts about Google's founding and history. It then explains the basic components of how any search engine works, including web crawlers that index pages, and how keywords are matched to search results. The bulk of the document focuses on Google's specific architecture, including its web crawler called Googlebot, its indexer that catalogs words in a database, and its query processor that matches searches to relevant pages based on factors like PageRank. It also discusses related topics like search engine optimization techniques and using "Google digging" to refine searches.
2. 2
TOPICS TO BE COVERED
Facts About Google
How A Search Engine Works
** Types Of Search Engine
How Google Works
** Google Architecture
** Google Web Crawler
** Google Indexer
** Google Query Processor
Google Working Infographic
What Is SEO
** SEO Techniques
What Is Google Digging
** Methods Of Google Digging
Technology Requirements Of Creating Search Engine
3. FACTS ABOUT GOOGLE
3
• Google was founded by Larry Page and Sergey Brin while they were Ph.D.
students at Stanford University.
• Founded on 4 September 1998.
• Google processes approximately 20 petabytes of user-generated data every
day. (A petabyte is 10^15 bytes.)
• In June 2006, the Oxford English Dictionary (OED) added “Google” as a
verb
• A Google employee is called a “Googler”, while a new team member is
called a “Noogler”.
4. 4
• The name ‘Google’ was an accident. A spelling
mistake made by the original founders who
thought they were going for ‘Googol’
• The prime reason the Google home page is so
bare is that the founders didn’t know HTML and
just wanted a quick interface. In fact, the submit
button was a long time coming; at first, hitting
the RETURN key was the only way to make
Google spring to life.
• Google has the largest network of translators in
the world
• On average, Google has acquired more than one
company every week since 2010.
5. 5
• Google might be the only company with the
explicit goal to REDUCE the amount of time
people spend on its site.
• The world watches 450,000 years of YouTube
videos each month, over twice as long as modern
humans have existed.
• Google has photographed 5 million miles of road
for its Street View maps
• Google.com, home to arguably the world's most
important internet company, contains 23 markup
errors in its code.
6. HOW A SEARCH ENGINE WORKS
6
A program that searches for and identifies items in a database that
correspond to keywords or characters specified by the user, used
especially for finding particular sites on the Internet.
Or simply
A search engine is a database system designed to index and categorize
internet addresses, otherwise known as URLs.
FACTS ABOUT SEARCH ENGINES
Search Engine Popularity
The most popular search engines on the
web:
Google 55.2%
Yahoo 21.7%
MSN Search 9.6%
AOL Search 3.8%
Terra Lycos 2.6%
AltaVista 2.2%
AskJeeves 1.5%
7. 7
Number of Words Used in Search Phrases
2-word phrases 32.58%
3-word phrases 25.61%
1-word phrases 19.02%
4-word phrases 12.83%
5-word phrases 5.64%
6-word phrases 2.32%
7-word phrases 0.98%
When People Search
The breakdown of surfer traffic by day of the week:
Monday 15.31%
Tuesday 15.23%
Thursday 14.73%
Wednesday 14.62%
Friday 14.48%
Saturday 13.08%
Sunday 12.55%
Screen Resolutions
The most popular screen resolutions on the web:
1024 x 768 48.3%
800 x 600 31.7%
1280 x 1024 13.6%
1152 x 864 4.0%
640 x 480 1.0%
1600 x 1200 1.0%
1152 x 870 0.2%
8. TYPES OF SEARCH ENGINES
8
Automatic:
These search engines are based on information that is
collected, sorted and analyzed by software programs,
commonly referred to as "robots", "spiders", or "crawlers".
These spiders crawl through web pages collecting information
which is then analyzed and categorized into an "index". When
you conduct a search using one of these search engines, you
are really searching the index. The results of the search will
depend on the contents of that index and its relevancy to your
query.
9. 9
Directories:
A directory is a searchable subject guide of Web sites
that have been reviewed and compiled by human
editors. These editors decide which sites to list and
in which categories.
Meta:
Meta search engines do not maintain indexes of their
own; they forward a query to several other search
engines and deliver a combined summary of those
results to the end user.
Pay-per-click (PPC):
A search engine that determines ranking according to
the dollar amount you pay for each click from that
search engine to your site. Examples of PPC search
engines are Overture.com and FindWhat.com. The
highest ranking goes to the highest bidder.
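The "highest bidder ranks first" rule can be sketched in a few lines. This is a minimal illustration, not any real engine's auction logic, and the advertiser names and bid amounts are made up:

```python
# Toy pay-per-click ranking: results are ordered purely by the
# per-click bid, highest bidder first. (All data is illustrative.)
bids = {
    "shoes-r-us.example": 0.45,       # dollars per click
    "megashoes.example": 1.10,
    "discount-shoes.example": 0.80,
}

# Sort site names by their bid, descending.
ranking = sorted(bids, key=bids.get, reverse=True)
```

Real PPC systems (then and now) layer quality and relevance signals on top of the raw bid, but the bid-ordered list above is the core idea the slide describes.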
10. HOW GOOGLE WORKS
10
Google runs on a distributed network of thousands of low-cost computers
and can therefore carry out fast parallel processing. Parallel processing is a
method of computation in which many calculations can be performed
simultaneously, significantly speeding up data processing. Google has three
distinct parts:
Googlebot, a web crawler that finds and fetches web pages.
The indexer that sorts every word on every page and stores the resulting
index of words in a huge database.
The query processor, which compares your search query to the index and
recommends the documents that it considers most relevant.
12. 12
Googlebot, Google’s Web Crawler
Googlebot is Google’s web crawling robot, which finds and retrieves pages
on the web and hands them off to the Google indexer. It’s easy to imagine
Googlebot as a little spider scurrying across the strands of cyberspace, but
in reality Googlebot doesn’t traverse the web at all. It functions much like
your web browser, by sending a request to a web server for a web page,
downloading the entire page, then handing it off to Google’s indexer.
Googlebot consists of many computers requesting and fetching pages
much more quickly than you can with your web browser. In fact,
Googlebot can request thousands of different pages simultaneously. To
avoid overwhelming web servers, or crowding out requests from human
users, Googlebot deliberately makes requests of each individual web
server more slowly than it’s capable of doing.
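The two behaviours described here (fetching pages like a browser does, while deliberately throttling requests to any one server) can be sketched as a tiny crawler frontier. The class and method names are illustrative, not Googlebot's; fetching itself is left out:

```python
# Minimal sketch of a polite crawler frontier: URLs are queued, and a
# URL is only handed out when its host's cool-down period has passed,
# so no single web server is overwhelmed.
import time
import urllib.parse

class PoliteCrawler:
    def __init__(self, delay_per_host=1.0):
        self.delay_per_host = delay_per_host  # seconds between hits to one host
        self.last_hit = {}                    # host -> time of last request
        self.frontier = []                    # URLs waiting to be fetched
        self.seen = set()                     # avoid re-queueing known URLs

    def add_url(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.frontier.append(url)

    def next_url(self):
        """Return the first queued URL whose host is ready to be hit again."""
        now = time.monotonic()
        for i, url in enumerate(self.frontier):
            host = urllib.parse.urlsplit(url).netloc
            if now - self.last_hit.get(host, 0.0) >= self.delay_per_host:
                self.last_hit[host] = now
                return self.frontier.pop(i)
        return None  # every queued host is still cooling down
```

A real crawler would run many such workers in parallel and also honour each site's robots.txt; this sketch only shows the per-host rate limiting the text describes.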
13. 13
Google’s Indexer
Googlebot gives the indexer the full text of the pages it finds.
These pages are stored in Google’s index database. This index is
sorted alphabetically by search term, with each index entry
storing a list of documents in which the term appears and the
location within the text where it occurs. This data structure
allows rapid access to documents that contain user query terms.
To improve search performance, Google ignores (doesn’t index)
common words called stop words (such as the, is, on, or, of,
how, why, as well as certain single digits and single letters). Stop
words are so common that they do little to narrow a search, and
therefore they can safely be discarded. The indexer also ignores
some punctuation and multiple spaces, as well as converting all
letters to lowercase, to improve Google’s performance.
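The index structure described above (a term-sorted mapping from each word to the documents and positions where it occurs, with stop words dropped and everything lowercased) is an inverted index. A toy version, with a deliberately tiny stop-word list chosen for illustration:

```python
# Build a small inverted index: {term: {doc_id: [word positions]}}.
# Stop words are skipped and text is lowercased, as the slide describes.
import re

STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "as"}

def build_index(docs):
    """docs: {doc_id: text} -> inverted index."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            if word in STOP_WORDS:
                continue
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def search(index, term):
    """Return the sorted doc ids that contain the (lowercased) term."""
    return sorted(index.get(term.lower(), {}))
```

Storing positions, not just document ids, is what later lets the query processor reward documents where the search terms appear close together.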
15. 15
Google’s Query Processor
The query processor has several parts, including the user
interface (search box), the “engine” that evaluates queries
and matches them to relevant documents, and the results
formatter.
PageRank is Google’s system for ranking web pages. A page
with a higher PageRank is deemed more important and is
more likely to be listed above a page with a lower
PageRank.
Google considers more than a hundred factors in
determining which documents are most relevant to a
query, including the page’s PageRank, the position and
size of the search terms within the page, and the
proximity of the search terms to one another on the
page. A patent application discusses other factors that
Google considers when ranking a page.
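The link-based part of this ranking, PageRank, can be computed for a small graph with the standard power-iteration method. This is a textbook sketch of the published algorithm, not Google's production code, and the damping factor 0.85 is the conventional choice:

```python
# Bare-bones PageRank by power iteration over {page: [pages it links to]}.
# A page's rank grows with the ranks of the pages that link to it.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal rank
    for _ in range(iterations):
        # Every page keeps a small "teleport" share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                     # dangling page: spread evenly
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
            else:                                # split rank among outlinks
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny example graph: both a and b link to c, so c ends up ranked highest.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In a real query, a score like this is only one signal combined with the term-position and proximity factors mentioned above.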
18. SEO-Search Engine Optimization
18
Search Engine Optimization is the process of
improving the visibility of a website on organic
(“natural” or unpaid) search engine results
pages (SERPs) by incorporating search-engine-
friendly elements into a website. A successful
search engine optimization campaign will have,
as part of the improvements, carefully selected,
relevant keywords which the on-page
optimization will be designed to make
prominent for search engine algorithms.
Search engine optimization is broken down
into two basic areas: on-page and off-page
optimization.
On-page optimization refers to website
elements which comprise a web page, such as
HTML code, textual content, and images.
Off-page optimization refers predominantly
to backlinks (links pointing to the site being
optimized from other relevant websites).
19. 19
SEO cont.
Various SEO techniques:
Optimize your title tags
Create compelling meta descriptions
Utilize keyword-rich headings
Add ALT tags to your images
Create a sitemap
Build internal links between pages
Update your site regularly
Image optimization
URL optimization
Directory submission
Commenting
Social networking
Guest posting
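One item on the list above, "Create a sitemap", is easy to automate. A minimal sketch that writes an XML sitemap in the sitemaps.org format using only the standard library; the URLs are placeholders:

```python
# Generate a minimal XML sitemap (sitemaps.org format) for a list of
# page URLs, so crawlers can discover every page of the site.
import xml.etree.ElementTree as ET

def make_sitemap(urls):
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

sitemap = make_sitemap(["https://example.com/", "https://example.com/about"])
```

The resulting string would normally be saved as sitemap.xml at the site root and referenced from robots.txt; optional fields like lastmod and changefreq are omitted here for brevity.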
20. GOOGLE DIGGING
20
The art of searching for any content
using Google is called Google
digging, the art of googling, or
sometimes even Google hacking.
Google dorks are search techniques which
can be used to refine our search:
1) intitle: — matches pages whose title contains the given word
2) filetype: — restricts results to a given file type (e.g. filetype:pdf)
3) site: — restricts results to a given site or domain
4) related: — finds sites similar to a given site
5) inurl: — matches pages whose URL contains the given word
22. Technology Requirements Of Creating
Search Engine
22
There are various technologies which can be used to create search engines, web
crawlers, bots, and query indexers.
For the back end:
•ASP.NET
•PHP
•Python
•Perl
•Or your own customized language
For the database:
•MySQL
•Oracle technology
•Any NoSQL database
•Or any customized database
For the front end:
•JavaScript
•XML
•JSON
•Dart, etc.