Making AJAX crawlable by Katharina Probst & Bruce Johnson
1. Making AJAX crawlable
Katharina Probst
Engineer, Google
Bruce Johnson
Engineering Manager, Google
in collaboration with:
Arup Mukherjee, Erik van der Poel, Li Xiao, Google
2. The problem of AJAX for web crawlers
Web crawlers don't always see what the user sees
● JavaScript produces dynamic content that is not seen by crawlers
● Example: A Google Web Toolkit application that looks like this to a user...
...but a web crawler only sees this:
<script src='showcase.js'></script>
3. Why does this problem need to be solved?
● Web 2.0: More content on the web is created dynamically (~69%)
● Over time, this hurts search
● Developers are discouraged from building dynamic apps
● Not solving AJAX crawlability holds back progress on the web!
4. A crawler's view of the web - with and without AJAX
Without AJAX: the browser and the crawler both request www.example.com/mystate
from the web server, so the crawler sees what the user sees.
With AJAX: the browser requests www.example.com/ and JavaScript builds the
#mystate state client-side, so the crawler sees only www.example.com/ and
cannot see #mystate.
5. Goal: crawl and index AJAX
● Crawling and indexing AJAX is needed for users and developers
● Problem: Which AJAX states can be indexed?
○ Explicit opt-in needed by the web server
● Problem: Don't want to cloak
○ Users and search engine crawlers need to see the same content
● Problem: How could the logistics work?
○ That's the remainder of the presentation
6. Possible solutions
● Crawlers execute all the web's JavaScript
○ This is expensive and time-consuming
○ Only major search engines would even be able to do this, and
probably only partially
○ Indexes would be more stale, resulting in worse search results
● Web servers execute their own JavaScript at crawl time
○ Avoids above problems
○ Gives more control to webmasters
○ Can be done automatically
○ Does not require ongoing maintenance
7. Overview of proposed approach - crawl time
Participants: crawler, your web server, headless browser
1. Crawler maps from pretty URL to ugly URL:
www.example.com/page?query&_escaped_fragment_=mystate
2. Crawler requests ugly URL
3. Server maps from ugly URL to pretty URL:
www.example.com/page?query#!mystate
4. Server invokes headless browser
5. Headless browser executes JavaScript and produces an HTML snapshot for the pretty URL
6. Crawler processes the HTML snapshot and extracts pretty URLs
Crawling is enabled by mapping between
● "pretty" URLs: www.example.com/page?query#!mystate
● "ugly" URLs: www.example.com/page?query&_escaped_fragment_=mystate
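The pretty/ugly mapping from steps 1 and 3 can be sketched in a few lines. This is a minimal illustration of the transformation described on this slide, not the full specification (which also covers edge cases in percent-encoding of the fragment value):

```python
from urllib.parse import quote, unquote

def pretty_to_ugly(url: str) -> str:
    """Map a pretty URL (#!state) to its ugly form, as the crawler does."""
    if "#!" not in url:
        return url  # not an AJAX-crawlable URL; leave unchanged
    base, _, state = url.partition("#!")
    # Use '&' if the URL already has a query string, '?' otherwise.
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}_escaped_fragment_={quote(state, safe='')}"

def ugly_to_pretty(url: str) -> str:
    """Inverse mapping, as the web server does before snapshotting."""
    marker = "_escaped_fragment_="
    if marker not in url:
        return url
    base, _, state = url.partition(marker)
    return f"{base.rstrip('&?')}#!{unquote(state)}"

print(pretty_to_ugly("http://www.example.com/page?query#!mystate"))
# -> http://www.example.com/page?query&_escaped_fragment_=mystate
```

Note that the mapping is purely mechanical, so both sides can apply it without any coordination beyond agreeing on the `#!` and `_escaped_fragment_` conventions.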
8. Overview of proposed approach - search time
1. User submits query to the search engine
2. Search engine returns pretty URL:
www.example.com/page?query#!mystate
3. User clicks on pretty URL link
4. Browser loads the pretty URL as usual:
www.example.com/page?query#!mystate
Nothing changes!
9. Agreement between participants
● Web servers agree to
○ opt in by indicating indexable states
○ execute JavaScript for ugly URLs (no user agent sniffing!)
○ not cloak: always give the same content to browser and crawler
regardless of request (or risk removal from the index, as with existing cloaking policies)
● Search engines agree to
○ discover URLs as before (Sitemaps, hyperlinks)
○ modify pretty URLs to ugly URLs
○ index content
○ display pretty URLs
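On the server side, this agreement boils down to branching on the `_escaped_fragment_` query parameter alone. The sketch below is an assumption about one way a server might implement it; `render_snapshot` is a hypothetical stand-in for invoking a headless browser, and the page markup is illustrative:

```python
from urllib.parse import parse_qs, urlparse

def render_snapshot(state: str) -> str:
    # Hypothetical stand-in for steps 4-5 of the crawl-time flow: invoke a
    # headless browser, execute the page's JavaScript for `state`, and
    # serialize the resulting DOM as static HTML.
    return f"<html><body>Rendered content for state '{state}'</body></html>"

def handle_request(url: str) -> str:
    # Decide based only on the _escaped_fragment_ parameter -- never on
    # user-agent sniffing, per the agreement above.
    params = parse_qs(urlparse(url).query, keep_blank_values=True)
    if "_escaped_fragment_" in params:
        return render_snapshot(params["_escaped_fragment_"][0])
    # Normal request: serve the AJAX page unchanged.
    return "<html><body><script src='showcase.js'></script></body></html>"
```

Because the branch depends only on the URL, a browser that happened to request the ugly URL would get the same snapshot the crawler gets, which is what keeps this scheme distinct from cloaking.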
10. Summary: Life of a URL
http://example.com/stocks.html#GOOG
could easily be changed to
http://example.com/stocks.html#!GOOG
which can be crawled as
http://example.com/stocks.html?_escaped_fragment_=GOOG
but will be displayed in the search results as
http://example.com/stocks.html#!GOOG
11. Feedback is welcome
● We are currently working on a proposal and prototype implementation
● Check out the blog post on the Google Webmaster Central Blog:
http://googlewebmastercentral.blogspot.com
● We welcome feedback from the community at the Google Webmaster
Help Forum (link is posted in the blog entry)