1. Building Better Search for Wikipedia:
How We Did It Using Amazon
CloudSearch
July 26, 2012
© 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
2. Speakers
Paul Nelson, CTO, Search Technologies
Michael Bohlig, Marketing Manager, Amazon CloudSearch
Jon Handler, Solutions Architect, Amazon CloudSearch
3. Housekeeping items
! Polling questions
! Q&A will be at the end
! Recording and slides will be distributed and posted
(Slideshare & YouTube)
4. Agenda
! Amazon CloudSearch Overview
! Data Acquisition – Getting the Files from Wikipedia
! Data Processing – Clean-up and Preparation
! Indexing
! Queries and Relevancy Ranking
! Building the UI
! Final Results & Recommendations
! Q&A
5. Amazon CloudSearch
! Fully-managed, full-featured search service
! Automatically scales for data & traffic
! Handles both structured and unstructured data
! Near real-time indexing
! Up and running in less than 1 hour
6. Polling Question #1
What Are You Using For Search Today?
7. Introduction
SEARCHING WIKIPEDIA
8. Why Wikipedia?
! It’s awesome
! Default Wikipedia search is pretty bad &
everyone knows it
! It’s publicly available data
! It’s awesome
9. Why CloudSearch for Wikipedia?
! It’s awesome
! A great choice for a public search engine –
it lives on the Internet
! First version up & running quickly
! Automatically scales to required query volume
! Rank expressions work great for Wikipedia relevancy
! Easy Search Domain Creation = Easy system iteration
10. Let’s try it!
http://wikipedia.searchtechnologies.com
11. Getting the Files from Wikipedia
DATA ACQUISITION
12. Wikipedia Dump Files
http://dumps.wikimedia.org/enwiki/latest/
! Desired files have the pattern:
enwiki-latest-pages-articles#.xml-*.bz2
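A minimal sketch of how a directory listing might be filtered down to the desired files. The regular expression is one reading of the pattern above ("#" as a chunk number, "*" as any suffix), and the sample file names are illustrative, not copied from a real listing.

```python
import re

# Assumed reading of "enwiki-latest-pages-articles#.xml-*.bz2":
# '#' stands for one or more digits, '*' for any suffix.
DUMP_FILE_RE = re.compile(r"enwiki-latest-pages-articles\d+\.xml-.*\.bz2")


def select_article_dumps(file_names):
    """Return only the file names that match the article dump pattern."""
    return [name for name in file_names if DUMP_FILE_RE.fullmatch(name)]


# Illustrative listing entries (hypothetical names).
listing = [
    "enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2",
    "enwiki-latest-pages-articles27.xml-p029625017p035314000.bz2",
    "enwiki-latest-pages-articles.xml.bz2",  # single-file dump, no chunk number
    "enwiki-latest-abstract.xml",            # not an article dump
]
selected = select_article_dumps(listing)
```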
13. Our Solution
Wikipedia dump files → Content Processing Framework → Amazon CloudSearch
Content Processing Framework steps:
Fetch Files Listing → Identify Article Files to Fetch → Open File Stream → Processing → Send to CloudSearch
14. Content Processing Framework Advantages
! Process multiple files simultaneously
! Fully Streaming
• Files are never downloaded to local disk
• From Wikipedia → Streaming Processor → CloudSearch
! Very Fast (450 documents per second, end-to-end)
! Integrated Connectors / Web Crawlers
• SharePoint, Documentum, Web Sites, RDBMS, RightNow,
Confluence, Salesforce.com, etc.
! Text extraction (from PDF, Office Docs, etc.)
• Using Apache Tika
! Entity Extraction
• Names, places, companies, dates, phone numbers, zip codes, etc.
15. Polling Question #2
Where is your data stored?
16. Preparing the Data for Search
DATA PROCESSING
17. What Do Wikipedia Files Look Like?
Sample Wikipedia Data
18. Data Processing: Basic Requirements
! Decompression: BZip2 → UTF-8
! Process each page as a separate CloudSearch
document
• Multiple pages specified in a single XML file
! Skip #REDIRECT pages
! Compute document statistics
• Necessary for relevancy ranking
• Includes: Content size, title size, number of outbound links
• (FUTURE: Number of inbound links)
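The requirements above can be sketched with the Python standard library. This is not the Aspire framework used in the webinar, only an illustration under stated assumptions: pages are `<page>` elements with `<title>` and `<text>` children, an outbound link is anything of the form `[[...]]`, and the tiny sample dump is made up.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET

# Assumption: an outbound link is a wiki link of the form [[target]].
LINK_RE = re.compile(r"\[\[[^\]]+\]\]")


def local_name(tag):
    """Strip an XML namespace such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(compressed_stream):
    """Stream-decompress a BZip2 dump and yield one dict per non-redirect page."""
    with bz2.open(compressed_stream, "rb") as xml_stream:
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            elem.clear()  # release the page subtree once it has been read
            if text.lstrip().upper().startswith("#REDIRECT"):
                continue  # skip redirect pages
            yield {
                "title": title,
                "content": text,
                "title_size": len(title),
                "content_size": len(text),
                "outbound_links": len(LINK_RE.findall(text)),
            }


# Made-up two-page dump: one article and one redirect.
SAMPLE = (
    b"<mediawiki>"
    b"<page><title>Germany</title><revision><text>Germany is a country in "
    b"[[Europe]]. Its capital is [[Berlin]].</text></revision></page>"
    b"<page><title>Deutschland</title><revision><text>#REDIRECT [[Germany]]"
    b"</text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(io.BytesIO(bz2.compress(SAMPLE))))
```

Because `iter_pages` reads from any file-like object, the same function could be handed a network response stream, which is in the spirit of the "never downloaded to local disk" point made earlier.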
19. Data Processing: Advanced Feature Support
! Extract Categories
! Extract Author (IP address or author name)
! Extract Update Date
! Extract Document Type
• Wikipedia “name space” based on title prefix
! Determine Disambiguation Pages
• Based on certain Wikipedia {{templates}}
• Template whitelist and blacklist
! Produce Static Teaser
Before After
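Two of these features are simple enough to sketch. The namespace list is an assumed subset (the slides do not say which prefixes were used), "Article" as the default type follows the sample SDF shown later in the deck, and the 200-character teaser length is a hypothetical value.

```python
# Assumed subset of Wikipedia namespaces; not taken from the slides.
NAMESPACES = ("Wikipedia", "Category", "Template", "Portal", "File", "Help")


def document_type(title):
    """Derive the document type from the title prefix, e.g. 'Wikipedia:...'."""
    prefix, sep, _rest = title.partition(":")
    if sep and prefix in NAMESPACES:
        return prefix
    return "Article"


def static_teaser(text, max_chars=200):
    """Collapse whitespace and cut the text at a word boundary."""
    flat = " ".join(text.split())
    if len(flat) <= max_chars:
        return flat
    return flat[:max_chars].rsplit(" ", 1)[0] + "…"
```

A title such as "2001: A Space Odyssey" contains a colon but no known namespace prefix, so it still comes out as an Article.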
20. Sending Documents to CloudSearch
INDEXING
21. CloudSearch: Document ID
! @id = uniquely identifies every document in the index
• Must be made up of letters and digits (no spaces or punctuation)
<batch>
<add lang="en" version="5438086" id="wikipedia930503">
. . . FIELDS GO HERE . . .
</add>
</batch>
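A small sketch of building and checking a document ID against the rule stated above (letters and digits only). That the sample ID is a "wikipedia" prefix plus the numeric Wikipedia page ID is an assumption based on its appearance.

```python
import re

# Rule from the slide: letters and digits only, no spaces or punctuation.
ID_RE = re.compile(r"[A-Za-z0-9]+")


def make_document_id(page_id, prefix="wikipedia"):
    """Build a document ID and reject anything that breaks the rule."""
    doc_id = f"{prefix}{page_id}"
    if not ID_RE.fullmatch(doc_id):
        raise ValueError(f"invalid document id: {doc_id!r}")
    return doc_id
```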
22. CloudSearch: Document Version
! @version = identifies most recent document
• Integer number, must always increase
• Updates or deletes to same doc ID must have larger @version
• My Formula: (System.currentTimeMillis() - 1325394000000)/1000
! Why does it exist?
• So that multiple processes can submit updates simultaneously
• Updates processed quickly are not overwritten by older updates
processed slowly
<batch>
<add lang="en" version="5438086" id="wikipedia930503">
. . . FIELDS GO HERE . . .
</add>
</batch>
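The version formula above, rendered in Python as a sketch. The constant 1325394000000 ms corresponds to 2012-01-01 05:00:00 UTC, so the version is the number of whole seconds elapsed since then. The `now_ms` parameter is added here only to make the function testable.

```python
import time

EPOCH_OFFSET_MS = 1325394000000  # constant from the slide (2012-01-01 05:00:00 UTC)


def document_version(now_ms=None):
    """Seconds elapsed since the offset; grows as wall-clock time advances."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Floor division matches Java's integer division for positive values.
    return (now_ms - EPOCH_OFFSET_MS) // 1000
```

One consequence of the one-second resolution: two updates to the same document within the same second receive the same version number.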
23. CloudSearch Indexing Details
! Form Fields into CloudSearch SDF
! Submit in batches to CloudSearch
! Multiple open connections to CloudSearch
! Co-locate indexer on EC2 instance in same zone as
CloudSearch
• Several times better performance
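Batching and multiple open connections can be sketched as follows. The `post_batch` callable is hypothetical and stands in for whatever code actually uploads one batch; the batch size and connection count are placeholder values, not figures from the webinar.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def submit_in_batches(docs, post_batch, batch_size=500, connections=4):
    """Send batches through a pool of workers, one connection per worker.

    Note: Executor.map queues every batch up front, so a very large input
    would need a bounded queue instead to stay fully streaming.
    """
    with ThreadPoolExecutor(max_workers=connections) as pool:
        return list(pool.map(post_batch, chunked(docs, batch_size)))


# Stand-in uploader that just reports the batch size.
results = submit_in_batches(range(1050), len, batch_size=500, connections=4)
```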
24. CloudSearch SDF for Indexing
<batch>
<add lang="en" version="5438086" id="wikipedia930503">
<field name="title">Terran Trade Authority</field>
<field name="title_size">22</field>
<field name="content">
The 'Terran Trade Authority' is a science-fiction setting originally presented in a collection of four
large illustrated science…
</field>
<field name="content_size">893</field>
<field name="teaser"> The 'Terran Trade Authority' is a science-fiction setting originally presented
in a collection of four large illustrated science fiction books published between 1978 and…
</field>
<field name="url">http://en.wikipedia.org/wiki/Terran_Trade_Authority</field>
<field name="type">Article</field>
<field name="f_type">Article</field>
<field name="year">2012</field>
<field name="f_year">2012</field>
<field name="year_month">2012/01</field>
<field name="f_year_month">2012/01</field>
<field name="categories">Science fiction book series</field>
<field name="f_categories">Science fiction book series</field>
<field name="author">76.173.50.22</field>
<field name="f_author">76.173.50.22</field>
</add>
</batch>
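A sketch of producing a batch like the one above with Python's standard XML library, which takes care of escaping characters such as "&" and "<" in field values. The input dictionary layout (`id`, `version`, `fields`) is a hypothetical convention chosen for this example.

```python
import xml.etree.ElementTree as ET


def build_sdf_batch(docs):
    """Serialize documents into a <batch> of <add> elements."""
    batch = ET.Element("batch")
    for doc in docs:
        add = ET.SubElement(
            batch,
            "add",
            {"lang": "en", "version": str(doc["version"]), "id": doc["id"]},
        )
        for name, value in doc["fields"].items():
            field = ET.SubElement(add, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(batch, encoding="unicode")


sdf = build_sdf_batch([
    {
        "id": "wikipedia930503",
        "version": 5438086,
        "fields": {
            "title": "Terran Trade Authority",
            "title_size": 22,
            "type": "Article",
            "f_type": "Article",
        },
    }
])
```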
25. End-to-End Indexing Dataflow
Process Latest Listing Pipeline
Start → Fetch dumps.wikimedia.org/enwiki/latest/ (XHTML page) → Extract URLs (Groovy Script) → 27 URLs to 27 dump files
Process File Pipeline (sub job per URL)
URL → Open Stream (compressed data stream) → BZip2 Decompress (decompressed data stream) → XML Extractor → single <page> XML plus metadata
Process Page Pipeline
Extract Metadata and Cleanse Content (Groovy Script) → cleansed metadata → XSL Transform → Post XML → Amazon CloudSearch
26. Providing good search results
QUERIES AND RELEVANCY
27. Recommendation: Debug Interface
! Useful tool for testing CloudSearch query behavior
Sample Debug Interface
28. Queries for Wikipedia
! Uses simple “q” parameter for user query string
! Selecting facets uses “bq” parameter
• For filtering a facet value: bq=(field name 'value')
• For excluding a facet value: bq=(not name:'value')
• Can handle AND & OR
• Don’t forget to escape single-quotes
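A sketch of assembling the "q" and "bq" parameters described above. Two points are assumptions rather than statements from the slides: that a single quote inside a value is escaped with a backslash, and that several filters are combined with a prefix `(and ...)` clause. The helper names are invented for this example.

```python
from urllib.parse import parse_qs, urlencode


def quote_value(value):
    """Wrap a value in single quotes (assumed escape: backslash before ')."""
    return "'" + value.replace("'", "\\'") + "'"


def include(field, value):
    """Filter to a facet value, using the slide's (field name 'value') form."""
    return f"(field {field} {quote_value(value)})"


def exclude(field, value):
    """Exclude a facet value, using the slide's (not name:'value') form."""
    return f"(not {field}:{quote_value(value)})"


def search_params(user_query, *filters):
    """Build the URL-encoded query string from the user query and filters."""
    params = {"q": user_query}
    if len(filters) == 1:
        params["bq"] = filters[0]
    elif filters:
        params["bq"] = "(and " + " ".join(filters) + ")"  # assumed AND syntax
    return urlencode(params)


query_string = search_params(
    "germany", include("f_type", "Article"), exclude("f_year", "2012")
)
```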
29. Relevancy Ranking
! In CloudSearch, this is done with Rank Expressions
• Affect relevancy using document-quality data, such as:
• Document Statistics
• Ratings
• Link Counting
• Editorial Comments
• Popularity
! Expressions are very flexible
• All types of mathematical functions available
30. Relevancy Ranking for Wikipedia
Title                                   Content size   clog    cboost   Title size   tlog    tboost    Text relevance   FINAL
Germany                                 65253          4.815   192.58   7            0.845   -12.676   572              751.90
Outline of Germany                      14238          4.153   166.14   18           1.255   -18.829   601              748.30
History of Germany                      74750          4.874   194.94   30           1.477   -22.157   574              746.78
British Army Germany rugby union team   2201           3.343   133.70   37           1.568   -23.523   589              699.18
New Germany                             337            2.528   101.11   11           1.041   -15.621   598              683.48
Embassy of Germany in Moscow            516            2.713   108.51   28           1.447   -21.707   596              682.79
RANK_EXPRESSION =
text_relevance + log10(content_size)*40.0 - log10(title_size)*15.0
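The rank expression can be checked offline against the table. The sketch below recomputes the final score from each row's text relevance, content size, and title size; it only mirrors the arithmetic and is not how CloudSearch evaluates expressions internally.

```python
import math


def rank(text_relevance, content_size, title_size):
    """text_relevance + log10(content_size)*40.0 - log10(title_size)*15.0"""
    return (
        text_relevance
        + math.log10(content_size) * 40.0
        - math.log10(title_size) * 15.0
    )


# (title, text relevance, content size, title size) taken from the table.
rows = [
    ("Germany", 572, 65253, 7),
    ("Outline of Germany", 601, 14238, 18),
    ("History of Germany", 574, 74750, 30),
    ("British Army Germany rugby union team", 589, 2201, 37),
    ("New Germany", 598, 337, 11),
    ("Embassy of Germany in Moscow", 596, 516, 28),
]
scores = [(title, rank(tr, cs, ts)) for title, tr, cs, ts in rows]
ranked_titles = [title for title, _ in sorted(scores, key=lambda s: -s[1])]
```

The long "Germany" article has the lowest text relevance of the six (572) but ends up first, because the content-size boost rewards substantial pages and the title-size penalty favors short, exact titles.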
31. Relevancy Ranking for Wikipedia:
De-Weighting “Wikipedia:” Types
! “Wikipedia:” docs not of general interest
• About the running and managing of Wikipedia
! Often very large
• Skews the statistics
RANK_EXPRESSION (adjusted) =
text_relevance
+ log10(content_size) * ( doc_boost == 1 ? 25.0:40.0 )
- log10(title_size)*15.0
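The adjusted expression as a sketch. That `doc_boost == 1` marks a "Wikipedia:" document is inferred from the slide title, and the input numbers are hypothetical, chosen so the arithmetic is easy to follow.

```python
import math


def adjusted_rank(text_relevance, content_size, title_size, doc_boost):
    """Same expression, with a smaller content-size weight when doc_boost == 1."""
    content_weight = 25.0 if doc_boost == 1 else 40.0
    return (
        text_relevance
        + math.log10(content_size) * content_weight
        - math.log10(title_size) * 15.0
    )


# Hypothetical document: relevance 600, 100,000 characters, 10-character title.
regular = adjusted_rank(600, 100_000, 10, doc_boost=0)      # 600 + 5*40 - 15
deweighted = adjusted_rank(600, 100_000, 10, doc_boost=1)   # 600 + 5*25 - 15
```

The penalty grows with document size (15 points per factor of ten in content size), which targets exactly the very large "Wikipedia:" pages that skew the statistics.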
32. Adding the Sizzle
BUILDING THE USER INTERFACE
33. Wikipedia Search UI Architecture
Tomcat
Twigkit
CloudSearch Platform
CloudSearch Java API
CloudSearch
34. UI Architecture
! Tomcat
• Java application container
! Twigkit
• Graphical user interface templates
• Handles navigators, controller events, presentation
! CloudSearch Platform
• API Translation Interface between Twigkit and CloudSearch API
! CloudSearch Java API
• Manages all communications to/from CloudSearch
• Parameter construction / results parsing
35. Let’s Wrap It Up!
SUMMARY
36. Summary – Problems & Solutions
! Problem: Data Acquisition
• Solution: Content Processing Framework (Aspire)
! Problem: Data Processing
• Solution: Content Processing Framework (Aspire)
! Problem: Indexing
• Solution: CloudSearch SDF – Very easy to work with
! Problem: Query
• Solution: CloudSearch Query Parameters & Rank Expressions
! Problem: User Interface
• Solution: New CloudSearch Platform for Twigkit
37. Q&A
Enter questions on your
screen
38. Thank You
For More Information:
http://aws.amazon.com/cloudsearch/
http://www.searchtechnologies.com/wikipedia-cloudsearch.html