Demand, Media, and Search Analytics at AOL
1. Demand, Media, and Search Analytics
Sean Timm
sean.timm@teamaol.com
Twitter: @timmsc
October 4, 2011
2. Introduction
• Who am I?
• What do we use Hadoop for?
• Our best practices
• Lessons learned
• Example applications: related searches and seasonality
3. History
• Originated in Search Backend in 2007
• Create data-driven products for search.aol.com from search logs
• No Netezza experience, so we decided to try Hadoop
• Took 3 weeks to write a simple aggregation
• With Apache Pig 0.3: 2 days
• First product, related searches, launched in 2008
• Search breaking trends product led to further demand work
• Now Pig 0.8.1 and Hadoop 0.20.2
4. Data
Hourly search.aol.com logs
•5 M log lines of data per hour
•Logs include searches, clicks, and other data
•70% of queries are seen only once
Hourly Wikipedia page view data
•public data set http://dammit.lt/wikistats
•7 M pages viewed per hour
•2.7 M English pages per hour
BeacoN logs
•Page view and click logs for AOL HuffingtonPost Media, Patch, and other AOL properties
5. We like Pig!
• Hourly, daily, and monthly search and click aggregation
• Related searches
• Auto complete dictionary
• Mining spelling correction click through
• Temporal pattern analysis
• Classifying adult queries and URLs
• Categorizing queries
• Identifying queries in the form of a question or superlative
• Identifying breaking trends in AOL Search and Wikipedia page views
• Identifying queries of local interest
• Clustering queries using click graph, temporal distance, Carrot2, k-means
• AOL HPMG stats and trends for page views, authors, tags, etc.
6. Pig Process in General
Script run times range from under 2 minutes to over 2 hours
Ad hoc…wild west
Complex shell scripts
1. load/copy/backup data
2. Launch multiple Pig scripts, some in parallel and some with serial dependencies
3. Check for errors; on failure, e-mail and halt
4. Load data into MySQL, Vertica, or Solr
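The four steps above could be sketched roughly as follows. This is a minimal illustration in Python rather than the shell scripts the slide describes; `run_pipeline`, `run_command`, and `notify_stderr` are hypothetical names, and `notify` is only a stand-in for sending e-mail.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd):
    """Run one command (an argument list) and return (cmd, exit_code)."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return cmd, result.returncode
    except OSError:
        # The executable could not be started at all; treat it as a failure.
        return cmd, -1


def notify_stderr(message):
    """Stand-in for the e-mail step: just print to stderr."""
    print(message, file=sys.stderr)


def run_pipeline(stages, notify=notify_stderr):
    """Run stages one after another; commands within a stage run in parallel.

    Returns True if every command exited with 0. At the first stage that
    contains a failure, calls notify() and halts without running later stages.
    """
    for stage in stages:
        with ThreadPoolExecutor(max_workers=max(1, len(stage))) as pool:
            results = list(pool.map(run_command, stage))
        failed = [cmd for cmd, code in results if code != 0]
        if failed:
            notify("pipeline halted; failed commands: %r" % (failed,))
            return False
    return True


# Hypothetical layout (script names are made up for illustration):
# stages = [
#     [["pig", "hourly_aggregate.pig"], ["pig", "click_aggregate.pig"]],  # parallel
#     [["pig", "related_searches.pig"]],  # runs only after the stage above
# ]
```

Grouping commands into ordered stages is one simple way to express "some in parallel, some with serial dependencies"; a real workflow might need a finer-grained dependency graph.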
7. Getting Data out of Hadoop
First approach: special StoreFunc to write directly to MySQL/Solr
•Network: required the master to be on the same network as the cluster
•Speculative execution: data could be written more than once, increasing contention and causing unnecessary writes
•Replication: writes reached the master in parallel, but serial replication was slow (MySQL)
•Timeouts: occasionally a task failed and restarted (Solr)
8. Getting Data out of Hadoop
MySQL/Vertica Now
•Write data to HDFS
•Copy from HDFS to local file system using CLI
•Load into database: LOAD DATA LOCAL INFILE from mysql client
Solr Now
•Custom StoreFunc writes Solr XML to HDFS
•Starting with Pig 0.7, fields are named using the Pig schema
•Copy from HDFS to local file system using CLI
•Load into Solr using remote streaming
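As a rough illustration of the Solr path, the sketch below renders records as a Solr XML `<add>` update message, with field names taken from the record keys in the way the slide describes taking them from the Pig schema. It is plain Python, not the custom StoreFunc itself; `to_solr_add_xml` and the field names `query` and `count` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def to_solr_add_xml(records):
    """Render an iterable of dict records as one Solr XML <add> message.

    Each dict key becomes a <field name="..."> element; None values are
    skipped so that missing fields are simply not sent.
    """
    lines = ["<add>"]
    for record in records:
        lines.append("  <doc>")
        for name, value in record.items():
            if value is None:
                continue
            # quoteattr() adds the surrounding quotes and escapes the name;
            # escape() handles &, <, and > in the field value.
            lines.append(
                "    <field name=%s>%s</field>"
                % (quoteattr(name), escape(str(value)))
            )
        lines.append("  </doc>")
    lines.append("</add>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_solr_add_xml([{"query": "the eagles", "count": 42}]))
```

Writing the XML to a file first, rather than posting to Solr from inside each task, keeps task retries from sending the same documents twice.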
9. UDFs
• Use Piggy Bank and builtins when possible
• 89 custom UDFs packaged in a single jar
• Most are simple: validate a URL, URL-decode a string, calculate a hash value, date math, etc.
• Some are complex: spell check/correct, LOESS regression, Carrot2 clustering, FFT, Euclidean distance, etc.
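To give a feel for the kind of logic these UDFs contain, here is a small sketch of three of the listed operations as plain Python functions. The function names are hypothetical and this is not the code from the jar; wiring such functions into Pig is left out.

```python
import math
from urllib.parse import unquote_plus, urlparse


def url_decode(text):
    """Decode percent-escapes and '+' signs, as found in logged query strings."""
    return unquote_plus(text)


def is_valid_url(text):
    """Accept only http(s) URLs that have a host part (a deliberately loose check)."""
    try:
        parts = urlparse(text)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. broken IPv6 brackets.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```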
10. Lessons learned
• Many small categorization scripts are better combined into a single larger one
• Set priority on large, time-sensitive jobs that fight for resources with other jobs
• Fair scheduler
• Tuning the cluster for maps or reduces
• Don't write copious debug output
• Use an appropriate number of reducers (PARALLEL)
13. Process Flow
• Filter and clean data
• Block adult terms, long queries, non-alpha, second+ pages, operators, URL-like queries, search spam
• Lower case
• Join to get query-related query groups
• Contextual spell correct within group
• Cluster related queries and pick the best from each group
• Load into Solr
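The filter-and-clean step at the top of this flow could look something like the sketch below. Every specific here is an assumption made for illustration: the 60-character cutoff, the reading of "non-alpha" as "contains no letters", the patterns used to detect operators and URL-like queries, and the name `clean_query`. Search-spam detection is left out.

```python
import re

MAX_QUERY_CHARS = 60  # hypothetical cutoff for "long queries"

# Assumed patterns; a production filter would likely be broader.
URL_LIKE = re.compile(r"^(?:https?://|www\.)|\.(?:com|net|org)(?:/|$)")
OPERATOR = re.compile(r'"|(?:^|\s)[+-]\S|\b[a-z]+:\S')


def clean_query(raw, page=1, blocked_terms=frozenset()):
    """Return the normalized query, or None if it should be filtered out."""
    if page > 1:
        return None  # second and later result pages
    query = " ".join(raw.split()).lower()  # collapse whitespace, lower case
    if not query or len(query) > MAX_QUERY_CHARS:
        return None
    if not any(ch.isalpha() for ch in query):
        return None  # no letters at all
    if URL_LIKE.search(query) or OPERATOR.search(query):
        return None
    if any(term in blocked_terms for term in query.split()):
        return None  # blocked (e.g. adult) terms
    return query
```

For example, `clean_query("  The   Eagles ")` returns `"the eagles"`, while `clean_query("www.aol.com")` returns `None`.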
14. Related Searches Graph “The Eagles”
[Figure: graph of related searches for “The Eagles”, with nodes labeled Hotel California, The band, NFL, Tribute, and Boston College]
15. Classification
• Supervised learning
• Provide a categorized set of queries and/or URLs
• Calculate a score based on the edge weights
• If the score exceeds a specified threshold, the query or URL is tagged with the category
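One way to read the scoring rule above is sketched below. The slides do not give the formula, so this is an assumption: the graph is a dict of weighted edges (for instance query to clicked URL, weighted by click count), and a node's score is the share of its edge weight that points at seed nodes already labeled with the category. The names `category_score` and `classify`, the 0.5 threshold, and the example data are all hypothetical.

```python
def category_score(node, edges, seeds):
    """Fraction of a node's total edge weight that goes to labeled seed nodes."""
    neighbors = edges.get(node, {})
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    labeled = sum(w for neighbor, w in neighbors.items() if neighbor in seeds)
    return labeled / total


def classify(edges, seeds, threshold=0.5):
    """Return the unlabeled nodes whose score exceeds the threshold."""
    return {
        node
        for node in edges
        if node not in seeds and category_score(node, edges, seeds) > threshold
    }


if __name__ == "__main__":
    # Made-up click graph: query -> {url: click count}
    edges = {
        "eagles tickets": {"nfl.example/eagles": 8, "band.example/eagles": 2},
        "hotel california lyrics": {"band.example/eagles": 5},
    }
    sports_seeds = {"nfl.example/eagles"}
    print(classify(edges, sports_seeds))
```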
16. Applications Outside of Search
• Author/citation bipartite graph
• Social network graphs
• User/Page view graphs