Demand, Media, and Search Analytics




Sean Timm
sean.timm@teamaol.com
Twitter: @timmsc
October 4, 2011
Introduction
• Who am I?
• What do we use Hadoop for?
• Our best practices
• Lessons learned
• Example applications: related searches and seasonality




History
• Originated in Search Backend in 2007
• Goal: create data-driven products for search.aol.com from search logs
• No Netezza experience, so we decided to try Hadoop
• The first simple aggregation took 3 weeks to write
• With Apache Pig 0.3: 2 days
• First product, related searches, launched in 2008
• The breaking search trends product led to further demand work
• Now Pig 0.8.1 and Hadoop 0.20.2




Data
Hourly search.aol.com logs
• 5 M log lines per hour
• Logs include searches, clicks, and other data
• 70% of queries are seen only once


Hourly Wikipedia page view data
• Public data set: http://dammit.lt/wikistats
• 7 M pages viewed per hour
• 2.7 M English pages per hour


BeacoN logs
• Page view and click logs for AOL HuffingtonPost Media, Patch, and other AOL properties


We like Pig!
•   Hourly, daily, and monthly search and click aggregation
•   Related searches
•   Auto complete dictionary
•   Mining spelling-correction click-throughs
•   Temporal pattern analysis
•   Classifying adult queries and URLs
•   Categorizing queries
•   Identifying queries in the form of a question or superlative
•   Identifying breaking trends in AOL Search and Wikipedia page views
•   Identifying queries of local interest
•   Clustering queries using click graph, temporal distance, Carrot2, k-means
•   AOL HPMG stats and trends for page views, authors, tags, etc.



Pig Process in General
Script run times range from under 2 minutes to over 2 hours


Ad hoc…wild west


Complex shell scripts
1. Load, copy, and back up data
2. Launch multiple Pig scripts, some in parallel and some with serial dependencies
3. Check for errors; on failure, e-mail and halt
4. Load data into MySQL, Vertica, or Solr
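
The four steps above were shell scripts in the original setup. As a rough illustration only, the same control flow (parallel scripts within a stage, serial dependencies between stages, halt and notify on error) might be sketched in Python as follows. The `notify` callback stands in for the e-mail step, and the commands are placeholders so the sketch runs without Pig installed; the script name in the comment is hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd):
    """Run one command and return (cmd, exit code, stderr text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return cmd, result.returncode, result.stderr


def run_stage(commands):
    """Run the commands of one stage in parallel; return the list of failures."""
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        results = list(pool.map(run_command, commands))
    return [(cmd, code, err) for cmd, code, err in results if code != 0]


def run_pipeline(stages, notify=print):
    """Run stages serially and stop at the first stage with a failure.

    Returns the number of stages that completed successfully.
    """
    for index, commands in enumerate(stages):
        failures = run_stage(commands)
        if failures:
            for cmd, code, err in failures:
                notify(f"stage {index} failed: {cmd} exited with {code}: {err.strip()}")
            return index  # halt: later stages depend on this one
    return len(stages)


# Stand-in commands so the sketch runs anywhere. A real stage would hold
# entries such as ["pig", "-f", "hourly_aggregation.pig"] (hypothetical name).
ok = [sys.executable, "-c", "print('done')"]
bad = [sys.executable, "-c", "import sys; sys.exit(3)"]

completed = run_pipeline([[ok, ok], [ok, bad], [ok]], notify=lambda msg: None)
```

Here the second stage contains a failing command, so the third stage never starts and `completed` counts only the first stage.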

Getting data out of Hadoop
First approach: a special StoreFunc that wrote directly to MySQL/Solr
• Network: required the master to be on the same network as the cluster
• Speculative execution: data could be written more than once, increasing contention and causing unnecessary writes
• Replication: writes hit the master in parallel, but serial replication was slow (MySQL)
• Timeouts: occasionally a task failed and restarted (Solr)




Getting Data out of Hadoop
MySQL/Vertica now
• Write data to HDFS
• Copy from HDFS to the local file system using the CLI
• Load into the database: LOAD DATA LOCAL INFILE from the mysql client
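
A minimal sketch of the MySQL path, assuming the Pig output is tab-delimited and that the copy step uses `hadoop fs -getmerge` (the slides only say "CLI", so the exact subcommand is an assumption). The function only builds the two command lines; paths, database, and table names are hypothetical.

```python
def export_commands(hdfs_dir, local_file, database, table):
    """Build the two commands: merge HDFS part files locally, then bulk load."""
    # Assumption: -getmerge concatenates the part-* files into one local file.
    copy_cmd = ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]
    load_sql = (
        f"LOAD DATA LOCAL INFILE '{local_file}' "
        f"INTO TABLE {table} FIELDS TERMINATED BY '\\t'"
    )
    # --local-infile=1 enables LOCAL loading in the mysql client.
    load_cmd = ["mysql", "--local-infile=1", database, "-e", load_sql]
    return copy_cmd, load_cmd


copy_cmd, load_cmd = export_commands(
    "/data/related_searches", "/tmp/related.tsv", "search", "related_searches"
)
```

Each list can be passed to `subprocess.run(..., check=True)` so that a failed copy stops the load from running.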

Solr now
• Custom StoreFunc writes Solr XML to HDFS
• Starting with Pig 0.7, fields are named using the Pig schema
• Copy from HDFS to the local file system using the CLI
• Load into Solr using remote streaming
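
To make the Solr XML step concrete, here is a small Python sketch of the update message format (`<add>`, `<doc>`, `<field name="...">`), with field names taken from a schema list the way the slide describes. It is not the custom StoreFunc itself, and the field names and sample row are illustrative.

```python
import xml.etree.ElementTree as ET


def to_solr_xml(field_names, rows):
    """Serialize rows as a Solr <add> update message, one <doc> per row."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in zip(field_names, row):
            if value is None:
                continue  # skip missing values instead of writing empty fields
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Illustrative schema and row, not data from the talk.
xml_text = to_solr_xml(
    ["query", "related", "score"],
    [("the eagles", "hotel california", 0.9)],
)
```

ElementTree escapes characters such as `&` and `<` in field values, which matters for raw search queries.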



UDFs
• Use Piggy Bank and builtins when possible
• 89 custom UDFs packaged in a single jar
• Most are simple
   • Validate a URL, URL-decode a string, calculate a hash value, date math, etc.
• Some are complex
   • Spell check/correct, LOESS regression, Carrot2 clustering, FFT, Euclidean distance, etc.
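
For a sense of scale, the logic of a few of the simple UDFs fits in a handful of lines. The sketch below shows that logic as plain Python functions; the actual UDFs were packaged in a jar, so this is an illustration of what such functions compute rather than the original code, and the validation rule (http or https with a host) is an assumption.

```python
import math
from urllib.parse import unquote_plus, urlparse


def url_decode(text):
    """Decode percent-escapes and '+' in a query string value."""
    return unquote_plus(text)


def is_valid_url(text):
    """Accept only http(s) URLs that have a host part."""
    parts = urlparse(text)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def euclidean_distance(a, b):
    """Distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```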




Lessons learned
• Replace many small categorization scripts with a single larger one
• Set priority on large, time-sensitive jobs that fight for resources
  with other jobs
• Use the fair scheduler
• Tune the cluster for maps or reduces
• Don't write copious debug output
• Use an appropriate number of reducers (PARALLEL)




Related Searches

  Group by Query
Challenges
• Adult terms
• Misspellings
• Breadth of suggestions
• Coverage
• Timeliness of suggestions
Process Flow
• Filter and clean data
    • Block adult terms, long queries, non-alphabetic queries, second and
      later result pages, operators, URL-like queries, search spam
    • Lowercase
• Join to get query-related query groups
• Contextual spell correct within group
• Cluster related queries and pick the best from each
  group
• Load into Solr
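
The first two steps of this flow can be sketched in Python. Several details are assumptions that the slides do not state: the length cutoff, the regular expressions for URL-like queries and operators, and the use of a shared session as the join key. Adult-term and spam blocking need blocklists and are left out.

```python
import re
from collections import Counter
from itertools import combinations

MAX_QUERY_LENGTH = 60  # assumed cutoff for "long queries"
URL_LIKE = re.compile(r"^(https?://|www\.)|\.(com|net|org)\b")  # assumed pattern
OPERATOR = re.compile(r'\b(site|inurl|intitle):|["+]')  # assumed operator set


def clean_query(raw):
    """Return the normalized query, or None if it should be dropped."""
    query = " ".join(raw.lower().split())
    if not query or len(query) > MAX_QUERY_LENGTH:
        return None
    if not re.search(r"[a-z]", query):
        return None  # non-alphabetic, e.g. bare numbers
    if URL_LIKE.search(query) or OPERATOR.search(query):
        return None
    return query


def related_pairs(session_queries):
    """Count how often two distinct cleaned queries share a session."""
    counts = Counter()
    for queries in session_queries.values():
        cleaned = {q for q in map(clean_query, queries) if q is not None}
        for a, b in combinations(sorted(cleaned), 2):
            counts[(a, b)] += 1
    return counts


pairs = related_pairs({
    "s1": ["The Eagles", "hotel california"],
    "s2": ["the  eagles", "Hotel California", "www.aol.com"],
    "s3": ["the eagles"],
})
```

In Pig the pair counting would be a self-join on the grouping key followed by a GROUP and COUNT; the in-memory version above only shows the shape of the result.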
Related Searches Graph: “The Eagles”
[Figure: graph of related searches for “The Eagles”, with clusters labeled Hotel California, The band, NFL, Tribute, and Boston College]
Classification
• Supervised learning
• Provide a categorized set of queries and/or URLs
• Calculate a score based on the edge weights
• If the score exceeds a specified threshold, the query or URL is
  tagged with the category
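
The slides do not give the scoring formula. One plausible reading, shown here purely as an assumption, is to score a query by the share of its click weight that lands on URLs already labeled with the category, then compare that share to a threshold. The URLs and weights below are made up.

```python
def category_score(clicks, labeled_urls):
    """Share of a query's click weight that lands on URLs in the category.

    clicks: dict mapping URL -> edge weight (e.g. click count) for one query.
    labeled_urls: set of URLs already known to belong to the category.
    """
    total = sum(clicks.values())
    if total == 0:
        return 0.0
    on_category = sum(w for url, w in clicks.items() if url in labeled_urls)
    return on_category / total


def classify(click_graph, labeled_urls, threshold=0.5):
    """Return the queries whose score exceeds the threshold."""
    return {
        query
        for query, clicks in click_graph.items()
        if category_score(clicks, labeled_urls) > threshold
    }


# Hypothetical click graph: query -> {url: click count}.
click_graph = {
    "eagles schedule": {"sports.example.com/eagles": 8, "music.example.com/eagles": 2},
    "hotel california lyrics": {"music.example.com/eagles": 5},
}
sports_queries = classify(click_graph, {"sports.example.com/eagles"})
```

The same function works in the other direction (scoring URLs from labeled queries) by swapping the roles of the two node types in the click graph.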
Applications Outside of Search
• Author/citation bipartite graph
• Social network graphs
• User/Page view graphs
Temporal traffic correlation of Wikipedia Page Views
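
The slide does not say how the correlation was computed. As an assumption, a Pearson correlation between the hourly view counts of two pages is one simple way to measure it:

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)
```

Pages whose traffic rises and falls together score near 1, and unrelated pages score near 0.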




Tomato Seasonality
May: planting tomatoes, tomato cages, types of tomatoes
June: pruning tomato plants
July: tomato diseases, tomato blight, tomato worm
August: tomato recipes, tomato soup, tomato sauce, tomato salsa
September: sun dried tomatoes, canning and freezing tomatoes
October: green tomato recipes
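
A month-by-month list like this can be derived by finding, for each query, the month in which it is searched most. The sketch below assumes the input is already aggregated to (query, month, count) rows; the counts are invented for illustration and the talk does not describe its exact method.

```python
from collections import defaultdict


def peak_month(monthly_counts):
    """Return {query: month with the highest count} from (query, month, count) rows."""
    totals = defaultdict(lambda: defaultdict(int))
    for query, month, count in monthly_counts:
        totals[query][month] += count
    return {q: max(months, key=months.get) for q, months in totals.items()}


# Made-up counts; months are numbered 1-12.
rows = [
    ("tomato cages", 5, 120),
    ("tomato cages", 8, 30),
    ("tomato soup", 5, 40),
    ("tomato soup", 8, 90),
]
peaks = peak_month(rows)
```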



