Query Categorization at Scale 
NYC Search, Discovery & Analytics meetup 
September 23rd, 2014 
Alex Dorman, CTO 
alex at magnetic dot com
About Magnetic 
One of the largest aggregators of intent data 
First company to focus 100% on applying search intent to display 
Proprietary media platform and targeting algorithm 
Display, mobile, video and social retargeting capabilities 
Strong solution for customer acquisition and retention
Search Retargeting 
Search retargeting combines the purchase intent from search with the scale from display
1) Magnetic collects search data
2) Magnetic builds audience segments
3) Magnetic serves retargeted ads
4) Magnetic optimizes campaign
Slovak Academy of Sciences
Institute of Informatics
- One of the top European research institutes
- Significant experience in:
- Information retrieval
- Semantic Web
- Natural Language Processing 
- Parallel and Distributed processing 
- Graph/Networks Analysis
Search Data - Natural and Navigational 
Natural Searches and Navigational Searches 
Natural Search: “iPhone”
Navigational Search: “iPad Accessories”
Search Data – Page Keywords 
Page keywords from article metadata: “Recipes, Cooking, Holiday Recipes”
Search Data – Page Keywords 
Article Titles: “Microsoft is said to be in talks to acquire Minecraft”
Search Data – Why Categorize? 
• Targeting categories instead of keywords = scale
• Use the category name as an additional feature in predictive models to optimize advertising
• Reporting by category is easier to grasp than reporting by keyword
Query Categorization Problem 
• Input: Query 
• Output: Classification into a 
predefined taxonomy 
Query Categorization – Academic Approach 
• Usual approach (academic publications): 
– Get documents from a web search 
– Classify based on the retrieved documents
Query Categorization

Query → Categories
apple → Computers\Hardware; Living\Food & Cooking
FIFA 2006 → Sports\Soccer; Sports\Schedules & Tickets; Entertainment\Games & Toys
cheesecake recipes → Living\Food & Cooking; Information\Arts & Humanities
friendships poem → Information\Arts & Humanities; Living\Dating & Relationships
• Usual approach: 
• Get results for query 
• Categorize returned documents 
• Best algorithms work with the entire web (search API)
Long Time Ago … 
• Relying on Bing Search API: 
– Get search results using the query we want to categorize 
– See if some category-specific “characteristic” keywords appear in the results
– Combine scores 
– Not too bad....
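A minimal sketch of that early approach, assuming a hypothetical bing_search helper that returns result snippets and an illustrative CATEGORY_KEYWORDS map (neither is Magnetic's actual code):

```python
# Sketch of the early search-API-based categorizer. CATEGORY_KEYWORDS and the
# bing_search callable are illustrative stand-ins, not the production code.
from collections import defaultdict

CATEGORY_KEYWORDS = {
    "Living\\Food & Cooking": {"recipe", "recipes", "cooking"},
    "Sports\\Soccer": {"fifa", "soccer", "league"},
}

def categorize_via_search(query, bing_search, top_n=10):
    """Score categories by counting characteristic keywords in result snippets."""
    scores = defaultdict(float)
    for snippet in bing_search(query)[:top_n]:
        tokens = set(snippet.lower().split())
        for category, keywords in CATEGORY_KEYWORDS.items():
            scores[category] += len(tokens & keywords)
    total = sum(scores.values()) or 1.0
    # Normalize so the per-category scores are comparable across queries.
    return {c: s / total for c, s in scores.items() if s > 0}
```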
Long Time Ago … 
• ... But ...
• We have ~8Bn queries per month to categorize ...
• $2,000 * 8,000 = Oh My! (8Bn queries ≈ 8,000 × 1M; at ~$2,000 per 1M queries, that is ≈ $16M per month)
Our Query Categorization Approach – Take 2 
• Use a replacement for the web – Wikipedia
Our Query Categorization Approach – Take 2
• Assign a category to each Wikipedia document (with a score)
• Load all documents and scores into an index
• Search within the index
• Compute the final score for the query
Measuring Quality
• Precision is the fraction of retrieved documents that are relevant to the query.
• Recall is the fraction of the documents relevant to the query that are successfully retrieved.
Measuring Quality
[Figure: retrieved results; relevant items to the left of the line, errors in red]
Measuring Quality 
• Measure result quality using F1 Score: 
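F1 is the harmonic mean of precision and recall, F1 = 2 · precision · recall / (precision + recall). A small sketch of the per-query evaluation:

```python
def precision_recall_f1(retrieved, relevant):
    """Per-query precision, recall, and F1 over sets of category labels."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```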
Measuring Quality
• Prepare a test set using manually annotated queries
• 10,000 queries annotated via crowdsourcing to Magnetic employees
Step-by-Step 
Let’s go into details
Query Categorization - Overview
• Assign a category to each Wikipedia document (with a score)
• Load all documents and scores into an index
• Search within the index
• Compute the final score for the query
How?
Step-by-Step

Preparation steps:
1. Create map – Category: {seed documents}
2. Compute n-grams – Category: {n-grams: score}
3. Parse Wikipedia – Document: {title, redirects, anchor text, etc.}
4. Categorize documents – Document: {title, redirects, anchor text, etc.} → {category: score}
5. Build index

Real-time query categorization:
1. Query
2. Search within index
3. Combine scores from each document in results
4. Query: {category: score}
Step By Step – Seed Documents
• Each category is represented by one or more wiki pages (manual mapping)
• Example: Electronics & Computing\Cell Phone
– Mobile phone 
– Smartphone 
– Camera phone
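In code, the manual mapping can be a plain lookup table; an illustrative subset built from the slide's example (the real mapping covers the whole taxonomy):

```python
# Seed mapping: Magnetic taxonomy category -> Wikipedia seed pages.
# (Illustrative subset; not the full production mapping.)
SEED_PAGES = {
    "Electronics & Computing\\Cell Phone": [
        "Mobile phone",
        "Smartphone",
        "Camera phone",
    ],
}
```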
N-grams Generation From Seed Wikipages
• Wikipedia is rich in links and metadata.
• We utilize links between pages to find “similar concepts”.
• The set of similar concepts is saved as a list of n-grams.
N-grams Generation From Seed Wikipages and Links
• For each link we compute the similarity of the linked page to the seed page (as cosine similarity):

Mobile phone 1.0
Smartphone 1.0
Camera phone 1.0
Mobile operating system 0.3413
Android (operating system) 0.2098
Tablet computer 0.1965
Comparison of smartphones 0.1945
Personal digital assistant 0.1934
IPhone 0.1926
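A minimal sketch of the link-scoring step, assuming simple bag-of-words term counts for each page (the production feature set may differ):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two pages' term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_links(seed_text, linked_pages):
    """linked_pages: {title: page text} -> {title: similarity to the seed page}."""
    return {title: cosine_similarity(seed_text, text)
            for title, text in linked_pages.items()}
```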
Extending Seed Documents with Redirects
• There are many redirects and alternative names in Wikipedia.
• For example, “Cell Phone” redirects to “Mobile Phone”.
• Alternative names are added to the list of n-grams for this category (sketch after the list):
Mobil phone 1.0 
Mobilephone 1.0 
Mobil Phone 1.0 
Cellular communication standard 1.0 
Mobile communication standard 1.0 
Mobile communications 1.0 
Environmental impact of mobile phones 1.0 
Kosher phone 1.0 
How mobilephones work? 1.0 
Mobile telecom 1.0 
Celluar telephone 1.0 
Cellular Radio 1.0 
Mobile phones 1.0 
Cellular phones 1.0 
Mobile telephone 1.0 
Mobile cellular 1.0 
Cell Phone 1.0 
Flip phones 1.0 
…..
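Merging those redirect titles into a category's n-gram list is then a simple update; a sketch assuming a hypothetical redirects_of lookup over the parsed dump:

```python
def extend_with_redirects(category_ngrams, seed_pages, redirects_of):
    """Add each seed page's redirect/alternative titles at weight 1.0.

    redirects_of is a hypothetical lookup over the parsed Wikipedia dump,
    e.g. redirects_of("Mobile phone") -> ["Cell Phone", "Mobile phones", ...].
    """
    for page in seed_pages:
        for alt_name in redirects_of(page):
            # Redirects count as exact alternative names, hence weight 1.0.
            category_ngrams.setdefault(alt_name.lower(), 1.0)
    return category_ngrams
```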
Creating index – What information to use?
• Some information in Wikipedia helped more than others
• We’ve tested combinations of different fields and applied different algorithms to select the approach with the best results
• Data set for the test – KDD Cup 2005, “Internet User Search Query Categorization” – 800 queries annotated by 3 reviewers
Creating index – Parsed Fields 
• Fields for Categorization of Wikipedia documents: 
– title 
– abstract 
– db_category 
– fb_category 
– category 
What else goes into the index – Freebase, DBpedia
• Some Freebase/DBpedia categories are mapped to the Magnetic taxonomy (manual mapping)
• (Freebase and DBpedia have links back to Wikipedia documents)
• Examples:
– Arts & Entertainment\Pop Culture & Celebrity News: Celebrity; music.artist; MusicalArtist; …
– Arts & Entertainment\Movies/Television: TelevisionStation; film.film; film.actor; film.director; …
– Automotive\Manufacturers: automotive.model, automotive.make
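As with the seed pages, this mapping can live in a plain lookup table; an illustrative subset assembled from the slide's examples:

```python
# Freebase/DBpedia type -> Magnetic taxonomy categories (manual mapping).
# (Illustrative subset; the production table covers many more types.)
TYPE_TO_CATEGORY = {
    "Celebrity":        ["Arts & Entertainment\\Pop Culture & Celebrity News"],
    "music.artist":     ["Arts & Entertainment\\Pop Culture & Celebrity News"],
    "MusicalArtist":    ["Arts & Entertainment\\Pop Culture & Celebrity News"],
    "TelevisionStation": ["Arts & Entertainment\\Movies/Television"],
    "film.film":        ["Arts & Entertainment\\Movies/Television"],
    "film.actor":       ["Arts & Entertainment\\Movies/Television"],
    "film.director":    ["Arts & Entertainment\\Movies/Television"],
    "automotive.model": ["Automotive\\Manufacturers"],
    "automotive.make":  ["Automotive\\Manufacturers"],
}
```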
Wikipedia page categorization: n-gram matching

Ted Nugent

Article abstract:
Theodore Anthony "Ted" Nugent (born December 13, 1948) is an American rock musician from Detroit, Michigan. Nugent initially gained fame as the lead guitarist of The Amboy Dukes before embarking on a solo career. His hits, mostly coming in the 1970s, such as "Stranglehold", "Cat Scratch Fever", "Wango Tango", and "Great White Buffalo", as well as his 1960s Amboy Dukes …

Additional text from abstract links:
rock and roll, 1970s in music, Stranglehold (Ted Nugent song), Cat Scratch Fever (song), Wango Tango (song), Conservatism in the United States, Gun politics in the United States, Republican Party (United States)
Wikipedia page categorization: n-gram matching
[Figure: categories of the Wikipedia article]
Wikipedia page categorization: n-gram matching

Found n-gram keywords with scores for categories:
rock musician - Arts & Entertainment\Music:0.08979
Advocate - Law-Government-Politics\Legal:0.130744
christians - Lifestyle\Religion and Belief:0.055088; Lifestyle\Wedding & Engagement:0.0364
gun rights - Negative\Firearms:0.07602
rock and roll - Arts & Entertainment\Music:0.104364
reality television series - Lifestyle\Dating:0.034913; Arts & Entertainment\Movies/Television:0.041453
...
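A sketch of the matching step: scan the document text for category n-grams and accumulate their scores (tokenization and field-level details are omitted):

```python
from collections import defaultdict

def match_ngrams(doc_text, category_ngrams, max_n=3):
    """category_ngrams: {ngram: {category: score}} -> {category: total score}."""
    tokens = doc_text.lower().split()
    found = defaultdict(float)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            # Each matched n-gram contributes its per-category score.
            for category, score in category_ngrams.get(ngram, {}).items():
                found[category] += score
    return dict(found)
```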
Wikipedia page categorization: n-gram matching

DBpedia/Freebase mapping:
book.author - A&E\Books and Literature:0.95
MusicGroup - A&E\Pop Culture & Celebrity News:0.95; A&E\Music:0.95
tv.tv_actor - A&E\Pop Culture & Celebrity News:0.95; A&E\Movies/Television:0.95
film.actor - A&E\Pop Culture & Celebrity News:0.95; A&E\Movies/Television:0.95
…

Freebase categories found and matched:
base.livemusic.topic; user.narphorium.people.topic; user.alust.default_domain.processed_with_review_queue; common.topic; user.narphorium.people.nndb_person; book.author; music.group_member; film.actor; … ; tv.tv_actor; people.person; …

DBpedia categories found and matched:
MusicalArtist; MusicGroup; Agent; Artist; Person
Wikipedia page categorization: results

Document: Ted Nugent
Arts & Entertainment\Pop Culture & Celebrity News:0.956686
Arts & Entertainment\Music:0.956681
Arts & Entertainment\Movies/Television:0.954364
Arts & Entertainment\Books and Literature:0.908852
Sports\Game & Fishing:0.874056

Result categories for the document are combined from:
- text n-gram matching
- DBpedia mapping
- Freebase mapping
Query Categorization
• Take search fields
• Search using Lucene’s standard TF/IDF scoring implementation
• Get results
• Filter results using alternative names
• Combine the remaining documents’ pre-computed categories
• Remove results with low confidence scores
• Return the resulting set of categories with confidence scores (sketch below)
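A sketch of the retrieval-and-pruning part of that flow. Here solr_search and alternative_names are hypothetical accessors over the index, and the score-to-probability mapping s/(s+1) is inferred from the worked example on the next slides:

```python
def search_and_prune(query, solr_search, alternative_names, top_k=20):
    """Runtime retrieval step: search the index, prune by alternative names,
    and map each Lucene score to a relatedness probability.

    Returns [(doc_id, P(R(q, D)))] ready for the score-combination step
    shown later in the deck.
    """
    q = query.lower()
    hits = solr_search(query)[:top_k]            # [(doc_id, lucene_score)]
    kept = [(doc, score) for doc, score in hits
            if any(name in q for name in alternative_names(doc))]
    # s/(s+1) reproduces the relatedness weights in the worked example.
    return [(doc, score / (score + 1.0)) for doc, score in kept]
```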
Query Categorization: search within index
• Searching within all data stored in the Lucene index
• Computing categories for each result, normalized by Lucene score
• Example: “Total recall Arnold Schwarzenegger”
• List of documents found (with Lucene score):
1. Arnold Schwarzenegger filmography; score: 9.455296 
2. Arnold Schwarzenegger; score: 6.130941 
3. Total Recall (2012 film); score: 5.9359055 
4. Political career of Arnold Schwarzenegger; score: 5.7361355 
5. Total Recall (1990 film); score: 5.197826 
6. List of awards and nominations received by Arnold Schwarzenegger; score: 4.9710693 
7. California gubernatorial recall election; score: 4.9665976 
8. Patrick Schwarzenegger; score: 3.2915113 
9. Recall election; score: 3.2077827 
10. Gustav Schwarzenegger; score: 3.1247897
Prune Results Based on Alternative Names 
• Searching within all data stored in the Lucene index
• Computing categories for each result, normalized by Lucene score
• Example: “Total recall Arnold Schwarzenegger” 
• List of documents found (with Lucene score): 
1. Arnold Schwarzenegger filmography; score: 9.455296 
2. Arnold Schwarzenegger; score: 6.130941 
3. Total Recall (2012 film); score: 5.9359055 
4. Political career of Arnold Schwarzenegger; score: 5.7361355 
5. Total Recall (1990 film); score: 5.197826 
6. List of awards and nominations received by Arnold Schwarzenegger; score: 4.9710693 
7. California gubernatorial recall election; score: 4.9665976 
8. Patrick Schwarzenegger; score: 3.2915113 
9. Recall election; score: 3.2077827 
10. Gustav Schwarzenegger; score: 3.1247897 
“Total recall Arnold Schwarzenegger”
Alternative Names: total recall (upcoming film), total recall (2012 film), total recall, total recall (2012), total recall 2012
Prune Results Based on Alternative Names
• Matched using alternative names (the remaining documents are pruned):
2. Arnold Schwarzenegger; score: 6.130941
3. Total Recall (2012 film); score: 5.9359055
5. Total Recall (1990 film); score: 5.197826
Retrieve Categories for Each Document

2. Arnold Schwarzenegger; score: 6.130941
Arts & Entertainment\Movies/Television:0.999924
Arts & Entertainment\Pop Culture & Celebrity News:0.999877
Business:0.99937
Law-Government-Politics\Politics:0.9975
Games\Video & Computer Games:0.986331

3. Total Recall (2012 film); score: 5.9359055
Arts & Entertainment\Movies/Television:0.999025
Arts & Entertainment\Humor:0.657473

5. Total Recall (1990 film); score: 5.197826
Arts & Entertainment\Movies/Television:0.999337
Games\Video & Computer Games:0.883085
Arts & Entertainment\Hobbies\Antiques & Collectables:0.599569
Combine Results and Calculate Final Score

“Total recall Arnold Schwarzenegger”
Arts & Entertainment\Movies/Television:0.996706
Games\Video & Computer Games:0.960575
Arts & Entertainment\Pop Culture & Celebrity News:0.85966
Business:0.859224
Combining Scores from Multiple Documents
If P(A(x, c)) is the probability that entity (query or document) x should be assigned to category c, then we can combine scores from multiple documents using the following formula, where:
P(R(q, D)) is the probability that query q is related to the document D
P(A(D, c)) is the probability that document D should be assigned to category c
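The combination can be read as a noisy-OR over the result documents; this reading reproduces the final scores in the worked example above:

P(A(q, c)) = 1 − ∏_{D ∈ results} (1 − P(R(q, D)) · P(A(D, c)))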
Combining Scores
• Should we limit the number of documents in the result set?
• Based on research we decided to limit it to the top 20 (sketch below)
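A sketch of the combination step under the noisy-OR reading above; doc_categories is a hypothetical accessor returning each document's pre-computed category scores:

```python
def combine_document_scores(related_docs, doc_categories, limit=20):
    """Noisy-OR combination of per-document category scores.

    related_docs: [(doc_id, P(R(q, D)))], capped at the top 20 documents;
    doc_categories(doc_id) -> {category: P(A(D, c))} from the index.
    """
    result = {}
    for doc_id, p_related in related_docs[:limit]:
        for category, p_assigned in doc_categories(doc_id).items():
            kept = result.get(category, 0.0)
            # P = 1 - prod(1 - P(R) * P(A)) over the contributing documents.
            result[category] = 1.0 - (1.0 - kept) * (1.0 - p_related * p_assigned)
    return result
```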
Precision/Recall
[Figure: precision/recall evaluation results]
Categorizing Other Languages 
1 000 000+ Documents: 
Deutsch 
English 
Español 
Français 
Italiano 
Nederlands 
Polski 
Русский 
Sinugboanong Binisaya 
Svenska 
Tiếng Việt 
Winaray
Categorizing other languages 
- In development 
- Combining indexes for multiple languages into one common index
- Focus: 
- Spanish 
- French 
- German 
- Portuguese 
- Dutch
Preprocessing Workflow
• Automated Hadoop and local jobs
• Luigi library and scheduler
• Steps (sketch after the list):
– Download
– Uncompress
– Parse Wikipedia/Freebase/DBpedia
– Generate n-grams
– JOIN together (Wikipage) – JSON per page
– Preprocess Wikipage categories
– Produce JSON for Solr or local index
– Load to Solr + check quality
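A sketch of how two of those steps can be wired together with Luigi (task names and file paths are illustrative, not the production pipeline):

```python
import luigi

class DownloadDump(luigi.Task):
    """First step of the pipeline: fetch a dump to local storage."""
    def output(self):
        return luigi.LocalTarget("enwiki-latest-pages-articles.xml.bz2")
    def run(self):
        # Production code downloads the dump here; the sketch just touches a file.
        with self.output().open("w") as out:
            out.write("")

class GenerateNgrams(luigi.Task):
    """Downstream step: depends on the download, emits per-category n-grams."""
    def requires(self):
        return DownloadDump()
    def output(self):
        return luigi.LocalTarget("category_ngrams.json")
    def run(self):
        with self.output().open("w") as out:
            out.write("{}")  # real job parses the dump and writes n-gram scores

if __name__ == "__main__":
    # Luigi skips tasks whose outputs already exist, giving resumable runs.
    luigi.build([GenerateNgrams()], local_scheduler=True)
```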
Query Categorization: Scale
• Scale is achieved by a combination of multiple categorization boxes, load balancing, and a Varnish (open-source) cache layer in front of Solr
• We have 6 servers in production today
• Load balancer – HAProxy
• Capacity – 1,000 QPS/server
• More servers can be added if needed

[Diagram: Wikipedia, DBpedia, and Freebase dumps → Hadoop → Solr index → Search Engine (Solr) → Cache (Varnish) → LB; Reporting]
Architected for Scale
• Bidders and AdServers are developed in Python and use the PyPy VM with JIT
• Response time is critical - typically under 100ms as measured by the exchange
• High volume of auctions – 200,000 QPS at peak
• Hadoop – 25-node cluster
• 3 DCs – US East, US West, and London
• Data centers have multiple load balancers – HAProxy
• Overview of servers in production:
• US East: 6 LB, 45 Bidders, 6 AdServers, 4 trackers, 25 Hadoop, 9 HBase, 8 Kyoto DB
• US West: 3 LB, 17 Bidders, 6 AdServers, 4 trackers, 4 Kyoto DB
• London: 8 Bidders, 2 AdServers, 2 trackers, 4 Kyoto DB
ERD Challenge
• ERD'14: Entity Recognition and Disambiguation Challenge
• Organized as a workshop at SIGIR 2014, Gold Coast
• Goal: submit working systems that identify the entities mentioned in text
• We participated in the “Short Text” track
• 19 teams participated in the challenge
• We took 4th place
Q & A
