SlideShare une entreprise Scribd logo
1  sur  37
SpotSigs   Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir 2008, Singapore
Near-Duplicate News Articles (I) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Near-Duplicate News Articles (II) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Our Setting June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
…  but ,[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
What is SpotSigs? ,[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Case (I): What’s different about the  core contents? = stopword occurrences:  the ,  that ,  {be} ,  {have} June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Case(II): Do not consider for deduplication! no occurrences of:  the, that ,  {be} ,  {have} June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Spot Signature Extraction ,[object Object],[object Object],[object Object],antecedent nearby n-gram Spot Signature  s June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Spot Signature Extraction ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Signature Extraction Example ,[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections S  = { a :rally:kick,  a :weeklong:campain,  the :south:carolina,  the :record:straight,  an :attack:circulating,  the :internet:designed,  is :designed:play}
Signature Extraction Algorithm ,[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Choice of Antecedents  ,[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Done? SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections June 4, 2009
How to deduplicate a large collection  efficiently ? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Which documents (not) to compare? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],   Idea:  Two signature sets  A ,  B  can only have high (Jaccard) similarity if they are of similar cardinality! ? June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Upper bound for Jaccard ,[object Object],[object Object],[object Object],[object Object],[object Object],(for |B|≥|A|, w.l.o.g.) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Multi-set Generalization ,[object Object],[object Object],[object Object],|A|   |B|  June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Partitioning the Collection ,[object Object],[object Object],[object Object],#sigs per doc ,[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections S i S j ?
Partitioning the Collection ,[object Object],[object Object],[object Object],#sigs per doc ,[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections S i S j Also:  Partition widths should be a function of  τ
Optimal Partitioning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],   But expensive to solve  exactly  … June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Approximate Solution ,[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections “ Starting with p 0  = 1, for any given p k  ,  choose p k+1  as the smallest integer p k+1  > p k   s.t.  p k+1  − p k  > (1 −  τ ) p k+1  ” E.g.   (for  τ =0.7 ) :  p 0 =1 , p 1 =3 , p 2 =6 , p 3 =10 ,…, p 7 =43 , p 8 =59,…
Partitioning Effects  ,[object Object],[object Object],Uniform bucket widths Progressively increasing bucket widths #docs #sigs #docs #sigs June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
…  but  ,[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Inverted Index Pruning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections the:campaign an:attack d 7 :8 d 6 :6 d 7 :4 … … d 2 :6 d 5 :3 d 1 :3 d 1 :5 d 5 :4 Partition k
Deduplication Example Deduplicate d 1 S 3 :  1)  δ 1 =0,  δ 2 =1 ->  sim(d 1 ,d 3  ) = 0.8 2)  d 1 =d 1  -> continue 3)   δ 1 =4,  δ 2 =0 ->  break! June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections δ 2 =1 δ 1 =4 Given: d 1  = {s 1 :5, s 2 :4, s 3 :4} ,  |d 1 |=13  d 2  = {s 1 :8, s 2 :4},  |d 2 |=12 d 3  = {s 1 :4, s 2 :5, s 3 :5},  |d 3 |=13 Threshold:   τ  = 0.8 Break if:   δ 1  +  δ 2  > (1-  τ )|d i | p k p k+1 … … d 3 :5 d 2 :8 d 3 :4 d 1 :5 d 1 :4 Partition k s 3 s 1 d 3 :5 d 3 :4 d 1 :4 s 2
SpotSigs Deduplication Algorithm ,[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections ,[object Object]
Experiments ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Competitors ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
“ Gold Set” of News Articles ,[object Object],[object Object],Macro- Avg. Cosine ≈ 0.64 June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
SpotSigs vs. Shingling – Gold Set Using (weighted) Jaccard similarity Using Cosine similarity (no pruning!) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Runtime Results – TREC WT10g June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SpotSigs vs. LSH-S using I-Match-S as recall base
Tuning I-Match & LSH on the Gold Set ,[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Tuning I-Match-S: varying IDF thresholds Tuning LSH-S: varying #hash functions  l for fixed #MinHashes  k = 6
Summary – Gold Set June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Summary of algorithms at their best F1 spots ( τ  = 0.44 for SpotSigs & LSH)
Summary – TREC WT10g June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Relative recall of SpotSigs & LSH using I-Match-S as recall base  at  τ  = 1.0 and  τ  = 0.9
Conclusions & Outlook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
Related Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Contenu connexe

Similaire à Spot Sigs

How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?Samet KILICTAS
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...Fabio Benedetti
 
Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingVrije Universiteit Amsterdam
 
Bridging the gap between designers and developers at the Guardian
Bridging the gap between designers and developers at the GuardianBridging the gap between designers and developers at the Guardian
Bridging the gap between designers and developers at the GuardianKaelig Deloumeau-Prigent
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeArangoDB Database
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaSpark Summit
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Spark summit europe 2015 magellan
Spark summit europe 2015 magellanSpark summit europe 2015 magellan
Spark summit europe 2015 magellanRam Sriharsha
 
Going beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settingsGoing beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settingsUwe Korn
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Shobhit Saxena
 

Similaire à Spot Sigs (20)

How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
 
Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic Programming
 
Bridging the gap between designers and developers at the Guardian
Bridging the gap between designers and developers at the GuardianBridging the gap between designers and developers at the Guardian
Bridging the gap between designers and developers at the Guardian
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Things to do with OpenStreetMap
Things to do with OpenStreetMapThings to do with OpenStreetMap
Things to do with OpenStreetMap
 
Spark summit europe 2015 magellan
Spark summit europe 2015 magellanSpark summit europe 2015 magellan
Spark summit europe 2015 magellan
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 
Going beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settingsGoing beyond Apache Parquet's default settings
Going beyond Apache Parquet's default settings
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561Ch22 parallel d_bs_cs561
Ch22 parallel d_bs_cs561
 
Scalax
ScalaxScalax
Scalax
 

Plus de infoblog

CIDR 2009: James Hamilton Keynote
CIDR 2009: James Hamilton KeynoteCIDR 2009: James Hamilton Keynote
CIDR 2009: James Hamilton Keynoteinfoblog
 
CIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer KeynoteCIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer Keynoteinfoblog
 
Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)infoblog
 
Claremont Report on Database Research: Research Directions (Eric A. Brewer)
Claremont Report on Database Research: Research Directions (Eric A. Brewer)Claremont Report on Database Research: Research Directions (Eric A. Brewer)
Claremont Report on Database Research: Research Directions (Eric A. Brewer)infoblog
 
Claremont Report on Database Research: Research Directions (Rakesh Agrawal)
Claremont Report on Database Research: Research Directions (Rakesh Agrawal)Claremont Report on Database Research: Research Directions (Rakesh Agrawal)
Claremont Report on Database Research: Research Directions (Rakesh Agrawal)infoblog
 
Claremont Report on Database Research: Research Directions (Gerhard Weikum)
Claremont Report on Database Research: Research Directions (Gerhard Weikum)Claremont Report on Database Research: Research Directions (Gerhard Weikum)
Claremont Report on Database Research: Research Directions (Gerhard Weikum)infoblog
 
Claremont Report on Database Research: Research Directions (Beng Chin Ooi)
Claremont Report on Database Research: Research Directions (Beng Chin Ooi)Claremont Report on Database Research: Research Directions (Beng Chin Ooi)
Claremont Report on Database Research: Research Directions (Beng Chin Ooi)infoblog
 
Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)
Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)
Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)infoblog
 
Claremont Report on Database Research: Research Directions (Donald Kossmann)
Claremont Report on Database Research: Research Directions (Donald Kossmann)Claremont Report on Database Research: Research Directions (Donald Kossmann)
Claremont Report on Database Research: Research Directions (Donald Kossmann)infoblog
 
Claremont Report on Database Research: Research Directions (Johannes Gehrke)
Claremont Report on Database Research: Research Directions (Johannes Gehrke)Claremont Report on Database Research: Research Directions (Johannes Gehrke)
Claremont Report on Database Research: Research Directions (Johannes Gehrke)infoblog
 
Claremont Report on Database Research: Research Directions (Alon Y. Halevy)
Claremont Report on Database Research: Research Directions (Alon Y. Halevy)Claremont Report on Database Research: Research Directions (Alon Y. Halevy)
Claremont Report on Database Research: Research Directions (Alon Y. Halevy)infoblog
 
Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)
Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)
Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)infoblog
 
Database Research Principles Revealed (Small Size)
Database Research Principles Revealed (Small Size)Database Research Principles Revealed (Small Size)
Database Research Principles Revealed (Small Size)infoblog
 
Database Research Principles Revealed
Database Research Principles RevealedDatabase Research Principles Revealed
Database Research Principles Revealedinfoblog
 

Plus de infoblog (14)

CIDR 2009: James Hamilton Keynote
CIDR 2009: James Hamilton KeynoteCIDR 2009: James Hamilton Keynote
CIDR 2009: James Hamilton Keynote
 
CIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer KeynoteCIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer Keynote
 
Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)
 
Claremont Report on Database Research: Research Directions (Eric A. Brewer)
Claremont Report on Database Research: Research Directions (Eric A. Brewer)Claremont Report on Database Research: Research Directions (Eric A. Brewer)
Claremont Report on Database Research: Research Directions (Eric A. Brewer)
 
Claremont Report on Database Research: Research Directions (Rakesh Agrawal)
Claremont Report on Database Research: Research Directions (Rakesh Agrawal)Claremont Report on Database Research: Research Directions (Rakesh Agrawal)
Claremont Report on Database Research: Research Directions (Rakesh Agrawal)
 
Claremont Report on Database Research: Research Directions (Gerhard Weikum)
Claremont Report on Database Research: Research Directions (Gerhard Weikum)Claremont Report on Database Research: Research Directions (Gerhard Weikum)
Claremont Report on Database Research: Research Directions (Gerhard Weikum)
 
Claremont Report on Database Research: Research Directions (Beng Chin Ooi)
Claremont Report on Database Research: Research Directions (Beng Chin Ooi)Claremont Report on Database Research: Research Directions (Beng Chin Ooi)
Claremont Report on Database Research: Research Directions (Beng Chin Ooi)
 
Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)
Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)
Claremont Report on Database Research: Research Directions (Yannis E. Ioannidis)
 
Claremont Report on Database Research: Research Directions (Donald Kossmann)
Claremont Report on Database Research: Research Directions (Donald Kossmann)Claremont Report on Database Research: Research Directions (Donald Kossmann)
Claremont Report on Database Research: Research Directions (Donald Kossmann)
 
Claremont Report on Database Research: Research Directions (Johannes Gehrke)
Claremont Report on Database Research: Research Directions (Johannes Gehrke)Claremont Report on Database Research: Research Directions (Johannes Gehrke)
Claremont Report on Database Research: Research Directions (Johannes Gehrke)
 
Claremont Report on Database Research: Research Directions (Alon Y. Halevy)
Claremont Report on Database Research: Research Directions (Alon Y. Halevy)Claremont Report on Database Research: Research Directions (Alon Y. Halevy)
Claremont Report on Database Research: Research Directions (Alon Y. Halevy)
 
Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)
Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)
Claremont Report on Database Research: Research Directions (Anastasia Ailamaki)
 
Database Research Principles Revealed (Small Size)
Database Research Principles Revealed (Small Size)Database Research Principles Revealed (Small Size)
Database Research Principles Revealed (Small Size)
 
Database Research Principles Revealed
Database Research Principles RevealedDatabase Research Principles Revealed
Database Research Principles Revealed
 

Dernier

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 

Dernier (20)

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

Spot Sigs

  • 1. SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir 2008, Singapore
  • 2. Near-Duplicate News Articles (I) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
  • 3. Near-Duplicate News Articles (II) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
  • 4. Our Setting June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
  • 5.
  • 6.
  • 7. Case (I): What’s different about the core contents? = stopword occurrences: the , that , {be} , {have} June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
  • 8. Case(II): Do not consider for deduplication! no occurrences of: the, that , {be} , {have} June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14. Done? SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections June 4, 2009
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26. Deduplication Example Deduplicate d 1 S 3 : 1) δ 1 =0, δ 2 =1 -> sim(d 1 ,d 3 ) = 0.8 2) d 1 =d 1 -> continue 3) δ 1 =4, δ 2 =0 -> break! June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections δ 2 =1 δ 1 =4 Given: d 1 = {s 1 :5, s 2 :4, s 3 :4} , |d 1 |=13 d 2 = {s 1 :8, s 2 :4}, |d 2 |=12 d 3 = {s 1 :4, s 2 :5, s 3 :5}, |d 3 |=13 Threshold: τ = 0.8 Break if: δ 1 + δ 2 > (1- τ )|d i | p k p k+1 … … d 3 :5 d 2 :8 d 3 :4 d 1 :5 d 1 :4 Partition k s 3 s 1 d 3 :5 d 3 :4 d 1 :4 s 2
  • 27.
  • 28.
  • 29.
  • 30.
  • 31. SpotSigs vs. Shingling – Gold Set Using (weighted) Jaccard similarity Using Cosine similarity (no pruning!) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
  • 32. Runtime Results – TREC WT10g June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SpotSigs vs. LSH-S using I-Match-S as recall base
  • 33.
  • 34. Summary – Gold Set June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Summary of algorithms at their best F1 spots ( τ = 0.44 for SpotSigs & LSH)
  • 35. Summary – TREC WT10g June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Relative recall of SpotSigs & LSH using I-Match-S as recall base at τ = 1.0 and τ = 0.9
  • 36.
  • 37.