Spot Sigs

SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections. Martin Theobald, Jonathan Siddharth, Andreas Paepcke (Stanford University). SIGIR 2008, Singapore

Near-Duplicate News Articles (I) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Near-Duplicate News Articles (II)

Our Setting

… but

What is SpotSigs?

Case (I): What’s different about the core contents? Marked: stopword occurrences (the, that, {be}, {have})

Case (II): Do not consider for deduplication! No occurrences of: the, that, {be}, {have}

Spot Signature Extraction (diagram labels: antecedent, nearby n-gram, Spot Signature s)

Spot Signature Extraction

Signature Extraction Example: S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
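A minimal sketch of how signatures of this shape could be produced, assuming each antecedent is chained with the next two non-stopword tokens that follow it. The input sentence, the stopword list, the `chain_length` parameter, and the function name `spot_signatures` are assumptions for illustration; none of them appear in the slide text.

```python
import re

# Assumed antecedents, read off the prefixes of the signatures listed on the slide.
ANTECEDENTS = {"a", "an", "the", "is"}
# Assumed stopword list; tokens in it are skipped when building a chain.
STOPWORDS = ANTECEDENTS | {"at", "to", "off", "for", "from", "on", "that", "into", "against"}


def spot_signatures(text, antecedents=ANTECEDENTS, stopwords=STOPWORDS, chain_length=2):
    """Return 'antecedent:tok1:tok2' signatures in order of occurrence."""
    tokens = re.findall(r"[a-z]+", text.lower())
    signatures = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        # Take the next `chain_length` non-stopword tokens after the antecedent.
        chain = [t for t in tokens[i + 1:] if t not in stopwords][:chain_length]
        if len(chain) == chain_length:  # drop incomplete chains at the end of the text
            signatures.append(":".join([token] + chain))
    return signatures


# Hypothetical input sentence, chosen so the result can be compared with the set S.
sentence = ("At a rally to kick off a weeklong campaign for the South Carolina primary, "
            "Obama tried to set the record straight from an attack circulating widely "
            "on the Internet that is designed to play into prejudices against Muslims.")
print(spot_signatures(sentence))
```

Skipping stopwords inside the chain is what lets `the:internet:designed` jump over "that is" in this sketch; a different stopword list would yield different signatures.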

Signature Extraction Algorithm

Choice of Antecedents

Done?

How to deduplicate a large collection efficiently?

Which documents (not) to compare? Idea: Two signature sets A, B can only have high (Jaccard) similarity if they are of similar cardinality!

Upper bound for Jaccard (for |B| ≥ |A|, w.l.o.g.)
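The bound named in the slide title can be reconstructed from the definition of Jaccard similarity. The derivation below is a sketch under the slide's assumption |B| ≥ |A|; it is not copied from the slide itself.

```latex
\[
\mathrm{sim}(A,B) \;=\; \frac{|A \cap B|}{|A \cup B|}
\;\le\; \frac{\min(|A|,|B|)}{\max(|A|,|B|)}
\;=\; \frac{|A|}{|B|}
\qquad \text{for } |B| \ge |A|
\]
```

The inequality holds because the intersection is no larger than the smaller set and the union is no smaller than the larger set. A pair with |A|/|B| < τ therefore cannot reach a similarity threshold τ and need not be compared.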

Multi-set Generalization
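One common way to generalize Jaccard similarity to multisets is the weighted form below, where f_A(s) denotes the frequency of signature s in A. The notation f_A is an assumption introduced here; the formula is consistent with the value sim(d_1, d_3) = 0.8 in the deck's deduplication example, but the slide's own wording is not recoverable.

```latex
\[
\mathrm{sim}(A,B) \;=\;
\frac{\sum_{s} \min\bigl(f_A(s), f_B(s)\bigr)}
     {\sum_{s} \max\bigl(f_A(s), f_B(s)\bigr)}
\;\le\; \frac{|A|}{|B|},
\qquad |A| = \sum_{s} f_A(s), \quad |B| = \sum_{s} f_B(s) \ge |A|
\]
```

The same cardinality argument carries over: the numerator is at most |A| and the denominator is at least |B|.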

Partitioning the Collection (figure: partitions S_i and S_j along the "#sigs per doc" axis; compare S_i with S_j?)

Partitioning the Collection (figure as before). Also: partition widths should be a function of τ.

Optimal Partitioning. But expensive to solve exactly …

Approximate Solution: “Starting with p_0 = 1, for any given p_k, choose p_{k+1} as the smallest integer p_{k+1} > p_k s.t. p_{k+1} − p_k > (1 − τ) p_{k+1}.” E.g. (for τ = 0.7): p_0 = 1, p_1 = 3, p_2 = 6, p_3 = 10, …, p_7 = 43, p_8 = 59, …
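The quoted rule can be sketched as a greedy loop. This is a literal reading of the inequality only: the function name `partition_boundaries` and the `max_length` cut-off are assumptions, and for τ = 0.7 a literal reading yields 1, 2, 3, 5, 8, … rather than the boundaries listed in the slide's example, so the authors' procedure presumably relies on details that did not survive in this transcript.

```python
from fractions import Fraction


def partition_boundaries(tau, max_length):
    """Boundaries p_0 = 1 < p_1 < ... <= max_length, applying the quoted rule literally."""
    tau = Fraction(tau)  # exact arithmetic avoids float rounding at the inequality
    if not 0 < tau <= 1:
        raise ValueError("tau must be in (0, 1]")
    boundaries = [1]
    while True:
        p_k = boundaries[-1]
        p_next = p_k + 1
        # Smallest integer p_next > p_k with p_next - p_k > (1 - tau) * p_next.
        while not (p_next - p_k > (1 - tau) * p_next):
            p_next += 1
        if p_next > max_length:
            return boundaries
        boundaries.append(p_next)


print(partition_boundaries(Fraction(7, 10), 60))
```

Bucket widths grow with the signature count because the inequality is relative to p_{k+1}: longer documents tolerate a larger absolute difference in length before the threshold τ becomes unreachable.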

Partitioning Effects (figure panels: uniform bucket widths vs. progressively increasing bucket widths; axes: #docs over #sigs)

… but

Inverted Index Pruning (figure: posting lists of document:frequency entries within partition k, e.g. for the signatures the:campaign and an:attack)

Deduplication Example
Given: d_1 = {s_1:5, s_2:4, s_3:4}, |d_1| = 13; d_2 = {s_1:8, s_2:4}, |d_2| = 12; d_3 = {s_1:4, s_2:5, s_3:5}, |d_3| = 14
Threshold: τ = 0.8
Break if: δ_1 + δ_2 > (1 − τ)|d_i|
Deduplicate d_1: 1) δ_1 = 0, δ_2 = 1 -> sim(d_1, d_3) = 0.8; 2) d_1 = d_1 -> continue; 3) δ_1 = 4, δ_2 = 0 -> break!
(figure: posting lists for s_1, s_2, s_3 within partition k, between p_k and p_{k+1})
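The numbers in this example can be checked with a short sketch. It assumes that similarity is the weighted (multiset) Jaccard, and that δ_1 and δ_2 are the frequency surpluses of d_i over d_j and of d_j over d_i for the signature s_3; that reading of δ and the helper names `weighted_jaccard`, `deltas`, and `should_break` are assumptions, chosen because they reproduce the values in steps 1 and 3.

```python
from collections import Counter

TAU = 0.8
docs = {
    "d1": Counter({"s1": 5, "s2": 4, "s3": 4}),
    "d2": Counter({"s1": 8, "s2": 4}),
    "d3": Counter({"s1": 4, "s2": 5, "s3": 5}),
}


def weighted_jaccard(a, b):
    """Sum of per-signature minima divided by sum of per-signature maxima."""
    keys = set(a) | set(b)
    return sum(min(a[k], b[k]) for k in keys) / sum(max(a[k], b[k]) for k in keys)


def deltas(d_i, d_j, sig):
    """Assumed reading: surplus of d_i over d_j for `sig`, and of d_j over d_i."""
    return max(0, d_i[sig] - d_j[sig]), max(0, d_j[sig] - d_i[sig])


def should_break(d_i, d_j, sig, tau=TAU):
    """Break condition from the slide: delta_1 + delta_2 > (1 - tau) * |d_i|."""
    delta_1, delta_2 = deltas(d_i, d_j, sig)
    return delta_1 + delta_2 > (1 - tau) * sum(d_i.values())


d1 = docs["d1"]
for name in ("d3", "d2"):
    other = docs[name]
    print(name, deltas(d1, other, "s3"), should_break(d1, other, "s3"),
          round(weighted_jaccard(d1, other), 4))
```

Under these assumptions the comparison with d_3 proceeds to a full similarity computation, while the comparison with d_2 is cut off because the accumulated difference already exceeds (1 − τ)|d_1| = 2.6.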

SpotSigs Deduplication Algorithm

Experiments

Competitors

“Gold Set” of News Articles. Macro-avg. cosine ≈ 0.64

SpotSigs vs. Shingling – Gold Set (figure panels: using (weighted) Jaccard similarity; using cosine similarity (no pruning!))

Runtime Results – TREC WT10g: SpotSigs vs. LSH-S using I-Match-S as recall base

Tuning I-Match & LSH on the Gold Set. Tuning I-Match-S: varying IDF thresholds. Tuning LSH-S: varying #hash functions l for fixed #MinHashes k = 6

Summary – Gold Set: summary of algorithms at their best F1 spots (τ = 0.44 for SpotSigs & LSH)

Summary – TREC WT10g: relative recall of SpotSigs & LSH using I-Match-S as recall base at τ = 1.0 and τ = 0.9

Conclusions & Outlook

Related Work
