Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Spot Sigs
1. SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir 2008, Singapore
2. Near-Duplicate News Articles (I) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
3. Near-Duplicate News Articles (II) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
4. Our Setting June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
5.
6.
7. Case (I): What’s different about the core contents? = stopword occurrences: the , that , {be} , {have} June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
8. Case(II): Do not consider for deduplication! no occurrences of: the, that , {be} , {have} June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
9.
10.
11.
12.
13.
14. Done? SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections June 4, 2009
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26. Deduplication Example Deduplicate d 1 S 3 : 1) δ 1 =0, δ 2 =1 -> sim(d 1 ,d 3 ) = 0.8 2) d 1 =d 1 -> continue 3) δ 1 =4, δ 2 =0 -> break! June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections δ 2 =1 δ 1 =4 Given: d 1 = {s 1 :5, s 2 :4, s 3 :4} , |d 1 |=13 d 2 = {s 1 :8, s 2 :4}, |d 2 |=12 d 3 = {s 1 :4, s 2 :5, s 3 :5}, |d 3 |=13 Threshold: τ = 0.8 Break if: δ 1 + δ 2 > (1- τ )|d i | p k p k+1 … … d 3 :5 d 2 :8 d 3 :4 d 1 :5 d 1 :4 Partition k s 3 s 1 d 3 :5 d 3 :4 d 1 :4 s 2
27.
28.
29.
30.
31. SpotSigs vs. Shingling – Gold Set Using (weighted) Jaccard similarity Using Cosine similarity (no pruning!) June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections
32. Runtime Results – TREC WT10g June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SpotSigs vs. LSH-S using I-Match-S as recall base
33.
34. Summary – Gold Set June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Summary of algorithms at their best F1 spots ( τ = 0.44 for SpotSigs & LSH)
35. Summary – TREC WT10g June 4, 2009 SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Relative recall of SpotSigs & LSH using I-Match-S as recall base at τ = 1.0 and τ = 0.9