Motivation: Privacy-preserving data mining
Share textual data for mutual benefit, the general good, or contractual reasons.
But not all of it can be shared:
  text analytics on private documents
  marketplace scenarios [Cancedda ACL 2012]
  copyright concerns
Problem
1. Given n-gram information about a document d, how well can we reconstruct d?
2. If I want or have to share n-gram statistics, what is a good strategy to avoid reconstruction while preserving the utility of the data?
Example
s = $ a rose rose is a rose is a rose #
2-grams (with counts):
  $ a        1
  a rose     3
  rose rose  1
  rose is    2
  is a       2
  rose #     1
Note that the same 2-grams are obtained starting from:
s = $ a rose is a rose rose is a rose #
s = $ a rose is a rose is a rose rose #
⇒ Find large chunks of text whose presence we are certain of
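The claim that all three sentences share the same 2-gram counts can be checked with a short sketch. The helper name `ngram_counts` and the whitespace tokenization are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count the n-grams (as token tuples) of a whitespace-tokenized text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

original = "$ a rose rose is a rose is a rose #"
variants = [
    "$ a rose is a rose rose is a rose #",
    "$ a rose is a rose is a rose rose #",
]

counts = ngram_counts(original)
for gram, freq in counts.items():
    print(" ".join(gram), freq)

# True when every variant yields exactly the same 2-gram multiset.
same = all(ngram_counts(v) == counts for v in variants)
print(same)
```

Because the three counters compare equal, the 2-gram list alone cannot tell these sentences apart.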
Problem Encoding
An n-gram corpus is encoded as a graph, a subgraph of the de Bruijn graph, where edges correspond to n-grams.
Graph for the example (nodes: 0 = $, 1 = a, 2 = rose, 3 = is, 4 = #; each edge is labeled with its n-gram and count):
  0 → 1: $ a, 1
  1 → 2: a rose, 3
  2 → 2: rose rose, 1
  2 → 3: rose is, 2
  3 → 1: is a, 2
  2 → 4: rose #, 1
A path in the graph spells out text: [2, 2, 3, 1] → rose rose is a
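As a minimal sketch of this encoding, the code below maps each n-gram to an edge from its (n-1)-gram prefix to its (n-1)-gram suffix and spells out the text along a node path. The function names `de_bruijn_edges` and `path_to_text` are hypothetical, and nodes are represented by token tuples rather than by the integer ids used on the slide.

```python
def de_bruijn_edges(ngram_counts):
    """Turn {n-gram tuple: count} into {(source node, target node): count}.

    Each node is an (n-1)-gram; an n-gram links its prefix to its suffix.
    """
    return {(gram[:-1], gram[1:]): count for gram, count in ngram_counts.items()}

def path_to_text(node_path):
    """Spell out the text along a path of overlapping (n-1)-gram nodes."""
    tokens = list(node_path[0])
    for node in node_path[1:]:
        tokens.append(node[-1])  # each step contributes one new token
    return " ".join(tokens)

bigram_counts = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
}
edges = de_bruijn_edges(bigram_counts)
print(edges[(("a",), ("rose",))])
print(path_to_text([("rose",), ("rose",), ("is",), ("a",)]))
```

The last call corresponds to the slide's path [2, 2, 3, 1].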
Problem encoding
Given such a graph, each Eulerian path gives a plausible reconstruction.
Problem: find those parts that are common to all of them.
BEST Theorem, 1951
Given an Eulerian graph G = (V, E), the number of different Eulerian cycles is
$$T_w(G) \cdot \prod_{v \in V} (d(v) - 1)!$$
where $T_w(G)$ is the number of trees directed towards the root at a fixed node w.
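To make the BEST formula concrete, here is a sketch that evaluates it on the rose graph. Two assumptions are not in the slides: a hypothetical closing edge `# -> $` is added so that the graph becomes Eulerian, and $T_w(G)$ is obtained from the matrix-tree theorem (determinant of the out-degree Laplacian with the root's row and column removed).

```python
from fractions import Fraction
from math import factorial

# Example graph plus an assumed closing edge "# -> $" to make it Eulerian.
edges = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
    ("#", "$"): 1,
}
nodes = sorted({v for e in edges for v in e})

def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    det = Fraction(1)
    for col in range(len(m)):
        pivot = next((r for r in range(col, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, len(m)):
            factor = m[r][col] / m[col][col]
            for c in range(col, len(m)):
                m[r][c] -= factor * m[col][c]
    return det

def count_eulerian_cycles(edges, nodes, root):
    """Return (T_w, number of Eulerian cycles) using the BEST formula."""
    out_deg = {v: 0 for v in nodes}
    for (src, _), mult in edges.items():
        out_deg[src] += mult
    # Laplacian D_out - A without the root; its determinant counts
    # spanning trees directed towards the root.
    rest = [v for v in nodes if v != root]
    lap = [[(out_deg[u] if u == v else 0) - edges.get((u, v), 0) for v in rest]
           for u in rest]
    trees = determinant(lap)
    cycles = trees
    for v in nodes:
        cycles *= factorial(out_deg[v] - 1)
    return int(trees), int(cycles)

trees, cycles = count_eulerian_cycles(edges, nodes, root="$")
print(trees, cycles)
```

The count treats parallel edges as distinct. In this example, dividing the result by 3! · 2! · 2! = 24 (the orderings of the parallel copies of "a rose", "rose is" and "is a", which do not change the text) leaves 3, which agrees with the three sentences listed in the example.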
Problem Encoding
In the same graph: [0, 1, 2] → $ a rose
Definitions
ec(G): the set of all Eulerian paths of G
Given a path $c = e_1, \dots, e_n$:
  $\ell(c) = [label(e_1), \dots, label(e_n)]$
  $s(c) = label(e_1).label(e_2). \cdots .label(e_n)$ (overlapping concatenation)
Given G, we want $G^*$ such that:
1. $G^*$ is equivalent to G: $\{s(c) : c \in ec(G)\} = \{s(c) : c \in ec(G^*)\}$
2. $G^*$ is irreducible: $\nexists\, e_1, e_2 \in E^*$ such that $[label(e_1), label(e_2)]$ appears in every $\ell(c)$, $c \in ec(G^*)$
Given $G^*$, we can just read the maximal blocks from the edge labels.
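The overlapping concatenation s(c) can be sketched as follows. The function name `overlap_concat` and its explicit `overlap` parameter (the node width, n - 1 tokens, assumed to be at least 1) are illustrative choices, not notation from the slides.

```python
def overlap_concat(labels, overlap):
    """s(c): join edge labels, dropping the `overlap` tokens shared with the text so far."""
    tokens = list(labels[0])
    for label in labels[1:]:
        if tokens[-overlap:] != list(label[:overlap]):
            raise ValueError(f"{label!r} does not continue the text so far")
        tokens.extend(label[overlap:])
    return tokens

# Labels along one Eulerian path of the 2-gram graph (overlap of one token).
path_labels = [lbl.split() for lbl in [
    "$ a", "a rose", "rose rose", "rose is", "is a",
    "a rose", "rose is", "is a", "a rose", "rose #",
]]
text = " ".join(overlap_concat(path_labels, overlap=1))
print(text)

# Longer labels, as produced by merging edges, chain in the same way.
merged_labels = [lbl.split() for lbl in [
    "$ a rose", "rose rose", "rose is a rose", "rose is a rose", "rose #",
]]
merged_text = " ".join(overlap_concat(merged_labels, overlap=1))
print(merged_text == text)
```

The overlap is passed explicitly rather than detected, because the longest common suffix/prefix of two labels can be longer than the shared node (for instance two consecutive "rose rose" edges).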
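Example
s = $ a rose rose is a rose is a rose #
Reduced graph (remaining nodes 0, 2, 4; each edge is labeled with its text block and count):
  0 → 2: $ a rose, 1
  2 → 2: rose rose, 1
  2 → 2: rose is a rose, 2
  2 → 4: rose #, 1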
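For a graph this small, the blocks can be sanity-checked by brute force: enumerate every text consistent with the 2-gram counts and confirm that each block occurs in all of them. This is only a sketch with a hypothetical helper `reconstructions`; it takes exponential time in general and is not the reduction procedure of the slides.

```python
from collections import Counter

def reconstructions(counts, start="$", end="#"):
    """Enumerate every text that uses each 2-gram exactly as often as counted."""
    remaining = Counter(counts)
    total = sum(counts.values())
    results = []

    def extend(path):
        if len(path) == total + 1:  # all 2-grams have been used
            if path[-1] == end:
                results.append(" ".join(path))
            return
        for (src, dst) in sorted(remaining):
            if src == path[-1] and remaining[(src, dst)] > 0:
                remaining[(src, dst)] -= 1
                extend(path + [dst])
                remaining[(src, dst)] += 1  # backtrack

    extend([start])
    return results

bigram_counts = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
}
texts = reconstructions(bigram_counts)
blocks = ["$ a rose", "rose rose", "rose is a rose", "rose #"]

# Pad with spaces so that blocks only match on token boundaries.
in_all = all(f" {block} " in f" {text} " for text in texts for block in blocks)
print(len(texts), in_all)
```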
Rule 1 (Pigeonhole rule)
[Figure not recovered] In the pictured example, α.δ occurs at least 4 times.
Rule 2: non-local information
x is an "articulation point" [Tarjan 1971]
[Figure not recovered] In the pictured example, α.β occurs at least once.
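To illustrate only the notion of an articulation point (a node whose removal disconnects the underlying undirected graph), here is a brute-force sketch. It is deliberately not Tarjan's linear-time algorithm, it does not implement Rule 2 itself, and it runs on the example path graph without any closing edge; all of these are assumptions made for illustration.

```python
def articulation_points(edges):
    """Nodes whose removal disconnects the underlying undirected graph (brute force)."""
    nodes = {v for e in edges for v in e}
    neighbours = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:  # self-loops do not affect connectivity
            neighbours[u].add(v)
            neighbours[v].add(u)

    def connected_without(excluded):
        keep = nodes - {excluded}
        if not keep:
            return True
        start = next(iter(keep))
        seen, stack = {start}, [start]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt in keep and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen == keep

    return {v for v in nodes if not connected_without(v)}

rose_edges = [
    ("$", "a"), ("a", "rose"), ("rose", "rose"),
    ("rose", "is"), ("is", "a"), ("rose", "#"),
]
points = articulation_points(rose_edges)
print(sorted(points))
```

Removing "a" isolates "$" and removing "rose" isolates "#", so both are articulation points here, while "is" is not.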
Main Result
Theorem
Both rules are correct and complete: applying them to G yields a graph G∗ that is equivalent to G and irreducible.
Experiments
Project Gutenberg: out-of-copyright (US) books; 1,000 randomly chosen single books.
[Figure: mean of average and maximal block size; panels: average, maximal]
Increasing Diversity
Instead of running on a single book, run on the concatenation of k books.
[Figure: average number of large blocks (≥ 100)]
Remove completeness assumption
Remove those n-grams whose frequency is < M.
[Figures (n = 5): mean / max vs M; error rate vs M]
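The frequency cut-off amounts to a simple filter over the n-gram counts. The sketch below uses a hypothetical helper `drop_rare_ngrams` and the rose 2-grams rather than the 5-grams of the experiments.

```python
from collections import Counter

def drop_rare_ngrams(counts, min_freq):
    """Keep only the n-grams whose frequency is at least min_freq (the slide's M)."""
    return Counter({gram: f for gram, f in counts.items() if f >= min_freq})

bigram_counts = Counter({
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
})
shared = drop_rare_ngrams(bigram_counts, min_freq=2)
print(sorted(shared.items()))
```

With M = 2 only "a rose", "rose is" and "is a" survive, so the shared graph no longer admits an Eulerian path from $ to #.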
A better noisifying strategy
Instead of removing n-grams, add strategically chosen n-grams.
[Figure: removing edges vs adding edges]
Keep utility
[Figures: Removing; Adding]
Conclusions
How well can textual documents be reconstructed from their list of n-grams?
Resilience to the standard noisifying approach
Better noisifying by adding (instead of removing) n-grams
Questions?
Appendix
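Rule 1 (Pigeonhole rule)
Incoming edges of x: $(\langle v_1, x, \ell_1 \rangle, p_1), \dots, (\langle v_n, x, \ell_n \rangle, p_n)$
Outgoing edges of x: $(\langle x, w_1, t_1 \rangle, k_1), \dots, (\langle x, w_m, t_m \rangle, k_m)$
If $\exists i, j$ such that $p_i > d(x) - k_j$, then
$E' = (E \setminus \{(\langle v_i, x, \ell_i \rangle, a), (\langle x, w_j, t_j \rangle, a)\}) \cup \{(\langle v_i, w_j, \ell_i.t_j \rangle, a)\}$, where $a = p_i - (d(x) - k_j)$.
If $a = d(x)$ then $V' = V \setminus \{x\}$, else $V' = V$.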
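The condition of Rule 1 can be evaluated at a single node with a few lines of code. The sketch below only computes the lower bound a = p_i - (d(x) - k_j) for each pair of incoming and outgoing labels; it does not rewrite the graph. The function name `pigeonhole_merges` and the dictionary representation are assumptions, and d(x) is taken as the out-degree of a balanced node.

```python
def pigeonhole_merges(in_edges, out_edges):
    """Rule 1 at one node x.

    in_edges:  {label: multiplicity p_i} for edges entering x
    out_edges: {label: multiplicity k_j} for edges leaving x
    Returns {(in_label, out_label): a} for every pair with
    a = p_i - (d(x) - k_j) > 0, i.e. in_label.out_label occurs at least a times.
    """
    degree = sum(out_edges.values())  # d(x), assuming in-degree == out-degree
    merges = {}
    for in_label, p in in_edges.items():
        for out_label, k in out_edges.items():
            a = p - (degree - k)
            if a > 0:
                merges[(in_label, out_label)] = a
    return merges

# Node "a" of the rose graph: everything entering it must continue with "a rose".
at_a = pigeonhole_merges({"$ a": 1, "is a": 2}, {"a rose": 3})
print(at_a)

# Node "rose": a self-loop counts as both an incoming and an outgoing edge.
at_rose = pigeonhole_merges(
    {"a rose": 3, "rose rose": 1},
    {"rose rose": 1, "rose is": 2, "rose #": 1},
)
print(at_rose)
```

At node "a" the bounds add up to d(x) = 3, which corresponds to the case where x can be removed after merging. The returned pairs are individual lower bounds; applying one merge changes the multiplicities, so the rule would be re-evaluated after each rewrite.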
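Rule 2: non-local information
x is a division point dividing G into components $G_1$, $G_2$. If $\hat{d}_{in}^{G_1}(x) = 1$ and $\hat{d}_{out}^{G_2}(x) = 1$ (with edges $(\langle v, x, \ell \rangle, p)$ and $(\langle x, w, t \rangle, k)$), then
$E' = (E \setminus \{(\langle v, x, \ell \rangle, 1), (\langle x, w, t \rangle, 1)\}) \cup \{(\langle v, w, \ell.t \rangle, 1)\}$
$V' = V$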
Increasing Diversity
Instead of running on a single book, run on the concatenation of k books.
[Figure: mean of average block size]

Reconstructing Textual Documents from n-grams
