Slides for the paper
Reconstructing Textual Documents from n-grams
KDD 2015 (Knowledge Discovery and Data Mining)
http://dl.acm.org/citation.cfm?id=2783361
4. Motivation: Privacy-preserving data mining
Share textual data for mutual benefit, the general good, or contractual reasons
But not all of it:
text analytics on private documents
marketplace scenarios [Cancedda ACL 2012]
copyright concerns
5. Problem
1 Given the n-gram information of a document d, how well can we reconstruct d?
2 If we want or have to share n-gram statistics, what is a good strategy to avoid reconstruction while preserving the utility of the data?
9. Example
s = $ a rose rose is a rose is a rose #
2-grams:
$ a 1
a rose 3
rose rose 1
rose is 2
is a 2
rose # 1
Note that the same 2-grams are obtained starting from:
s = $ a rose is a rose rose is a rose #
s = $ a rose is a rose is a rose rose #
⇒ Find large chunks of text whose presence we are certain of
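The claim on this slide is easy to check directly. A minimal sketch (the `ngrams` helper is ours, for illustration, not from the paper) confirms that all three sentences share exactly the same 2-gram counts:

```python
from collections import Counter

def ngrams(s, n=2):
    """Return the multiset of word n-grams of a whitespace-tokenized string."""
    words = s.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

s1 = "$ a rose rose is a rose is a rose #"
s2 = "$ a rose is a rose rose is a rose #"
s3 = "$ a rose is a rose is a rose rose #"

# All three sentences produce identical 2-gram counts.
assert ngrams(s1) == ngrams(s2) == ngrams(s3)
print(ngrams(s1)[("a", "rose")])  # -> 3
```

So the 2-gram statistics alone cannot distinguish which of the three sentences was the original document.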
10. Problem Encoding
An n-gram corpus is encoded as a graph, a subgraph of the de Bruijn graph, where edges correspond to n-grams
[Figure: de Bruijn graph of the example. Nodes 0 = $, 1 = a, 2 = rose, 3 = is, 4 = #; edges (label, count): 0→1 ($ a, 1), 1→2 (a rose, 3), 2→2 (rose rose, 1), 2→3 (rose is, 2), 3→1 (is a, 2), 2→4 (rose #, 1)]
11. Problem Encoding
[2, 2, 3, 1] → rose rose is a
[Figure: the same de Bruijn graph; the node path 2 → 2 → 3 → 1 traverses the edges rose rose, rose is, is a]
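A minimal sketch of this encoding, assuming the node numbering 0 = $, 1 = a, 2 = rose, 3 = is, 4 = # used in the figures (the `edges` dictionary and `read_path` helper are illustrative, not from the paper):

```python
# Nodes are words; each 2-gram (u, w) with count c becomes an edge
# u -> w of multiplicity c in the de Bruijn graph.
words = ["$", "a", "rose", "is", "#"]
edges = {  # (source, target) -> count
    (0, 1): 1,  # "$ a"
    (1, 2): 3,  # "a rose"
    (2, 2): 1,  # "rose rose"
    (2, 3): 2,  # "rose is"
    (3, 1): 2,  # "is a"
    (2, 4): 1,  # "rose #"
}

def read_path(path):
    """Overlapping concatenation of the 2-gram labels along a node path."""
    return " ".join(words[v] for v in path)

print(read_path([2, 2, 3, 1]))  # -> "rose rose is a"
```

For 2-grams, consecutive edge labels overlap on one word, so a node path spells its text simply by listing the node words.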
14. Problem encoding
Given such a graph, each Eulerian path gives a plausible reconstruction
Problem: Find those parts that are common to all of them
BEST Theorem, 1951
Given an Eulerian graph G = (V, E), the number of different Eulerian cycles is
T_w(G) · ∏_{v ∈ V} (d(v) − 1)!
where T_w(G) is the number of spanning trees directed towards the root at a fixed node w
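The BEST theorem gives the count in closed form; for a graph as small as the running example we can also cross-check by brute-force enumeration of Eulerian paths (the enumerator below is our own sketch, not the paper's algorithm):

```python
# Enumerate every Eulerian path of the example graph starting at node 0
# ($) and collect the distinct strings they spell.
words = ["$", "a", "rose", "is", "#"]
edges = {(0, 1): 1, (1, 2): 3, (2, 2): 1, (2, 3): 2, (3, 1): 2, (2, 4): 1}

def eulerian_strings(node, remaining, trail):
    """Yield the string spelled by each Eulerian path extending `trail`."""
    if not any(remaining.values()):
        yield " ".join(words[v] for v in trail)
        return
    for (u, w), c in remaining.items():
        if u == node and c > 0:
            remaining[(u, w)] -= 1          # take one copy of the edge
            yield from eulerian_strings(w, remaining, trail + [w])
            remaining[(u, w)] += 1          # backtrack

reconstructions = set(eulerian_strings(0, dict(edges), [0]))
print(len(reconstructions))  # -> 3
```

Exactly the three sentences from the earlier slide come out, confirming that the 2-gram corpus admits three plausible reconstructions.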
15. Problem Encoding
[0, 1, 2] → $ a rose
[Figure: the same de Bruijn graph; the node path 0 → 1 → 2 traverses the edges $ a, a rose]
18. Definitions
ec(G): the set of all Eulerian paths of G
given a path c = e1, . . . , en: ℓ(c) = [label(e1), . . . , label(en)]
s(c) = label(e1).label(e2). ... .label(en) (overlapping concatenation)
Given G, we want G* s.t.:
1 it is equivalent:
{s(c) : c ∈ ec(G)} = {s(c) : c ∈ ec(G*)}
2 it is irreducible:
∄ e1, e2 ∈ E* : [label(e1), label(e2)] appears in all ℓ(c), c ∈ ec(G*)
Given G*, we can just read maximal blocks from the labels.
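The overlapping concatenation "." can be sketched as follows (the `overlap_concat` name is ours; for 2-grams consecutive labels overlap on n − 1 = 1 word):

```python
def overlap_concat(l1, l2, n=2):
    """Overlapping concatenation of two labels: the last n-1 words of l1
    must equal the first n-1 words of l2 (here n = 2)."""
    w1, w2 = l1.split(), l2.split()
    assert w1[-(n - 1):] == w2[:n - 1], "labels do not overlap"
    return " ".join(w1 + w2[n - 1:])

print(overlap_concat("rose is", "is a"))  # -> "rose is a"
```

The same operation applies to already-merged labels, e.g. "rose is a" . "a rose" gives "rose is a rose", which is how the labels of the reduced graph below arise.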
19. Example
s = $ a rose rose is a rose is a rose #
[Figure: the reduced graph G*. Nodes 0, 2, 4; edges (label, count): 0→2 ($ a rose, 1), 2→2 (rose rose, 1), 2→2 (rose is a rose, 2), 2→4 (rose #, 1)]
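A brute-force check of the equivalence property: enumerating the Eulerian paths of this reduced graph spells exactly the same three reconstructions as the original graph (the representation and helper names are illustrative, not the paper's):

```python
def spell(labels):
    """Overlapping concatenation of a label sequence (one-word overlap)."""
    words = labels[0].split()
    for lab in labels[1:]:
        words += lab.split()[1:]
    return " ".join(words)

def eulerian_spellings(node, remaining, labels, out):
    """Collect the string spelled by every Eulerian path from `node`."""
    if not any(remaining.values()):
        out.add(spell(labels))
        return
    for (u, w, lab), c in list(remaining.items()):
        if u == node and c > 0:
            remaining[(u, w, lab)] -= 1
            eulerian_spellings(w, remaining, labels + [lab], out)
            remaining[(u, w, lab)] += 1

reduced = {(0, 2, "$ a rose"): 1, (2, 2, "rose rose"): 1,
           (2, 2, "rose is a rose"): 2, (2, 4, "rose #"): 1}
out = set()
eulerian_spellings(0, dict(reduced), [], out)
print(len(out))  # -> 3
```

The maximal blocks "$ a rose", "rose is a rose", "rose rose", and "rose #" can then be read directly off the edge labels.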
39. Conclusions
How well can textual documents be reconstructed from their list of n-grams?
Resilience to the standard noisifying approach
Better noisifying by adding (instead of removing) n-grams
42. Rule 1 (Pigeonhole rule)
Incoming edges of x: (⟨v1, x, ℓ1⟩, p1), . . . , (⟨vn, x, ℓn⟩, pn)
Outgoing edges of x: (⟨x, w1, t1⟩, k1), . . . , (⟨x, wm, tm⟩, km)
If ∃ i, j such that pi > d(x) − kj, then by pigeonhole at least a = pi − (d(x) − kj) traversals of the incoming edge i must continue through the outgoing edge j, so
E = (E \ {(⟨vi, x, ℓi⟩, a), (⟨x, wj, tj⟩, a)}) ∪ {(⟨vi, wj, ℓi.tj⟩, a)}
If a = d(x) then V = V \ {x}, else V = V
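A sketch of a single application of the pigeonhole rule, assuming edges are stored as {(source, target, label): multiplicity} (this representation and the `pigeonhole` helper are ours, not the paper's):

```python
def overlap_concat(l1, l2):
    """Overlapping concatenation of two labels (one-word overlap)."""
    w1, w2 = l1.split(), l2.split()
    assert w1[-1] == w2[0]
    return " ".join(w1 + w2[1:])

def pigeonhole(edges, x):
    """One pass of Rule 1 at node x; `edges` is updated in place.

    Whenever an incoming edge (multiplicity p) and an outgoing edge
    (multiplicity k) satisfy p > d(x) - k, a = p - (d(x) - k) copies
    of them are replaced by the merged edge.
    """
    for (v, xi, l) in [e for e in edges if e[1] == x]:
        for (xo, w, t) in [e for e in edges if e[0] == x]:
            if (v, xi, l) == (xo, w, t):
                continue  # skip pairing a self-loop with itself
            p = edges.get((v, xi, l), 0)
            k = edges.get((xo, w, t), 0)
            d = sum(m for (u, _, _), m in edges.items() if u == x)
            a = p - (d - k)
            if a > 0:
                edges[(v, xi, l)] -= a
                edges[(xo, w, t)] -= a
                key = (v, w, overlap_concat(l, t))
                edges[key] = edges.get(key, 0) + a
    for e in [e for e, m in edges.items() if m == 0]:
        del edges[e]  # if x loses all its edges, it can be dropped from V

# Node 3 ("is") in the example: both "rose is" traversals must be
# followed by "is a", so the two edges merge completely.
edges = {(2, 3, "rose is"): 2, (3, 1, "is a"): 2}
pigeonhole(edges, 3)
print(edges)  # -> {(2, 1, 'rose is a'): 2}
```

In the example, a = p = d(x), so node 3 is eliminated entirely, exactly as in the reduced graph of the earlier example slide.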
43. Rule 2: non-local information
x is a division point, dividing G into components G1, G2. If d̂^in_{G1}(x) = 1 and d̂^out_{G2}(x) = 1 (with edges (⟨v, x, ℓ⟩, p) and (⟨x, w, t⟩, k)), then
E = (E \ {(⟨v, x, ℓ⟩, 1), (⟨x, w, t⟩, 1)}) ∪ {(⟨v, w, ℓ.t⟩, 1)}
V = V