This document presents a content inspection technique for detecting sensitive data leaks. Samples drawn from the sensitive data and from the inspected content are aligned to measure their similarity. Two algorithms are used: a comparable sampling algorithm and a sampling oblivious alignment algorithm. The alignment method enables high-speed security scanning while achieving high detection specificity and tolerance to pattern variation.
2. INTRODUCTION
SENSITIVE DATA IN COMPANIES
DATA LEAKAGE: HOW?
DANGERS
TOWARDS SECURITY
EXISTING SYSTEM
PROPOSED SYSTEM
INTO THE ALGORITHM
CONCLUSION
3. DATA LEAKAGE:
Data leakage is the unauthorized transmission of sensitive data or information from within an organization to an external destination.
5. •In the course of business, data must be handed over to trusted third parties for some operations.
•Sometimes these trusted third parties may act as points of data leakage.
•Data leakage mainly happens due to human error.
6. •A hospital may give patient records to researchers who will devise new treatments.
•A company may have partnerships with other companies that require sharing of customer data.
•An enterprise may outsource its data processing, so data must be given to various other companies.
8. •The number of leaked sensitive data records has grown tenfold in recent years.
•Accidental data leakage now exceeds the risk posed by vulnerable software.
•Sensitive data leakage is more likely where there is no end-to-end encryption (e.g., PGP, Pretty Good Privacy).
9. •Prevent clear-text sensitive data from direct access.
•Deploy a screening tool:
-To scan computer file systems.
-To scan server storage.
-To inspect outbound network traffic.
•Data leak detection differs from antivirus and network intrusion detection systems (AV & NIDS).
10. ->New security requirements
&
->Algorithmic challenges
Algorithmic challenges:
-Data transformation
-Scalability
•Direct use of automata-based string matching is not possible.
11. The existing approach is based on set intersection, performed on two sets of n-grams: one from the content and one from the sensitive data.
This method is used to detect similar documents on:
•The web.
•Shared malicious traffic patterns.
•Malware.
•E-mail spam.
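As a minimal sketch of the set-intersection approach described above (function names and sample strings are illustrative, not from the original system), character n-grams are extracted from both sides and compared with a Jaccard similarity:

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Illustrative data: the sensitive record appears verbatim in the content.
sensitive = ngrams("patient record 4711")
content = ngrams("mail body containing patient record 4711 verbatim")
score = jaccard(sensitive, content)  # high overlap signals a possible leak
```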
12. Symantec DLP
Identity Finder
Global Velocity
GoCloud DLP, etc.
13. Set intersection is orderless:
the ordering of shared n-grams is not analyzed.
It generates false alerts when n is set to a small value.
It cannot detect partial data leaks.
It is therefore not an adequate method.
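To make the orderless limitation concrete, here is a small sketch (illustrative data, with single tokens standing in for small n-grams): content whose tokens are reordered is indistinguishable from the sensitive data under an order-blind set comparison, even though its meaning differs:

```python
def token_set(text):
    """Order-blind representation: the set of tokens (1-grams)."""
    return set(text.split())

sensitive = "transfer 500 to account 42"
reordered = "account 42 to transfer 500"  # same tokens, different order

# An orderless comparison cannot tell these apart:
same = token_set(sensitive) == token_set(reordered)
```

This is exactly the kind of case where a false alert is raised, and it motivates the order-aware alignment used by the proposed system.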
14. The proposed system uses a sequence alignment algorithm, executed on:
•The sampled sensitive data sequence.
•The sampled content being inspected.
The alignment yields the amount of sensitive data present in the content, achieving higher accuracy.
15. The scalability issue is solved by sampling both the sensitive data and the content sequence before aligning. A pair of algorithms is used:
•Comparable Sampling Algorithm
•Sampling Oblivious Alignment Algorithm
This gives high detection specificity and tolerance to both pervasive and localized modifications.
16. o The Comparable Sampling Algorithm yields consistent samples of a sequence regardless of where the sampling starts and ends.
o The Sampling Oblivious Alignment Algorithm infers the similarity between the original unsampled sequences using dynamic programming.
17. In this method, both the sensitive data and the content sequence are sampled, and the alignment is performed on the sampled sequences.
A 'comparable sampling' property is used.
Both algorithms run significantly faster on a GPU than on a CPU, promising high-speed security scanning.
19. Requirements:
Definition 1: A substring is a consecutive segment of the original string.
Definition 2: A subsequence does not require its items to be consecutive in the original string.
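The two definitions can be illustrated with a short sketch (the helper names are hypothetical, chosen for this example):

```python
def is_substring(x, y):
    """Definition 1: x occurs as a consecutive segment of y."""
    return x in y

def is_subsequence(x, y):
    """Definition 2: the items of x appear in y in order,
    but not necessarily consecutively."""
    it = iter(y)
    return all(item in it for item in x)

# "bcd" is both a substring and a subsequence of "abcde";
# "ace" is a subsequence but not a substring.
```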
20. Definition 3: Given that string x is a substring of y, comparable sampling on x and y yields x' and y', where x' is similar to a substring of y'.
Definition 4: Given x as a substring of y, a subsequence-preserving sampling on x and y yields two subsequences x' and y', so that x' is a substring of y'.
21. The comparable sampling algorithm is deterministic and subsequence preserving.
It is unbiased.
It yields consistent samples of a sequence regardless of where the sampling starts and ends.
22. Input: an array S of items, a size |w| for a sliding window w, a selection function f(w, N) that selects the N smallest items from a window w, i.e., f = min(w, N)
Output: a sampled array T
1: initialize T as an empty array of size |S|
2: w ← read(S, |w|)
3: let w.head and w.tail be indices in S corresponding to the higher-indexed end and lower-indexed end of w, respectively
4: collection mc ← min(w, N)
5: while w is within the boundary of S do
23. 6: mp ← mc
7: move w toward high index by 1
8: mc ← min(w, N)
9: if mc ≠ mp then
10: item en ← collectionDiff(mc, mp)
11: item eo ← collectionDiff(mp, mc)
12: if en < eo then
13: write value en to T at w.head's position
14: else
15: write value eo to T at w.tail's position
16: end if
17: end if
18: end while
24. We run the sampling procedure with a sliding window of size 6 (i.e., |w| = 6) and N = 3. The input sequence is 1,5,1,9,8,5,3,2,4,8. The initial window is w = [1,5,1,9,8,5] and the collection mc = {1,1,5}.
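The pseudocode above can be transcribed into Python as a sketch. For simplicity the N smallest items are recomputed per window position rather than maintained incrementally, and None marks unsampled positions; variable names follow the pseudocode:

```python
from collections import Counter

def min_n(window, n):
    """Multiset of the n smallest items in the window (f = min(w, N))."""
    return Counter(sorted(window)[:n])

def comparable_sample(s, w_size, n):
    """Literal transcription of the sampling pass: slide the window by
    one item at a time and, whenever the collection of n smallest items
    changes, record the entering item (en) or the dropped item (eo),
    whichever is smaller."""
    t = [None] * len(s)              # sampled array T
    mc = min_n(s[0:w_size], n)
    tail = 0
    while tail + w_size < len(s):    # window stays within the boundary of S
        mp = mc
        tail += 1                    # move w toward high index by 1
        head = tail + w_size - 1
        mc = min_n(s[tail:head + 1], n)
        if mc != mp:
            en = next(iter((mc - mp).elements()))  # item new to the collection
            eo = next(iter((mp - mc).elements()))  # item that dropped out
            if en < eo:
                t[head] = en
            else:
                t[tail] = eo
    return t

# The worked example from the slide: |w| = 6, N = 3.
sampled = comparable_sample([1, 5, 1, 9, 8, 5, 3, 2, 4, 8], w_size=6, n=3)
```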
25. The complexity of the selection function is O(n log|w|) or O(n), where n is the size of the input and |w| is the size of the window. The factor O(log|w|) comes from maintaining the smallest N items within the window.
26. Requirements:
The algorithm runs on compact sampled sequences.
Extra fields are added to the scoring matrix cells in the dynamic programming.
An extra step in the recurrence relation updates the null region.
A complex weight function computes similarities between two null regions.
27. Order-aware comparison
High tolerance to pattern variation
Capability of detecting partial leaks
Consistency
28. Input: a weight function fw; the visited cells in the H matrix adjacent to H(i, j): H(i−1, j−1), H(i, j−1), and H(i−1, j); and the i-th and j-th items Lai and Lbj in the two sampled sequences La and Lb, respectively.
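The full sampling-oblivious recurrence (with its null-region fields and complex weight function) is beyond a short sketch, but the cell computation described above has the shape of a standard local-alignment recurrence: each H(i, j) is derived from its three visited neighbours and the pair of items Lai, Lbj. A simplified version, with an illustrative match/mismatch weight function standing in for fw, looks like this:

```python
def align_score(la, lb, match=2, mismatch=-1, gap=-1):
    """Simplified local-alignment recurrence: each cell h[i][j] is
    computed from h[i-1][j-1], h[i][j-1], and h[i-1][j], plus a weight
    for the item pair la[i-1], lb[j-1]. The real sampling-oblivious
    algorithm adds extra per-cell fields to handle unsampled (null)
    regions; those are omitted here."""
    rows, cols = len(la) + 1, len(lb) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            fw = match if la[i - 1] == lb[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + fw,  # diagonal: align the two items
                          h[i - 1][j] + gap,     # gap in lb
                          h[i][j - 1] + gap)     # gap in la
            best = max(best, h[i][j])
    return best
```

A high score indicates a long shared (possibly partial) region between the sampled sensitive sequence and the sampled content, which is what the alignment step reports.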
30. •We presented a content inspection technique for detecting sensitive data leaks.
•The detection approach is based on aligning two samples for similarity comparison.
•Our alignment method is useful for common data leak scenarios.