Data Deduplication
for Language
Documentation
UNDER THE GUIDANCE OF:-
DR. JAN CHOMICKI AND DR. JEFF GOOD
PRESENTED BY:
KAUSHAL HAKANI, SHAIL PARIKH, SHASHANK RALLAPALLI
Outline
 Introduction
 Challenges
 Steps followed
 Algorithms used
 Approach
 Experimental Results
 Limitations
 Conclusions
Introduction
 13 Villages
 7-9 “languages” spoken
 4 local isolates
 2 dialect clusters
 12000 people
 Localist attitudes
 Various classes of people collecting
data
Aim
 Detect duplicate files in the data obtained by the researchers in
Cameroon.
 Decide which files to keep and which to remove.
 Remove duplicate files (De-duplicate)
 Maintain information about provenance of the deleted data.
Dataset
Dataset (continued)
 Initial observations of the dataset reveal that it contains the
following types of files
 Audio/Visual
 Audio recordings
 Video recordings
 Photographs/Scanned images
 Textual
 Transcriptions (some time-aligned, XML)
 Questionnaire data
 Lexical data (e.g., vocabulary items in a database)
Dataset (continued)
 Metadata
 Contains information about the actual data files
 System-generated files
 Files generated by macOS (DS_Store)
We observed approximately 231 unique file extensions when we parsed
the dataset.
Challenges
 Lack of a standard naming convention.
 Decide on a suitable basis for de-duplication
 File-name based or file-content based
 Decide on a suitable factor for making this choice
 Get sample data to run different de-duplication techniques
Challenges (continued)
 Decide what de-duplication methods would be required
 Edit Distance
 Jaccard Similarity
 Checksum and examination of data within file.
 There were a few other challenges that we faced
 Come up with appropriate factors to decide what files to delete from
the dataset
 Moving files over different filesystems.
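The checksum option listed above can be illustrated with a short sketch. This is a hypothetical helper, not the code used in the project: the function name `file_checksum`, the choice of SHA-256, and the chunk size are all assumptions made for illustration.

```python
import hashlib


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Hypothetical helper: the slides do not say which checksum was considered.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large audio/video files are not loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest can be treated as having identical content, whatever their names. The trade-off is that every candidate file must be read in full.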
Steps
Initial Filtering
•Group by File Size
•Sampling
Sampled Data
•De-duplicate on file name?
•De-duplicate on file content?
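The initial filtering step (grouping by file size) might look roughly like the sketch below. The function names `group_by_size` and `candidate_groups` are hypothetical, and the sketch assumes sizes are compared in exact bytes, which the slides do not state.

```python
import os
from collections import defaultdict


def group_by_size(root):
    """Map file size in bytes to the list of paths under root with that size."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                groups[os.path.getsize(path)].append(path)
            except OSError:
                continue  # skip files that cannot be read
    return groups


def candidate_groups(groups):
    """Keep only the sizes shared by two or more files."""
    return {size: paths for size, paths in groups.items() if len(paths) > 1}
```

Only files that share a size with at least one other file need to go on to the name-based or content-based comparison, which keeps the later steps cheap.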
Steps
Experimental Observation
•De-duplicate based on file name
•Decide the de-duplication techniques to be used
Implementation
•Edit Distance
•Jaccard Similarity
•Custom Methods
Steps
Test sample data
•Results were satisfactory
•Also got data to compare results against
Ran on Actual Data
•Could potentially remove 384.41 GB out of a total of 928.45 GB. That is
about 41.4% of the data.
Algorithms
 Used the following standard de-duplication algorithms
 Edit-Distance
 Jaccard Similarity (Using n-grams)
 Also used specialized algorithms
 Copy removal (Special to dataset)
 Bus removal (again, a special method)
Edit-Distance
 This algorithm measures the dissimilarity between two strings.
 It calculates the cost of converting one string into the other.
 Each insert, delete and replacement has a cost of 1.
 For example:
String s1 = “Mail Juice-21.gif”
String s2 = “Mail Juice-18.gif”
Example
String1 = “Mail Juice-21.gif”
String2 = “Mail Juice-18.gif”
 Set the cost of insert = 1, delete = 1 and replacement = 1.
 Total cost of converting String1 to String2 is 2 (replace “2” with “1” and “1” with “8”).
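The edit-distance computation can be sketched as follows. This is the standard Levenshtein dynamic-programming formulation with unit costs, offered as an illustration rather than the presenters' actual implementation; the function name `edit_distance` is an assumption.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with insert, delete and replacement each costing 1."""
    # prev[j] holds the distance between the first i-1 chars of s1 and first j chars of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning i characters into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # replace c1 with c2, or keep a match
        prev = curr
    return prev[-1]


print(edit_distance("Mail Juice-21.gif", "Mail Juice-18.gif"))  # prints 2
```

A small distance relative to the length of the names suggests the two files may be variants of each other.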
Jaccard Coefficient
 This algorithm measures the similarity of two strings.
 It splits each string into overlapping grams of a chosen length k.
 It then measures how many grams of one string also appear in the
list of grams of the other string.
 Jaccard Coefficient = |S1 ∩ S2| / |S1 ∪ S2|
Example
String1 = MailJuice21
String2 = MailJuice18
Grams (k = 3):
String1 grams = [Mai, ail, il_, l_J, _Ju, Jui, uic, ice, ce_, e_2, _21, 21_]
String2 grams = [Mai, ail, il_, l_J, _Ju, Jui, uic, ice, ce_, e_1, _18, 18_]
|S1 ∪ S2| = 15
|S1 ∩ S2| = 9
Jaccard Coefficient = 9 / 15 = 0.6, i.e. the two names are 60% similar.
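The trigram computation in this example can be reproduced with a short sketch. It assumes, based on the grams listed in the slide, that separators and the end of the name are represented by "_", so the inputs are written as `Mail_Juice_21_` and `Mail_Juice_18_`. The function names `ngrams` and `jaccard` are hypothetical.

```python
def ngrams(s, k=3):
    """Return the set of overlapping k-character grams of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard coefficient of the k-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, k), ngrams(b, k)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both strings shorter than k: treated as not similar (assumption)
    return len(grams_a & grams_b) / len(union)


# Inputs padded with "_" to match the grams shown in the slide (assumption).
print(jaccard("Mail_Juice_21_", "Mail_Juice_18_"))  # prints 0.6
```

Unlike edit distance, a higher value here means the names are more alike, so any threshold has to be read in the opposite direction.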
Custom Methods
 There were certain cases where the files were duplicates but the names
were not the same.
 For example:
FILE NAME              FILE SIZE
FOO50407.JPG           1.7 MB
FOO50407 (COPY).WAV    1.7 MB
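A copy-removal check along these lines could be sketched as below. The slides do not describe the actual rule, so everything here is an assumption: the "(COPY)" marker is stripped case-insensitively, the extension is ignored (as in the example above), and sizes are compared in exact bytes. The names `base_stem` and `looks_like_copy` are hypothetical.

```python
import os
import re

# Matches a "(COPY)" marker and any whitespace before it, in any letter case.
_COPY_MARKER = re.compile(r"\s*\(copy\)", re.IGNORECASE)


def base_stem(filename):
    """Return the lower-cased file name without its extension or "(COPY)" marker."""
    stem, _ext = os.path.splitext(filename)
    return _COPY_MARKER.sub("", stem).strip().lower()


def looks_like_copy(name_a, size_a, name_b, size_b):
    """Treat two files as copies if their sizes and cleaned stems match."""
    return size_a == size_b and base_stem(name_a) == base_stem(name_b)


print(looks_like_copy("FOO50407.JPG", 1_700_000, "FOO50407 (COPY).WAV", 1_700_000))  # prints True
```

Ignoring the extension makes the check looser than a pure name comparison, so a match here is best read as a candidate for review rather than proof of duplication.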
Experimental Results
On sample data (share of total deleted file size):
WAV: 98%
Others: 2%
Experimental Results
On total data (share of total deleted file size):
WAV: 94%
Others: 6%
Generated Log File
From left to right, the columns are: new file name, old file name, old directory, size and
timestamp.
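A provenance log with these five columns could be written as CSV, as in the sketch below. The on-disk format of the real log is not given in the slides, so the CSV layout, the column identifiers in `LOG_COLUMNS`, the ISO 8601 timestamp, and the function name `append_log_row` are all assumptions.

```python
import csv
from datetime import datetime, timezone

# Column order follows the slide; the identifiers themselves are hypothetical.
LOG_COLUMNS = ["new_file_name", "old_file_name", "old_directory", "size_bytes", "timestamp"]


def append_log_row(log_file, new_name, old_name, old_dir, size_bytes, when=None):
    """Append one provenance record to an open text file in CSV form."""
    when = when or datetime.now(timezone.utc)
    csv.writer(log_file).writerow(
        [new_name, old_name, old_dir, size_bytes, when.isoformat()]
    )
```

Recording the old name and directory for every removed file is what makes it possible to trace a deleted duplicate back to its original location later.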
Limitations
 We observed a few limitations in the system we built.
 Our system does not recognize different date formats within a file name
as the same date and treats each of them as a different string.
 Example: 25-05-2008 and 2008-25-5 are treated differently
 Our system also does not recognize abbreviations
 Example: MK for MunKen is not taken to be similar
So, human review is still required to completely de-duplicate the
data when the ingestion is unstructured.
Conclusion
 Data de-duplication is a job-specific or, to be precise, an
application-specific task.
 Given the specifications and the logic we implemented, our methods
succeeded in de-duplicating a large amount of data, freeing almost
400 GB on the given 1 TB hard drive.
Thank You!!
Questions??

Data De-duplication (Spring 2014)


Editor's Notes

  1. Identify duplicate data by sampling
  2. 750 MB of a total of 13 GB.
  3. 395 GB of a total of 900 GB. 2500 duplicate files in the whole directory. Most frequent duplicate: DS_Store. Largest duplicates by size: WAV = 370 GB. WAV 370 GB /