SlideShare une entreprise Scribd logo
1  sur  15
Télécharger pour lire hors ligne
Digging into File Formats:
Poking around at data using file,
DROID, JHOVE, and more
Presented by Stephen Eisenhauer
UNT Libraries TechTalks
February 12, 2014
Why?
⬝ We handle a lot of digital information
⬝ It’s not always readily identifiable
⬞ Names/extensions can be meaningless
⬞ Recovered data may have no names or
metadata at all

“I totally know what’s in this folder.”
Why?
⬝ Sometimes we just need to verify a file is
what it is supposed to be
⬞ “Why won’t this video open?”
“What? It’s actually a HTML 404 document??”

⬝ Maybe you want to automate
⬞ Statistical analysis, reporting, workflow, etc.
What’s in a file, anyway?
⬝ Files are sequences of numeric values
⬝ Those values are meaningless if you don’t
know what they represent
⬞ ASCII characters? Colors?
Something more complex?
What’s in a file, anyway?
⬝ Filenames aren’t stored inside the file
⬝ File extensions are really just hints
⬝ Metadata only exists within a file if the
format specifies it (MP3, PDF, DOC…)
So, how can we tell what’s in mystery data?
⬝ File Identification Tools: Software trained to
look for certain special patterns in data to
determine its file format
⬝ Usually known as “magic numbers”
Fun fact: Java class files all start
with the hexadecimal number
CAFEBABE or CAFED00D.
The unix file command
⬝ Comes installed on Mac OSX and most Linux
operating systems
⬝ (For Ubuntu, just install the “file” package)
⬝ A very quick way to spot-check files
DROID: Digital Record and Object Identification
⬝ Developed by the U.K. National Archives
⬝ Fully free and Open Source
⬝ Uses the industry-standard PRONOM
registry of file format information
⬝ Oriented toward large batches of files
⬝ Comes with a graphical user interface in
addition to a command-line tool
Getting DROID
⬝ Works on Windows, Mac, Linux
⬝ Requires Java
⬝ Download from nationalarchives.gov.uk
a. Click “Download the current version of DROID”
b. Extract the ZIP file to your Desktop (anywhere, really)
c. Run droid.bat (or droid.sh on Linux/OSX) to launch
Let’s make our first DROID profile
⬝ After checking for updates, you’ll see an
empty workspace labeled “Untitled-1”
⬝ This is a “profile” in DROID terms; it
represents a set of data you’re working on
⬝ Add some files/folders to this profile using
the Add (+) button
Let DROID do its thing
⬝ Once you’ve added the files you want to
analyze, click the Start button
⬝ When DROID is finished, you will see the
columns in the profile populate with
information
We did it!
⬝ You can now save this profile and open it later
using DROID without needing to analyze the
data again
⬝ You can also use DROID’s handy features:
⬞ Export lets you save the analysis as a CSV spreadsheet
⬞ Filter lets you drill down if your dataset is large
⬞ Report offers a range of statistical reports that can be
generated with the analysis results
What’s the impact?
⬝ file and DROID are commonly used within
the digital preservation scene
⬝ Institutional repositories often integrate
with these tools
⬝ Data curators use these tools when
ensuring quality and integrity
⬝ Software package including Archivematica,
FITS, and the FCLA Description Service
integrate with these tools out-of-the-box
Other tools to be aware of
⬝ JHOVE (and JHOVE2)
⬞ Determines whether data of a known format is valid

⬝ FITS (File Information Tool Set)
⬞ Analyzes data using a wide range of tools (including
DROID and JHOVE) to look at it from every angle

⬝ FCLA Description Service
⬞ Web-based application that analyzes a single file
using DROID and JHOVE and produces a PREMIS XML
document containing the results
Links to project web sites
DROID:_http://nationalarchives.gov.
uk/information-management/projects-andwork/droid.htm
JHOVE: http://jhove.sourceforge.net/
JHOVE2: https://bitbucket.org/jhove2/main/wiki
FITS: http://fitstool.org
Description Service: http://description.fcla.edu/
And quick primer on all of these tools:
http://metaarchive.org/imls/index.php/Format_Recognition_Tools_Documentation_for_ETDs

Thanks for attending!

Contenu connexe

Tendances

A basic course on Research data management, part 3: sharing your data
A basic course on Research data management, part 3: sharing your dataA basic course on Research data management, part 3: sharing your data
A basic course on Research data management, part 3: sharing your dataLeon Osinski
 
Concept of computer files
Concept of computer filesConcept of computer files
Concept of computer filesSamuel Igbanogu
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netStephen Lorello
 
File organization
File organizationFile organization
File organizationGokul017
 
Linked Open Data: A simple how-to
Linked Open Data: A simple how-toLinked Open Data: A simple how-to
Linked Open Data: A simple how-tonvitucci
 
Connections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystifiedConnections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystifiedJakob .
 
Using Encase for Digital Investigations
Using Encase for Digital InvestigationsUsing Encase for Digital Investigations
Using Encase for Digital InvestigationsDFLABS SRL
 
Scripting User Contributed Interlinking
Scripting User Contributed InterlinkingScripting User Contributed Interlinking
Scripting User Contributed Interlinkingwhalb
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...gagravarr
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and SharingC. Tobin Magle
 
Linked (Open) Data: A quick introduction
Linked (Open) Data: A quick introductionLinked (Open) Data: A quick introduction
Linked (Open) Data: A quick introductionnvitucci
 
NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword Haitham El-Ghareeb
 
File organization 1
File organization 1File organization 1
File organization 1Rupali Rana
 

Tendances (20)

A basic course on Research data management, part 3: sharing your data
A basic course on Research data management, part 3: sharing your dataA basic course on Research data management, part 3: sharing your data
A basic course on Research data management, part 3: sharing your data
 
Concept of computer files
Concept of computer filesConcept of computer files
Concept of computer files
 
File organisation
File organisationFile organisation
File organisation
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .net
 
File organization
File organizationFile organization
File organization
 
EDI Training Module 9: Explore EML with XML Editors
EDI Training Module 9:  Explore EML with XML EditorsEDI Training Module 9:  Explore EML with XML Editors
EDI Training Module 9: Explore EML with XML Editors
 
Linked Open Data: A simple how-to
Linked Open Data: A simple how-toLinked Open Data: A simple how-to
Linked Open Data: A simple how-to
 
Connections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystifiedConnections that work: Linked Open Data demystified
Connections that work: Linked Open Data demystified
 
Using Encase for Digital Investigations
Using Encase for Digital InvestigationsUsing Encase for Digital Investigations
Using Encase for Digital Investigations
 
Scripting User Contributed Interlinking
Scripting User Contributed InterlinkingScripting User Contributed Interlinking
Scripting User Contributed Interlinking
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
Apache Tika
Apache TikaApache Tika
Apache Tika
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and Sharing
 
Linked (Open) Data: A quick introduction
Linked (Open) Data: A quick introductionLinked (Open) Data: A quick introduction
Linked (Open) Data: A quick introduction
 
Indexing
IndexingIndexing
Indexing
 
NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword
 
File organization 1
File organization 1File organization 1
File organization 1
 
02222016
0222201602222016
02222016
 
Web of data
Web of dataWeb of data
Web of data
 

En vedette

Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...JISC KeepIt project
 
Cochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsCochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsFuture Perfect 2012
 
Pain points for preservation services / workflows in repositories
Pain points for preservation services /  workflows in repositories Pain points for preservation services /  workflows in repositories
Pain points for preservation services / workflows in repositories prwheatley
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, pleaseAndy Jackson
 
Preservation content in_files
Preservation content in_filesPreservation content in_files
Preservation content in_filesRichard Wright
 

En vedette (7)

Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
 
Cochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsCochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and Formats
 
Pain points for preservation services / workflows in repositories
Pain points for preservation services /  workflows in repositories Pain points for preservation services /  workflows in repositories
Pain points for preservation services / workflows in repositories
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, please
 
[Dpf manager] berlin workshop
[Dpf manager] berlin workshop[Dpf manager] berlin workshop
[Dpf manager] berlin workshop
 
Preservation content in_files
Preservation content in_filesPreservation content in_files
Preservation content in_files
 

Similaire à Digging into File Formats: Poking around at data using file, DROID, JHOVE, and more

Dan Crowley - Jack Of All Formats
Dan Crowley - Jack Of All FormatsDan Crowley - Jack Of All Formats
Dan Crowley - Jack Of All FormatsSource Conference
 
Understanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) EnvironmentUnderstanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) EnvironmentAdetula Bunmi
 
File Handling In C++(OOPs))
File Handling In C++(OOPs))File Handling In C++(OOPs))
File Handling In C++(OOPs))Papu Kumar
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptxWidsoulDevil
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshopl_ernest
 
File Management (1).pptx
File Management (1).pptxFile Management (1).pptx
File Management (1).pptxSolomonAnab1
 
Unit 4 File and Data Management
Unit 4 File and Data ManagementUnit 4 File and Data Management
Unit 4 File and Data ManagementSoushilove
 
Unit 4 file and data management
Unit 4 file and data managementUnit 4 file and data management
Unit 4 file and data managementSoushilove
 
Unit 4 File and Data Management
Unit 4 File and Data ManagementUnit 4 File and Data Management
Unit 4 File and Data ManagementSoushilove
 
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docxevonnehoggarth79783
 
Chapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docx
Chapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docxChapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docx
Chapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docxchristinemaritza
 
Jack of all Formats
Jack of all FormatsJack of all Formats
Jack of all FormatsBaronZor
 

Similaire à Digging into File Formats: Poking around at data using file, DROID, JHOVE, and more (20)

Dan Crowley - Jack Of All Formats
Dan Crowley - Jack Of All FormatsDan Crowley - Jack Of All Formats
Dan Crowley - Jack Of All Formats
 
C) ICT Application
C) ICT ApplicationC) ICT Application
C) ICT Application
 
Folder Watching For Automated Document Capture, Batch Scanning
Folder Watching For Automated Document Capture, Batch ScanningFolder Watching For Automated Document Capture, Batch Scanning
Folder Watching For Automated Document Capture, Batch Scanning
 
Digital Library Software
Digital Library SoftwareDigital Library Software
Digital Library Software
 
Understanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) EnvironmentUnderstanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) Environment
 
File Handling In C++(OOPs))
File Handling In C++(OOPs))File Handling In C++(OOPs))
File Handling In C++(OOPs))
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptx
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
 
File Management (1).pptx
File Management (1).pptxFile Management (1).pptx
File Management (1).pptx
 
INT 1010 04-5.pdf
INT 1010 04-5.pdfINT 1010 04-5.pdf
INT 1010 04-5.pdf
 
Unit 4 File and Data Management
Unit 4 File and Data ManagementUnit 4 File and Data Management
Unit 4 File and Data Management
 
Unit 4 file and data management
Unit 4 file and data managementUnit 4 file and data management
Unit 4 file and data management
 
Unit 4 File and Data Management
Unit 4 File and Data ManagementUnit 4 File and Data Management
Unit 4 File and Data Management
 
Edubooktraining
EdubooktrainingEdubooktraining
Edubooktraining
 
Batch Document Processing with ImageRamp Batch
Batch Document Processing with ImageRamp BatchBatch Document Processing with ImageRamp Batch
Batch Document Processing with ImageRamp Batch
 
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
841- Advanced Computer ForensicsUnix Forensics LabDue Date.docx
 
Chapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docx
Chapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docxChapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docx
Chapter 8 Common Forensic ToolsOverviewIn this chapter, youl.docx
 
File Handling In C++
File Handling In C++File Handling In C++
File Handling In C++
 
file_c.pdf
file_c.pdffile_c.pdf
file_c.pdf
 
Jack of all Formats
Jack of all FormatsJack of all Formats
Jack of all Formats
 

Dernier

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Dernier (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Digging into File Formats: Poking around at data using file, DROID, JHOVE, and more

  • 1. Digging into File Formats: Poking around at data using file, DROID, JHOVE, and more Presented by Stephen Eisenhauer UNT Libraries TechTalks February 12, 2014
  • 2. Why? ⬝ We handle a lot of digital information ⬝ It’s not always readily identifiable ⬞ Names/extensions can be meaningless ⬞ Recovered data may have no names or metadata at all “I totally know what’s in this folder.”
  • 3. Why? ⬝ Sometimes we just need to verify a file is what it is supposed to be ⬞ “Why won’t this video open?” “What? It’s actually a HTML 404 document??” ⬝ Maybe you want to automate ⬞ Statistical analysis, reporting, workflow, etc.
  • 4. What’s in a file, anyway? ⬝ Files are sequences of numeric values ⬝ Those values are meaningless if you don’t know what they represent ⬞ ASCII characters? Colors? Something more complex?
  • 5. What’s in a file, anyway? ⬝ Filenames aren’t stored inside the file ⬝ File extensions are really just hints ⬝ Metadata only exists within a file if the format specifies it (MP3, PDF, DOC…)
  • 6. So, how can we tell what’s in mystery data? ⬝ File Identification Tools: Software trained to look for certain special patterns in data to determine its file format ⬝ Usually known as “magic numbers” Fun fact: Java class files all start with the hexadecimal number CAFEBABE or CAFED00D.
  • 7. The unix file command ⬝ Comes installed on Mac OSX and most Linux operating systems ⬝ (For Ubuntu, just install the “file” package) ⬝ A very quick way to spot-check files
  • 8. DROID: Digital Record and Object Identification ⬝ Developed by the U.K. National Archives ⬝ Fully free and Open Source ⬝ Uses the industry-standard PRONOM registry of file format information ⬝ Oriented toward large batches of files ⬝ Comes with a graphical user interface in addition to a command-line tool
  • 9. Getting DROID ⬝ Works on Windows, Mac, Linux ⬝ Requires Java ⬝ Download from nationalarchives.gov.uk a. Click “Download the current version of DROID” b. Extract the ZIP file to your Desktop (anywhere, really) c. Run droid.bat (or droid.sh on Linux/OSX) to launch
  • 10. Let’s make our first DROID profile ⬝ After checking for updates, you’ll see an empty workspace labeled “Untitled-1” ⬝ This is a “profile” in DROID terms; it represents a set of data you’re working on ⬝ Add some files/folders to this profile using the Add (+) button
  • 11. Let DROID do its thing ⬝ Once you’ve added the files you want to analyze, click the Start button ⬝ When DROID is finished, you will see the columns in the profile populate with information
  • 12. We did it! ⬝ You can now save this profile and open it later using DROID without needing to analyze the data again ⬝ You can also use DROID’s handy features: ⬞ Export lets you save the analysis as a CSV spreadsheet ⬞ Filter lets you drill down if your dataset is large ⬞ Report offers a range of statistical reports that can be generated with the analysis results
  • 13. What’s the impact? ⬝ file and DROID are commonly used within the digital preservation scene ⬝ Institutional repositories often integrate with these tools ⬝ Data curators use these tools when ensuring quality and integrity ⬝ Software package including Archivematica, FITS, and the FCLA Description Service integrate with these tools out-of-the-box
  • 14. Other tools to be aware of ⬝ JHOVE (and JHOVE2) ⬞ Determines whether data of a known format is valid ⬝ FITS (File Information Tool Set) ⬞ Analyzes data using a wide range of tools (including DROID and JHOVE) to look at it from every angle ⬝ FCLA Description Service ⬞ Web-based application that analyzes a single file using DROID and JHOVE and produces a PREMIS XML document containing the results
  • 15. Links to project web sites DROID:_http://nationalarchives.gov. uk/information-management/projects-andwork/droid.htm JHOVE: http://jhove.sourceforge.net/ JHOVE2: https://bitbucket.org/jhove2/main/wiki FITS: http://fitstool.org Description Service: http://description.fcla.edu/ And quick primer on all of these tools: http://metaarchive.org/imls/index.php/Format_Recognition_Tools_Documentation_for_ETDs Thanks for attending!