1. Scientific data curation and processing with Apache Tika
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
2. Roadmap
• 1st part of the talk
– Why Tika?
– What is Tika?
– What are the current versions of Tika?
– What can it do?
• 2nd part of the talk
– NASA Earth Science Data Systems
– Data System Needs and Requirements
– How does Tika help?
3. And you are?
• Apache Member involved in
– Tika (VP, PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion)
• Architect/Developer at NASA JPL in Pasadena, CA
• Software Architecture/Engineering Prof at USC
5. Proliferation of content types available
• By some accounts, 16K to 51K content types*
• What to do with content types?
– Parse them
• How?
• Extract their text and structure
– Index their metadata
• In an indexing technology like Lucene, Solr, or the Google Search Appliance
– Identify what language they belong to
• N-grams
*http://filext.com/
9. Goals
• Identify and classify file types
– MIME detection
• Glob pattern
– *.txt
– *.pdf
• URL
– http://…pdf
– ftp://myfile.txt
• Magic bytes
• A combination of the above methods
• Classification means the reaction can be targeted
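The detection strategies on this slide can be sketched in plain Java. This is an illustration only, not Tika's implementation: the class name MimeSketch, its two small rule tables, and the choice to let magic bytes win over the file name are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of name-based plus magic-byte MIME detection. */
final class MimeSketch {
    // Glob-style rules, reduced here to file-name suffixes (*.txt, *.pdf).
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();
    // Magic-byte rules: leading bytes that identify a format.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        SUFFIXES.put(".txt", "text/plain");
        SUFFIXES.put(".pdf", "application/pdf");
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    /** Matches a file name or URL against the suffix rules; null if none applies. */
    static String detectByName(String nameOrUrl) {
        String lower = nameOrUrl.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : SUFFIXES.entrySet()) {
            if (lower.endsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return null;
    }

    /** Matches the first bytes of the content against the magic rules; null if none applies. */
    static String detectByMagic(byte[] head) {
        for (Map.Entry<String, byte[]> rule : MAGIC.entrySet()) {
            byte[] magic = rule.getValue();
            if (head.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return rule.getKey();
            }
        }
        return null;
    }

    /** Combination: trust the content first, then the name, then a generic fallback. */
    static String detect(String nameOrUrl, byte[] head) {
        String byMagic = detectByMagic(head);
        if (byMagic != null) {
            return byMagic;
        }
        String byName = detectByName(nameOrUrl);
        return byName != null ? byName : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("ftp://myfile.txt", new byte[0]));
        System.out.println(detect("download.bin", pdfHead));
    }
}
```

Letting the content override the name is one reasonable ordering, since a file extension is easy to get wrong; a real detector would carry far more rules and priorities.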
10. Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, and integration of various parsing libraries
• A rich Metadata API for representing different Metadata models
• A command line interface to the underlying Java code
• A GUI for the Java code
11. Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as Lucene sub-project
– Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
– A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in April 2010
• Over 90 issues shipping in latest release (0.8)
12. Community
• Mailing lists
– User: 153 peeps
– Dev: 114 peeps
• Committers/PMC
– 10 peeps
– Probably 5-6 active
• Releases
– 7 releases so far
– Working on 0.8
Credit: svnsearch.org
13. Getting started rapidly…like now!
• Download Tika from:
– http://tika.apache.org/download.html
• Grab tika-app-0.7.jar
• alias tika="java -jar tika-app-0.7.jar"
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)
14. Detecting MIME types from Java
• String type = new Tika().detect(…);
– java.io.InputStream
– java.io.File
– java.net.URL
– java.lang.String
15. Adding new MIME types
• Got XML?
• Based on freedesktop.org spec (loosely)
17. Third-party parsing libraries
• Most of the custom applications come with software libraries and tools to read/write these files
– Rather than re-invent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
– Not all libraries parse text in equivalent manners
– Some are faster than others
– Some are more reliable than others
20. Extraction of Metadata
• Important to follow common Metadata models
– Dublin Core – any electronic resource
– XMP – also general like Dublin Core
– Word Metadata – specific to .doc, .ppt, etc.
– EXIF – image related
• Lots of standards and models out there
– The use and extraction of common models allows for content intercomparison
– All standardize mechanisms for searching
– You always know for X file type that field Y is there and of type String or Int or Date
23. Metadata
• Metadata met = new Metadata();
// Dublin Core
met.set(Metadata.FORMAT, "text/html");
// multi-valued: add() appends a value, whereas a second set() would overwrite the first
met.add(Metadata.FORMAT, "text/plain");
System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
– New in Tika 0.8! run: tika --list-met-models
24. Methods for language identification
• N-grams
– Method of detecting the next character or set of characters in a sequence
– Useful in determining whether small snippets of text come from a particular language or character set
• Non-computational approaches
– Tagging
– Looking for common words or characters
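The N-gram approach can be illustrated with a small, self-contained Java sketch. It is not Tika's language identifier: the class NgramSketch, the use of character trigrams, and cosine similarity as the comparison measure are assumptions chosen to keep the example short, and the two training sentences are toy data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of language identification with character trigrams. */
final class NgramSketch {
    /** Counts overlapping character trigrams; runs of non-letters become a single '_'. */
    static Map<String, Integer> profile(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram profiles, from 0 (disjoint) to 1 (same direction). */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * (double) e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the most similar language profile, or null if nothing overlaps. */
    static String identify(String text, Map<String, Map<String, Integer>> languages) {
        Map<String, Integer> p = profile(text);
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : languages.entrySet()) {
            double score = similarity(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> languages = new HashMap<>();
        languages.put("en", profile("the quick brown fox jumps over the lazy dog and then the other thing"));
        languages.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und dann das andere ding"));
        System.out.println(identify("the dog and the fox", languages));
    }
}
```

Real profiles are built from large corpora per language; with training text this small the sketch only shows the mechanics, which is also why short snippets are the hard case.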
25. Language Detection
• LanguageIdentifier lang = new LanguageIdentifier(
new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses Ngram analysis included with Tika
– Originating from Nutch
– Can be improved
26. Running Tika in GUI form
• tika --gui
<html xmlns:html="…">
<body>
…
</body>
</html>
27. Integrating Tika into your App
• Maven
• Ant
• Eclipse
• It’s just a set of jars
– tika-core
– tika-parsers
– tika-app
– tika-bundle
28. Some really great stuff in 0.8
• Container aware detection and MIME improvements
• “Drop in” Parsers
– Compressed RTF / TNEF / LZFU parsing available via external plugin at Github
• New Parsers
– RSS
– Scientific files: NetCDF, HDF
29. Improvements to Tika
• Adding more parsers for content types
– Omnigraffle?
• Expanding ability to handle random access file parsing
– Scientific data file formats, some work on this
• Improving language and charset detection
32. Context
• NASA develops science data processing systems for multiple earth science missions
• These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research
• Typical characteristics
– Remote sensing instruments that orbit the Earth multiple times daily
– Data are acquired constantly
– Complex algorithms convert instrument measurements to geophysical quantities
33. The Square Kilometer Array
• 1 sq. km of antennas
• Never-before-seen resolution looking into the sky
• 700 TB
– Per second!
34. NASA DESDynI Mission
• 16 TB/day
• Geographically distributed
• 10s of 1000s of jobs per day
• Tier 1 Earth Science Decadal Mission
35. Some Considerations
• Scale
– Data throughput rates
– # of data types
– # of metadata types
– # of users to send the data to
• Federation
– Must leave the data where it is
– Socio/Economic/Political
• Heterogeneity
– Technology, data formats, skills!
37. How are we building these systems now?
• Allow for push/pull of data over arbitrary protocols
• Ingestion builds std catalog and archive
• Deliver product metadata to search, portal or GIS
• Plug in arbitrary met extractors
38. How are we building these systems now?
• Separation of file management from workflow management
• Allow for heterogeneous computing resources
• Easily integrate PGEs
• Leverages same ingestion crawler
39. What does this have to do with Tika?
• Metadata Ext: TIKA!
• MIME identification: TIKA!
40. What does this have to do with Tika?
• Metadata Ext: TIKA!
• MIME identification: TIKA!
41. Science Data File Formats
• Hierarchical Data Format (HDF)
– http://www.hdfgroup.org
– Versions 4 and 5
– Lots of NASA data is in 4, newer NASA data in 5
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
42. Science Data File Formats
• network Common Data Form (netCDF)
– www.unidata.ucar.edu/software/netcdf/
– Versions 3 and 4
– Heavily used in DOE, NOAA, etc.
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
– Not a hierarchical representation: all flat
43. So how does it work?
• Ingestion
– Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format
– Need to extract their met, catalog and archive them, etc.
• Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability into the Apache trunk
• Processing
– Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive
44. Tool support
• Entire stacks of tools written around these formats
– OPeNDAP, LAS, readers, writers, custom NASA mission toolkits
– OGC
• WMS, WCS, etc.
– Unique, one-of-a-kind software built around these data file formats
• Apache can contribute strongly in this area!
45. Besides processing science files
• …Tika also helps with
• MIME identification
– Useful in remote file acquisition
– Useful in classification (catalog/archive) of existing content
– Useful in crawling (see my Nutch talk)
• Language identification
– Can be useful when data is coming from around the world and we need to quickly identify whether or not we can process it
46. Big Goal
• More closely link OODT and Tika
– Add new parser to Tika
– Easily get OODT met extractor based on it
• Contribute back some features still baking in OODT
– Configuration aspects of parsing
– File types and extensions for science data files
• Spatial
– Some work done in my CS572 class on spatial parser for Tika – would be great to integrate with Tika, OODT, SIS, and Solr
47. NASA Geo Challenges
• Sometimes the data isn’t annotated with lat and lon
– How to discover this?
• Even when the data is annotated with spatial information, computation of e.g., a bounding box around the poles is difficult
• Efficiency and speed are difficult since data is at scale
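To make the polar bounding box problem concrete, here is a small Java sketch. It is an illustration under stated assumptions, not code from OODT, SIS, or Tika: the class PolarBoxSketch is hypothetical, vertices are {lat, lon} pairs in degrees, consecutive vertices are assumed to be less than 180 degrees of longitude apart, and the pole is chosen by the sign of the summed latitudes, which is only a crude heuristic.

```java
import java.util.Arrays;

/** Hypothetical sketch: why min/max bounding boxes fail for rings that circle a pole. */
final class PolarBoxSketch {
    /** Naive box {minLat, minLon, maxLat, maxLon}: plain min/max over the vertices. */
    static double[] naiveBox(double[][] ring) {
        double minLat = 90, minLon = 180, maxLat = -90, maxLon = -180;
        for (double[] p : ring) {
            minLat = Math.min(minLat, p[0]);
            maxLat = Math.max(maxLat, p[0]);
            minLon = Math.min(minLon, p[1]);
            maxLon = Math.max(maxLon, p[1]);
        }
        return new double[] {minLat, minLon, maxLat, maxLon};
    }

    /** Signed longitude travelled around the closed ring, each step wrapped into [-180, 180). */
    static double windingDegrees(double[][] ring) {
        double total = 0;
        for (int i = 0; i < ring.length; i++) {
            double step = ring[(i + 1) % ring.length][1] - ring[i][1];
            total += ((step + 180) % 360 + 360) % 360 - 180;
        }
        return total;
    }

    /** Widens the naive box when the ring travels a full turn of longitude around a pole. */
    static double[] poleAwareBox(double[][] ring) {
        double[] box = naiveBox(ring);
        if (Math.abs(windingDegrees(ring)) > 180) {
            box[1] = -180; // every longitude is covered
            box[3] = 180;
            double latSum = 0;
            for (double[] p : ring) {
                latSum += p[0];
            }
            // Assumption: the enclosed pole is the one in the ring's own hemisphere.
            if (latSum >= 0) {
                box[2] = 90;
            } else {
                box[0] = -90;
            }
        }
        return box;
    }

    public static void main(String[] args) {
        double[][] arcticRing = {{89, 0}, {89, 90}, {89, 180}, {89, -90}};
        System.out.println(Arrays.toString(naiveBox(arcticRing)));
        System.out.println(Arrays.toString(poleAwareBox(arcticRing)));
    }
}
```

For the ring at 89 degrees north, the naive box never reaches latitude 90 and so leaves out the pole it surrounds, while the winding check notices the full turn of longitude and widens the box.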
48. Alright, I’ll shut up now
• Any questions?
• THANK YOU!
– mattmann@apache.org
– @chrismattmann on Twitter
49. Acknowledgements
• Some Tika material inspired by Jukka Zitting’s talks
– http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
– http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
– OODT Team
50. Book
• Jukka and I are writing a book on Tika
– Working on Chapters 8 and 9 of 15
• Early Access available through MEAP program
• http://manning.com/mattmann/