SlideShare une entreprise Scribd logo
1  sur  21
Content Extraction with Apache Tika
     Jukka Zitting | Tika committer, co-author of Tika in Action




© 2012 Adobe Systems Incorporated. All Rights Reserved.
Content Extraction with Apache Tika

    Introduction to Apache Tika
    Full text extraction with Tika
    Tika and Solr - the ExtractingRequestHandler
    Tika and Lucene - direct feeding of the index
         forked parsing
         link extraction




© 2012 Adobe Systems Incorporated. All Rights Reserved.   2
Introduction to Apache Tika
                                                          section 1 / 4




© 2012 Adobe Systems Incorporated. All Rights Reserved.
Introduction to Apache Tika




                          The Apache Tika™ toolkit
                          - detects and extracts
                          - metadata and structured text content
                          - from various documents
                          - using existing parser libraries.

© 2012 Adobe Systems Incorporated. All Rights Reserved.   4
Problem domain




© 2012 Adobe Systems Incorporated. All Rights Reserved.   5
The Tika solution



                                                              It is a truth
                                                              universally
                                                              acknowledged, that
                                                              a single man in
                                                              possession of a
                                                              good fortune, must
                                                              be in want of a
                                                              wife...
                                                              Content
                                                              dc:title=
                                                               Pride and Prejudice
                                                              dc:creator=
                                                               Jane Austen
         Document                                             dc:date=1813

                                                              Metadata




© 2012 Adobe Systems Incorporated. All Rights Reserved.   6
Project background

    Brief history
         2007           Tika       started in the Apache Incubator
         2008           Tika       graduates into a Lucene subproject
         2010           Tika       becomes a standalone TLP
         2011           Tika       1.0 released
         2011           Tika       in Action published
    Latest release is Apache Tika 1.2
         thousands of known media types
             most with associated type detection patterns
         dozens of supported document formats
             including all major office formats
         basic language detection
         etc.
    For more information http://tika.apache.org/


© 2012 Adobe Systems Incorporated. All Rights Reserved.   7
Ohloh summary (http://www.ohloh.net/p/tika)




© 2012 Adobe Systems Incorporated. All Rights Reserved.   8
Full text extraction with Tika
                                                          section 2 / 4




© 2012 Adobe Systems Incorporated. All Rights Reserved.
Demo: tika-app-1.2.jar

    https://github.com/jukka/tika-demo
    java -jar tika-app-1.2.jar




© 2012 Adobe Systems Incorporated. All Rights Reserved.   10
tika-app as a command line tool

$ java -jar tika-app-1.2.jar --xhtml /path/to/
document.doc

$ java -jar tika-app-1.2.jar --text http://example.com/
 document

$ java -jar tika-app-1.2.jar --metadata < document.doc

$ cat document.doc | java -jar tika-app-1.2.jar --text |
 grep foo

$ java -jar tika-app-1.2.jar --help



© 2012 Adobe Systems Incorporated. All Rights Reserved.   11
Tika’s Java API

    Divided in two layers
         The Tika facade: org.apache.tika.Tika
         Lower-level interfaces like Parser, Detector, etc.

    Use the Tika facade by default
         Provides simple support for most common use cases
         Example: new Tika().parseToString(“/path/to/document.doc”)

    Use the lower-level interfaces for more power or flexibility
         Allows fine-grained control of Tika functionality
         More complicated programming model
             Parsed content handled as XHTML SAX events
         Not all functionality is exposed through the Tika facade




© 2012 Adobe Systems Incorporated. All Rights Reserved.   12
Tika and Solr - the ExtractingRequestHandler
                                                          section 3 / 4




© 2012 Adobe Systems Incorporated. All Rights Reserved.
ExtractingRequestHandler

    aka Solr Cell
    http://wiki.apache.org/solr/ExtractingRequestHandler

     “Solr's ExtractingRequestHandler uses Tika
     to allow users to upload binary files to Solr and
     have Solr extract text from it and then index it.”

    For example:
     $ curl "http://localhost:8983/solr/update/extract?
     literal.id=document&commit=true" -F "file=@document.doc"

    Supports both text and metadata extraction
         with plenty of configurable options




© 2012 Adobe Systems Incorporated. All Rights Reserved.   14
ExtractingRequestHandler parameters

    Helping Tika do it’s job
         resource.name=document.doc - Helps Tika’s automatic type detection
         resource.password=secret - Allows Tika to read encrypted documents
         passwordsFile=/path/to/password-file - Resource name to password
          mappings
             for example: .*.pdf$ = pdf-secret
    Capturing special content
         xpath=//a - Capture only content inside elements that match the
          specified query
         capture=h1 - Capture content inside specific elements to a separate
          field
         captureAttr=true - Capture attributes into separate fields named after
          the element
    Mapping field names
         lowernames=true - Normalize metadata field names to “content_type”,
          etc.

© 2012 Adobe Systems Incorporated. All Rights Reserved.   15
Tika and Lucene - direct feeding of the index
                                                          section 4 / 4




© 2012 Adobe Systems Incorporated. All Rights Reserved.
Using the Tika facade to feed Lucene

// Index first part of the document
String text = new Tika().parseToString(“/path/to/document.doc”);
document.add(new TextField(“text”, text, Field.Store.NO));


// Index the full document
Reader reader = new Tika().parse(“/path/to/document.doc”);
document.add(new TextField(“text”, reader));


// Index also some metadata
Metadata metadata = new Metadata();
Reader reader = new Tika().parse(
   new FileInputStream(“/path/to/document.doc”), metadata);
document.add(new TextField(“text”, reader));
document.add(new StringField(“type”,
metadata.get(Metadata.CONTENT_TYPE));

© 2012 Adobe Systems Incorporated. All Rights Reserved.   17
Things to consider

    What if the document is larger than your memory?
         Index only first N bytes/characters?
         WriteOutContentHandler supports an explicit write limit
             Enabled by default in the Tika facade, see get/setMaxStringLength()

    What if the document is malformed or intentionally broken?
         Could cause denial of service problems
             Might even crash the entire JVM due to bugs in native libraries in the JDK!
         SecureContentHandler monitors parsing and terminates it if things look
          bad
             Enabled by default in the Tika facade

    Ultimate solution: forked parsing and the Tika server
         Parse documents in separate, sandboxed JVM processes
         A document could fail to parse, but your application won’t crash
         Code is already there, but still a bit tricky to set up

© 2012 Adobe Systems Incorporated. All Rights Reserved.   18
Link extraction for web crawlers

    The LinkContentHandler class can be used to extract all links from
     a document
         Works also with links in things like PDF, MS Word and email documents
         Use TeeContentHandler to combine with other ways of capturing content

// for example
LinkContentHandler lch = new LinkContentHandler();
BodyContentHandler bch = new BodyContentHandler();
new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...);

System.out.println(“Content: “ + bch);
for (Link link : lch.getLinks()) {
   System.out.println(“Link: “ + link):
}


© 2012 Adobe Systems Incorporated. All Rights Reserved.   19
Questions?
                                                          http://tika.apache.org/




© 2012 Adobe Systems Incorporated. All Rights Reserved.
© 2012 Adobe Systems Incorporated. All Rights Reserved.

Contenu connexe

Tendances

Deep Dive - Infrastructure as Code
Deep Dive - Infrastructure as CodeDeep Dive - Infrastructure as Code
Deep Dive - Infrastructure as CodeAmazon Web Services
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)camunda services GmbH
 
Metasploit for information gathering
Metasploit for information gatheringMetasploit for information gathering
Metasploit for information gatheringChris Harrington
 
DevOps Monitoring and Alerting
DevOps Monitoring and AlertingDevOps Monitoring and Alerting
DevOps Monitoring and AlertingKhairul Zebua
 
Introduction to Kubernetes Workshop
Introduction to Kubernetes WorkshopIntroduction to Kubernetes Workshop
Introduction to Kubernetes WorkshopBob Killen
 
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeTerraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeMartin Schütte
 
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewOpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewMaría Angélica Bracho
 
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?VMware Tanzu Korea
 
DevOps: What, who, why and how?
DevOps: What, who, why and how?DevOps: What, who, why and how?
DevOps: What, who, why and how?Red Gate Software
 
Docker Networking Overview
Docker Networking OverviewDocker Networking Overview
Docker Networking OverviewSreenivas Makam
 

Tendances (20)

Deep Dive - Infrastructure as Code
Deep Dive - Infrastructure as CodeDeep Dive - Infrastructure as Code
Deep Dive - Infrastructure as Code
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
 
12-Factor Apps
12-Factor Apps12-Factor Apps
12-Factor Apps
 
Metasploit for information gathering
Metasploit for information gatheringMetasploit for information gathering
Metasploit for information gathering
 
Docker Container
Docker ContainerDocker Container
Docker Container
 
DevOps Monitoring and Alerting
DevOps Monitoring and AlertingDevOps Monitoring and Alerting
DevOps Monitoring and Alerting
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
 
Introduction to Kubernetes Workshop
Introduction to Kubernetes WorkshopIntroduction to Kubernetes Workshop
Introduction to Kubernetes Workshop
 
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeTerraform -- Infrastructure as Code
Terraform -- Infrastructure as Code
 
Introduction to helm
Introduction to helmIntroduction to helm
Introduction to helm
 
Git
GitGit
Git
 
Terraform Basics
Terraform BasicsTerraform Basics
Terraform Basics
 
ELK Stack
ELK StackELK Stack
ELK Stack
 
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewOpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
 
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
 
Architecture of Facebook
Architecture of FacebookArchitecture of Facebook
Architecture of Facebook
 
Docker & kubernetes
Docker & kubernetesDocker & kubernetes
Docker & kubernetes
 
DevOps: What, who, why and how?
DevOps: What, who, why and how?DevOps: What, who, why and how?
DevOps: What, who, why and how?
 
Docker Networking Overview
Docker Networking OverviewDocker Networking Overview
Docker Networking Overview
 
Containers and Docker
Containers and DockerContainers and Docker
Containers and Docker
 

Similaire à Content extraction with apache tika

PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationAlfresco Software
 
Securing docker containers
Securing docker containersSecuring docker containers
Securing docker containersMihir Shah
 
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas
DC-2008 Tutorial 3 - Dublin Core and other metadata schemasDC-2008 Tutorial 3 - Dublin Core and other metadata schemas
DC-2008 Tutorial 3 - Dublin Core and other metadata schemasMikael Nilsson
 
DataCite How To: Use the MDS
DataCite How To: Use the MDSDataCite How To: Use the MDS
DataCite How To: Use the MDSFrauke Ziedorn
 
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...Amazon Web Services
 
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...buildacloud
 
Open writing-cloud-collab
Open writing-cloud-collabOpen writing-cloud-collab
Open writing-cloud-collabKaren Vuong
 
Using Fedora Commons To Create A Persistent Archive
Using Fedora Commons To Create A Persistent ArchiveUsing Fedora Commons To Create A Persistent Archive
Using Fedora Commons To Create A Persistent ArchivePhil Cryer
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataMarco Garcia
 
What is WebDAV - uploaded by Murali Krishna Nookella
What is WebDAV - uploaded by Murali Krishna NookellaWhat is WebDAV - uploaded by Murali Krishna Nookella
What is WebDAV - uploaded by Murali Krishna Nookellamuralikrishnanookella
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it bettergvernik
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?gvernik
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tikaSutthipong Kuruhongsa
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tikaSutthipong Kuruhongsa
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordMark Wilkinson
 
AWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic BeanstalkAWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic BeanstalkAmazon Web Services
 

Similaire à Content extraction with apache tika (20)

PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Securing docker containers
Securing docker containersSecuring docker containers
Securing docker containers
 
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas
DC-2008 Tutorial 3 - Dublin Core and other metadata schemasDC-2008 Tutorial 3 - Dublin Core and other metadata schemas
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas
 
DataCite How To: Use the MDS
DataCite How To: Use the MDSDataCite How To: Use the MDS
DataCite How To: Use the MDS
 
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
 
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
 
Open writing-cloud-collab
Open writing-cloud-collabOpen writing-cloud-collab
Open writing-cloud-collab
 
Multi Stage Docker Build
Multi Stage Docker Build Multi Stage Docker Build
Multi Stage Docker Build
 
People aggregator
People aggregatorPeople aggregator
People aggregator
 
Using Fedora Commons To Create A Persistent Archive
Using Fedora Commons To Create A Persistent ArchiveUsing Fedora Commons To Create A Persistent Archive
Using Fedora Commons To Create A Persistent Archive
 
Core os dna_automacon
Core os dna_automaconCore os dna_automacon
Core os dna_automacon
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
 
What is WebDAV - uploaded by Murali Krishna Nookella
What is WebDAV - uploaded by Murali Krishna NookellaWhat is WebDAV - uploaded by Murali Krishna Nookella
What is WebDAV - uploaded by Murali Krishna Nookella
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
Cloudera and Spark setup
Cloudera and Spark setupCloudera and Spark setup
Cloudera and Spark setup
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, Oxford
 
AWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic BeanstalkAWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic Beanstalk
 

Plus de Jukka Zitting

The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6Jukka Zitting
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CIJukka Zitting
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Jukka Zitting
 
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repositoryJukka Zitting
 
MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStoreJukka Zitting
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorJukka Zitting
 
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Jukka Zitting
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repositoryJukka Zitting
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuningJukka Zitting
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical modelJukka Zitting
 
Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache TikaText and metadata extraction with Apache Tika
Text and metadata extraction with Apache TikaJukka Zitting
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache TikaJukka Zitting
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitJukka Zitting
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiJukka Zitting
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On SteroidsJukka Zitting
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache TikaJukka Zitting
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of JackrabbitJukka Zitting
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache JackrabbitJukka Zitting
 

Plus de Jukka Zitting (20)

The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
 
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository
 
MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
 
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repository
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuning
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
 
Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache TikaText and metadata extraction with Apache Tika
Text and metadata extraction with Apache Tika
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
NoSQL Oakland
NoSQL OaklandNoSQL Oakland
NoSQL Oakland
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On Steroids
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
 
Apache Tika
Apache TikaApache Tika
Apache Tika
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
 

Content extraction with apache tika

  • 1. Content Extraction with Apache Tika Jukka Zitting | Tika committer, co-author of Tika in Action © 2012 Adobe Systems Incorporated. All Rights Reserved.
  • 2. Content Extraction with Apache Tika  Introduction to Apache Tika  Full text extraction with Tika  Tika and Solr - the ExtractingRequestHandler  Tika and Lucene - direct feeding of the index  forked parsing  link extraction © 2012 Adobe Systems Incorporated. All Rights Reserved. 2
  • 3. Introduction to Apache Tika section 1 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
  • 4. Introduction to Apache Tika The Apache Tika™ toolkit - detects and extracts - metadata and structured text content - from various documents - using existing parser libraries. © 2012 Adobe Systems Incorporated. All Rights Reserved. 4
  • 5. Problem domain © 2012 Adobe Systems Incorporated. All Rights Reserved. 5
  • 6. The Tika solution It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife... Content dc:title= Pride and Prejudice dc:creator= Jane Austen Document dc:date=1813 Metadata © 2012 Adobe Systems Incorporated. All Rights Reserved. 6
  • 7. Project background  Brief history  2007 Tika started in the Apache Incubator  2008 Tika graduates into a Lucene subproject  2010 Tika becomes a standalone TLP  2011 Tika 1.0 released  2011 Tika in Action published  Latest release is Apache Tika 1.2  thousands of known media types  most with associated type detection patterns  dozens of supported document formats  including all major office formats  basic language detection  etc.  For more information http://tika.apache.org/ © 2012 Adobe Systems Incorporated. All Rights Reserved. 7
  • 8. Ohloh summary (http://www.ohloh.net/p/tika) © 2012 Adobe Systems Incorporated. All Rights Reserved. 8
  • 9. Full text extraction with Tika section 2 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
  • 10. Demo: tika-app-1.2.jar  https://github.com/jukka/tika-demo  java -jar tika-app-1.2.jar © 2012 Adobe Systems Incorporated. All Rights Reserved. 10
  • 11. tika-app as a command line tool $ java -jar tika-app-1.2.jar --xhtml /path/to/ document.doc $ java -jar tika-app-1.2.jar --text http://example.com/ document $ java -jar tika-app-1.2.jar --metadata < document.doc $ cat document.doc | java -jar tika-app-1.2.jar --text | grep foo $ java -jar tika-app-1.2.jar --help © 2012 Adobe Systems Incorporated. All Rights Reserved. 11
  • 12. Tika’s Java API  Divided in two layers  The Tika facade: org.apache.tika.Tika  Lower-level interfaces like Parser, Detector, etc.  Use the Tika facade by default  Provides simple support for most common use cases  Example: new Tika().parseToString(“/path/to/document.doc”)  Use the lower-level interfaces for more power or flexibility  Allows fine-grained control of Tika functionality  More complicated programming model  Parsed content handled as XHTML SAX events  Not all functionality is exposed through the Tika facade © 2012 Adobe Systems Incorporated. All Rights Reserved. 12
  • 13. Tika and Solr - the ExtractingRequestHandler section 3 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
  • 14. ExtractingRequestHandler  aka Solr Cell  http://wiki.apache.org/solr/ExtractingRequestHandler “Solr's ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.”  For example: $ curl "http://localhost:8983/solr/update/extract? literal.id=document&commit=true" -F "file=@document.doc"  Supports both text and metadata extraction  with plenty of configurable options © 2012 Adobe Systems Incorporated. All Rights Reserved. 14
  • 15. ExtractingRequestHandler parameters  Helping Tika do it’s job  resource.name=document.doc - Helps Tika’s automatic type detection  resource.password=secret - Allows Tika to read encrypted documents  passwordsFile=/path/to/password-file - Resource name to password mappings  for example: .*.pdf$ = pdf-secret  Capturing special content  xpath=//a - Capture only content inside elements that match the specified query  capture=h1 - Capture content inside specific elements to a separate field  captureAttr=true - Capture attributes into separate fields named after the element  Mapping field names  lowernames=true - Normalize metadata field names to “content_type”, etc. © 2012 Adobe Systems Incorporated. All Rights Reserved. 15
  • 16. Tika and Lucene - direct feeding of the index section 4 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
  • 17. Using the Tika facade to feed Lucene // Index first part of the document String text = new Tika().parseToString(“/path/to/document.doc”); document.add(new TextField(“text”, text, Field.Store.NO)); // Index the full document Reader reader = new Tika().parse(“/path/to/document.doc”); document.add(new TextField(“text”, reader)); // Index also some metadata Metadata metadata = new Metadata(); Reader reader = new Tika().parse( new FileInputStream(“/path/to/document.doc”), metadata); document.add(new TextField(“text”, reader)); document.add(new StringField(“type”, metadata.get(Metadata.CONTENT_TYPE)); © 2012 Adobe Systems Incorporated. All Rights Reserved. 17
  • 18. Things to consider  What if the document is larger than your memory?  Index only first N bytes/characters?  WriteOutContentHandler supports an explicit write limit  Enabled by default in the Tika facade, see get/setMaxStringLength()  What if the document is malformed or intentionally broken?  Could cause denial of service problems  Might even crash the entire JVM due to bugs in native libraries in the JDK!  SecureContentHandler monitors parsing and terminates it if things look bad  Enabled by default in the Tika facade  Ultimate solution: forked parsing and the Tika server  Parse documents in separate, sandboxed JVM processes  A document could fail to parse, but your application won’t crash  Code is already there, but still a bit tricky to set up © 2012 Adobe Systems Incorporated. All Rights Reserved. 18
  • 19. Link extraction for web crawlers  The LinkContentHandler class can be used to extract all links from a document  Works also with links in things like PDF, MS Word and email documents  Use TeeContentHandler to combine with other ways of capturing content // for example LinkContentHandler lch = new LinkContentHandler(); BodyContentHandler bch = new BodyContentHandler(); new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...); System.out.println(“Content: “ + bch); for (Link link : lch.getLinks()) { System.out.println(“Link: “ + link): } © 2012 Adobe Systems Incorporated. All Rights Reserved. 19
  • 20. Questions? http://tika.apache.org/ © 2012 Adobe Systems Incorporated. All Rights Reserved.
  • 21. © 2012 Adobe Systems Incorporated. All Rights Reserved.

Notes de l'éditeur

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n