Content Extraction with Apache Tika
     Jukka Zitting | Tika committer, co-author of Tika in Action




© 2012 Adobe Systems Incorporated. All Rights Reserved.
Content Extraction with Apache Tika

    Introduction to Apache Tika
    Full text extraction with Tika
    Tika and Solr - the ExtractingRequestHandler
    Tika and Lucene - direct feeding of the index
         forked parsing
         link extraction




Introduction to Apache Tika
                                                          section 1 / 4




Introduction to Apache Tika




                          The Apache Tika™ toolkit
                          - detects and extracts
                          - metadata and structured text content
                          - from various documents
                          - using existing parser libraries.

Problem domain




The Tika solution



    Document  ->  Content
                      It is a truth universally acknowledged, that a single man in
                      possession of a good fortune, must be in want of a wife...
              ->  Metadata
                      dc:title=Pride and Prejudice
                      dc:creator=Jane Austen
                      dc:date=1813




Project background

    Brief history
         2007           Tika       started in the Apache Incubator
         2008           Tika       graduates into a Lucene subproject
         2010           Tika       becomes a standalone TLP
         2011           Tika       1.0 released
         2011           Tika       in Action published
    Latest release is Apache Tika 1.2
         thousands of known media types
             most with associated type detection patterns
         dozens of supported document formats
             including all major office formats
         basic language detection
         etc.
    For more information http://tika.apache.org/
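
The "type detection patterns" mentioned above are largely magic-byte signatures. The following is a minimal sketch of that idea using only the JDK. It is not Tika's detector; the class and method names are hypothetical, and only three well-known signatures are checked.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of magic-byte type detection; not part of Tika.
class MagicDetector {

    // Returns a media type guessed from the first bytes of a document.
    static String detect(byte[] prefix) {
        if (startsWith(prefix, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(prefix, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(prefix, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        // Fallback when no known signature matches
        return "application/octet-stream";
    }

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf));
    }
}
```

A real detector combines such patterns with file name hints and container inspection, which is why passing a resource name alongside the bytes helps.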


Ohloh summary (http://www.ohloh.net/p/tika)




Full text extraction with Tika
                                                          section 2 / 4




Demo: tika-app-1.2.jar

    https://github.com/jukka/tika-demo
    java -jar tika-app-1.2.jar




tika-app as a command line tool

$ java -jar tika-app-1.2.jar --xhtml /path/to/document.doc

$ java -jar tika-app-1.2.jar --text http://example.com/document

$ java -jar tika-app-1.2.jar --metadata < document.doc

$ cat document.doc | java -jar tika-app-1.2.jar --text | grep foo

$ java -jar tika-app-1.2.jar --help



Tika’s Java API

    Divided into two layers
         The Tika facade: org.apache.tika.Tika
         Lower-level interfaces like Parser, Detector, etc.

    Use the Tika facade by default
         Provides simple support for most common use cases
         Example: new Tika().parseToString(new File("/path/to/document.doc"))

    Use the lower-level interfaces for more power or flexibility
         Allows fine-grained control of Tika functionality
         More complicated programming model
             Parsed content handled as XHTML SAX events
         Not all functionality is exposed through the Tika facade
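
To give a feel for the "XHTML SAX events" programming model, here is a minimal sketch that uses only the JDK. It assumes a hand-written XHTML string and the JDK's own SAX parser standing in for a Tika parser; the class name XhtmlTextCollector is hypothetical. The handler side looks much the same whichever parser produces the events.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the character data of XHTML SAX events.
class XhtmlTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called by the parser for each run of text between tags
        text.append(ch, start, length);
    }

    // Feeds the XHTML string through the JDK SAX parser and returns its text.
    static String extractText(String xhtml) {
        XhtmlTextCollector handler = new XhtmlTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1><p>Hello Tika</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

Working with events instead of one big string is what makes it possible to stream, filter or truncate content while parsing is still in progress.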




Tika and Solr - the ExtractingRequestHandler
                                                          section 3 / 4




ExtractingRequestHandler

    aka Solr Cell
    http://wiki.apache.org/solr/ExtractingRequestHandler

     “Solr's ExtractingRequestHandler uses Tika
     to allow users to upload binary files to Solr and
     have Solr extract text from it and then index it.”

    For example:
     $ curl "http://localhost:8983/solr/update/extract?literal.id=document&commit=true" -F "file=@document.doc"

    Supports both text and metadata extraction
         with plenty of configurable options




ExtractingRequestHandler parameters

    Helping Tika do its job
         resource.name=document.doc - Helps Tika’s automatic type detection
         resource.password=secret - Allows Tika to read encrypted documents
         passwordsFile=/path/to/password-file - Resource name to password mappings
             for example: .*\.pdf$ = pdf-secret
    Capturing special content
         xpath=//a - Capture only content inside elements that match the specified query
         capture=h1 - Capture content inside specific elements to a separate field
         captureAttr=true - Capture attributes into separate fields named after the element
    Mapping field names
         lowernames=true - Normalize metadata field names to “content_type”, etc.
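
These parameters are plain request parameters on the extract URL. Here is a minimal sketch of assembling such a URL with proper encoding, using only the JDK. The base URL and parameter names are taken from the slides; the class ExtractUrlBuilder is hypothetical, and nothing is actually sent to Solr.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that builds an ExtractingRequestHandler request URL.
class ExtractUrlBuilder {

    // Joins the parameters into a URL-encoded query string after the base URL.
    static String buildUrl(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "document");
        params.put("resource.name", "document.doc");
        params.put("xpath", "//a");
        params.put("lowernames", "true");
        params.put("commit", "true");
        System.out.println(
            buildUrl("http://localhost:8983/solr/update/extract", params));
    }
}
```

The document itself would still be uploaded as the request body, as in the curl example with -F "file=@document.doc".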

Tika and Lucene - direct feeding of the index
                                                          section 4 / 4




Using the Tika facade to feed Lucene

// Index first part of the document
// (parseToString() output is capped, see get/setMaxStringLength())
String text = new Tika().parseToString(new File("/path/to/document.doc"));
document.add(new TextField("text", text, Field.Store.NO));


// Index the full document
Reader reader = new Tika().parse(new File("/path/to/document.doc"));
document.add(new TextField("text", reader));


// Also index some metadata
Metadata metadata = new Metadata();
Reader metadataReader = new Tika().parse(
   new FileInputStream("/path/to/document.doc"), metadata);
document.add(new TextField("text", metadataReader));
document.add(new StringField("type",
   metadata.get(Metadata.CONTENT_TYPE), Field.Store.YES));

Things to consider

    What if the document is larger than your memory?
         Index only first N bytes/characters?
         WriteOutContentHandler supports an explicit write limit
             Enabled by default in the Tika facade, see get/setMaxStringLength()

    What if the document is malformed or intentionally broken?
         Could cause denial of service problems
             Might even crash the entire JVM due to bugs in native libraries in the JDK!
         SecureContentHandler monitors parsing and terminates it if things look bad
             Enabled by default in the Tika facade

    Ultimate solution: forked parsing and the Tika server
         Parse documents in separate, sandboxed JVM processes
         A document could fail to parse, but your application won’t crash
         Code is already there, but still a bit tricky to set up
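
The write limit idea can be sketched with a plain SAX handler that keeps the first N characters and then aborts parsing. This is an illustration using only the JDK, not Tika's WriteOutContentHandler; the class name LimitedTextHandler is hypothetical, and the JDK's SAX parser stands in for a Tika parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that stops parsing once a character limit is reached.
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int limit;
    private boolean limitReached = false;

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int room = limit - text.length();
        if (length > room) {
            // Keep what still fits, then abort the parse
            text.append(ch, start, room);
            limitReached = true;
            throw new SAXException("write limit reached");
        }
        text.append(ch, start, length);
    }

    // Returns at most 'limit' characters of text from the XHTML string.
    static String extractPrefix(String xhtml, int limit) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            // Only swallow the exception if we raised it ourselves
            if (!handler.limitReached) {
                throw new IllegalArgumentException("Malformed XHTML", e);
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractPrefix("<p>Hello Tika</p>", 5));
    }
}
```

Memory use is bounded by the limit rather than by the document size, but a limit does nothing against a parser that hangs or crashes, which is where forked parsing comes in.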

Link extraction for web crawlers

    The LinkContentHandler class can be used to extract all links from a document
         Also works with links in things like PDF, MS Word and email documents
         Use TeeContentHandler to combine with other ways of capturing content

// for example
LinkContentHandler lch = new LinkContentHandler();
BodyContentHandler bch = new BodyContentHandler();
new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...);

System.out.println("Content: " + bch);
for (Link link : lch.getLinks()) {
   System.out.println("Link: " + link);
}
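
Because every format is parsed into XHTML events, link extraction comes down to watching for a elements. Here is a rough sketch of that mechanism using only the JDK. It is not the LinkContentHandler source; the class name LinkCollector is hypothetical, and the JDK's SAX parser stands in for a Tika parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the href of every <a> element it sees.
class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // The default JDK parser is not namespace-aware, so match on qName
        if ("a".equals(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Parses the XHTML string and returns the link targets in document order.
    static List<String> extractLinks(String xhtml) {
        LinkCollector handler = new LinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return handler.links;
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>See <a href=\"http://tika.apache.org/\">"
            + "the Tika site</a>.</p></body></html>";
        for (String link : extractLinks(xhtml)) {
            System.out.println("Link: " + link);
        }
    }
}
```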


Questions?
                                                          http://tika.apache.org/




