Soumettre la recherche
Mettre en ligne
Content extraction with apache tika
•
Télécharger en tant que KEY, PDF
•
11 j'aime
•
16,570 vues
Jukka Zitting
Suivre
Signaler
Partager
Signaler
Partager
1 sur 21
Télécharger maintenant
Recommandé
DCSF19 Dockerfile Best Practices
DCSF19 Dockerfile Best Practices
Docker, Inc.
Building Repeatable Infrastructure using Terraform
Building Repeatable Infrastructure using Terraform
Jeeva Chelladhurai
Red Hat OpenShift on Bare Metal and Containerized Storage
Red Hat OpenShift on Bare Metal and Containerized Storage
Greg Hoelzer
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
Arvind Kumar G.S
Comprehensive Terraform Training
Comprehensive Terraform Training
Yevgeniy Brikman
Helm 3
Helm 3
Matthew Farina
Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners
HubSpot
ISTIO Deep Dive
ISTIO Deep Dive
Yong Feng
Recommandé
DCSF19 Dockerfile Best Practices
DCSF19 Dockerfile Best Practices
Docker, Inc.
Building Repeatable Infrastructure using Terraform
Building Repeatable Infrastructure using Terraform
Jeeva Chelladhurai
Red Hat OpenShift on Bare Metal and Containerized Storage
Red Hat OpenShift on Bare Metal and Containerized Storage
Greg Hoelzer
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
Arvind Kumar G.S
Comprehensive Terraform Training
Comprehensive Terraform Training
Yevgeniy Brikman
Helm 3
Helm 3
Matthew Farina
Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners
HubSpot
ISTIO Deep Dive
ISTIO Deep Dive
Yong Feng
Deep Dive - Infrastructure as Code
Deep Dive - Infrastructure as Code
Amazon Web Services
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
camunda services GmbH
12-Factor Apps
12-Factor Apps
Siva Rama Krishna Chunduru
Metasploit for information gathering
Metasploit for information gathering
Chris Harrington
Docker Container
Docker Container
Seung-Hoon Baek
DevOps Monitoring and Alerting
DevOps Monitoring and Alerting
Khairul Zebua
Introduction to DevOps
Introduction to DevOps
Hawkman Academy
Introduction to Kubernetes Workshop
Introduction to Kubernetes Workshop
Bob Killen
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as Code
Martin Schütte
Introduction to helm
Introduction to helm
Jeeva Chelladhurai
Git
Git
Maks Charuk
Terraform Basics
Terraform Basics
Mohammed Fazuluddin
ELK Stack
ELK Stack
Eberhard Wolff
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
María Angélica Bracho
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
VMware Tanzu Korea
Architecture of Facebook
Architecture of Facebook
Syed Bahadur Shah
Docker & kubernetes
Docker & kubernetes
NexThoughts Technologies
DevOps: What, who, why and how?
DevOps: What, who, why and how?
Red Gate Software
Docker Networking Overview
Docker Networking Overview
Sreenivas Makam
Containers and Docker
Containers and Docker
Damian T. Gordon
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
Alfresco Software
Securing docker containers
Securing docker containers
Mihir Shah
Contenu connexe
Tendances
Deep Dive - Infrastructure as Code
Deep Dive - Infrastructure as Code
Amazon Web Services
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
camunda services GmbH
12-Factor Apps
12-Factor Apps
Siva Rama Krishna Chunduru
Metasploit for information gathering
Metasploit for information gathering
Chris Harrington
Docker Container
Docker Container
Seung-Hoon Baek
DevOps Monitoring and Alerting
DevOps Monitoring and Alerting
Khairul Zebua
Introduction to DevOps
Introduction to DevOps
Hawkman Academy
Introduction to Kubernetes Workshop
Introduction to Kubernetes Workshop
Bob Killen
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as Code
Martin Schütte
Introduction to helm
Introduction to helm
Jeeva Chelladhurai
Git
Git
Maks Charuk
Terraform Basics
Terraform Basics
Mohammed Fazuluddin
ELK Stack
ELK Stack
Eberhard Wolff
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
María Angélica Bracho
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
VMware Tanzu Korea
Architecture of Facebook
Architecture of Facebook
Syed Bahadur Shah
Docker & kubernetes
Docker & kubernetes
NexThoughts Technologies
DevOps: What, who, why and how?
DevOps: What, who, why and how?
Red Gate Software
Docker Networking Overview
Docker Networking Overview
Sreenivas Makam
Containers and Docker
Containers and Docker
Damian T. Gordon
Tendances
(20)
Deep Dive - Infrastructure as Code
Deep Dive - Infrastructure as Code
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
12-Factor Apps
12-Factor Apps
Metasploit for information gathering
Metasploit for information gathering
Docker Container
Docker Container
DevOps Monitoring and Alerting
DevOps Monitoring and Alerting
Introduction to DevOps
Introduction to DevOps
Introduction to Kubernetes Workshop
Introduction to Kubernetes Workshop
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as Code
Introduction to helm
Introduction to helm
Git
Git
Terraform Basics
Terraform Basics
ELK Stack
ELK Stack
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
MSA 전략 2: 마이크로서비스, 어떻게 구현할 것인가?
Architecture of Facebook
Architecture of Facebook
Docker & kubernetes
Docker & kubernetes
DevOps: What, who, why and how?
DevOps: What, who, why and how?
Docker Networking Overview
Docker Networking Overview
Containers and Docker
Containers and Docker
Similaire à Content extraction with apache tika
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
Alfresco Software
Securing docker containers
Securing docker containers
Mihir Shah
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas
Mikael Nilsson
DataCite How To: Use the MDS
DataCite How To: Use the MDS
Frauke Ziedorn
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Amazon Web Services
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
buildacloud
Open writing-cloud-collab
Open writing-cloud-collab
Karen Vuong
Multi Stage Docker Build
Multi Stage Docker Build
Prasenjit Sarkar
People aggregator
People aggregator
Huntor Group
Using Fedora Commons To Create A Persistent Archive
Using Fedora Commons To Create A Persistent Archive
Phil Cryer
Core os dna_automacon
Core os dna_automacon
Patrick Galbraith
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Marco Garcia
What is WebDAV - uploaded by Murali Krishna Nookella
What is WebDAV - uploaded by Murali Krishna Nookella
muralikrishnanookella
Hadoop and object stores can we do it better
Hadoop and object stores can we do it better
gvernik
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
gvernik
Understanding information content with apache tika
Understanding information content with apache tika
Sutthipong Kuruhongsa
Understanding information content with apache tika
Understanding information content with apache tika
Sutthipong Kuruhongsa
Cloudera and Spark setup
Cloudera and Spark setup
Sumit Mendiratta
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Mark Wilkinson
AWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic Beanstalk
Amazon Web Services
Similaire à Content extraction with apache tika
(20)
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
Securing docker containers
Securing docker containers
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas
DataCite How To: Use the MDS
DataCite How To: Use the MDS
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Mythical Mysfits: Monolith to Microservice with Docker and AWS Fargate (CON21...
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
Open Writing! Collaborative Authoring for CloudStack Documentation by Jessica...
Open writing-cloud-collab
Open writing-cloud-collab
Multi Stage Docker Build
Multi Stage Docker Build
People aggregator
People aggregator
Using Fedora Commons To Create A Persistent Archive
Using Fedora Commons To Create A Persistent Archive
Core os dna_automacon
Core os dna_automacon
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
What is WebDAV - uploaded by Murali Krishna Nookella
What is WebDAV - uploaded by Murali Krishna Nookella
Hadoop and object stores can we do it better
Hadoop and object stores can we do it better
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
Understanding information content with apache tika
Understanding information content with apache tika
Understanding information content with apache tika
Understanding information content with apache tika
Cloudera and Spark setup
Cloudera and Spark setup
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, Oxford
AWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic Beanstalk
Plus de Jukka Zitting
The new repository in AEM 6
The new repository in AEM 6
Jukka Zitting
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
Jukka Zitting
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
Jukka Zitting
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository
Jukka Zitting
MicroKernel & NodeStore
MicroKernel & NodeStore
Jukka Zitting
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
Jukka Zitting
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
Jukka Zitting
OSGifying the repository
OSGifying the repository
Jukka Zitting
Repository performance tuning
Repository performance tuning
Jukka Zitting
The return of the hierarchical model
The return of the hierarchical model
Jukka Zitting
Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache Tika
Jukka Zitting
Mime Magic With Apache Tika
Mime Magic With Apache Tika
Jukka Zitting
NoSQL Oakland
NoSQL Oakland
Jukka Zitting
Content Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
Jukka Zitting
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
Jukka Zitting
File System On Steroids
File System On Steroids
Jukka Zitting
Mime Magic With Apache Tika
Mime Magic With Apache Tika
Jukka Zitting
Design and architecture of Jackrabbit
Design and architecture of Jackrabbit
Jukka Zitting
Apache Tika
Apache Tika
Jukka Zitting
Content Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
Jukka Zitting
Plus de Jukka Zitting
(20)
The new repository in AEM 6
The new repository in AEM 6
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository
MicroKernel & NodeStore
MicroKernel & NodeStore
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
OSGifying the repository
OSGifying the repository
Repository performance tuning
Repository performance tuning
The return of the hierarchical model
The return of the hierarchical model
Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache Tika
Mime Magic With Apache Tika
Mime Magic With Apache Tika
NoSQL Oakland
NoSQL Oakland
Content Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
File System On Steroids
File System On Steroids
Mime Magic With Apache Tika
Mime Magic With Apache Tika
Design and architecture of Jackrabbit
Design and architecture of Jackrabbit
Apache Tika
Apache Tika
Content Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
Content extraction with apache tika
1.
Content Extraction with
Apache Tika Jukka Zitting | Tika committer, co-author of Tika in Action © 2012 Adobe Systems Incorporated. All Rights Reserved.
2.
Content Extraction with
Apache Tika Introduction to Apache Tika Full text extraction with Tika Tika and Solr - the ExtractingRequestHandler Tika and Lucene - direct feeding of the index forked parsing link extraction © 2012 Adobe Systems Incorporated. All Rights Reserved. 2
3.
Introduction to Apache
Tika section 1 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
4.
Introduction to Apache
Tika The Apache Tika™ toolkit - detects and extracts - metadata and structured text content - from various documents - using existing parser libraries. © 2012 Adobe Systems Incorporated. All Rights Reserved. 4
5.
Problem domain © 2012
Adobe Systems Incorporated. All Rights Reserved. 5
6.
The Tika solution
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife... Content dc:title= Pride and Prejudice dc:creator= Jane Austen Document dc:date=1813 Metadata © 2012 Adobe Systems Incorporated. All Rights Reserved. 6
7.
Project background
Brief history 2007 Tika started in the Apache Incubator 2008 Tika graduates into a Lucene subproject 2010 Tika becomes a standalone TLP 2011 Tika 1.0 released 2011 Tika in Action published Latest release is Apache Tika 1.2 thousands of known media types most with associated type detection patterns dozens of supported document formats including all major office formats basic language detection etc. For more information http://tika.apache.org/ © 2012 Adobe Systems Incorporated. All Rights Reserved. 7
8.
Ohloh summary (http://www.ohloh.net/p/tika) ©
2012 Adobe Systems Incorporated. All Rights Reserved. 8
9.
Full text extraction
with Tika section 2 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
10.
Demo: tika-app-1.2.jar
https://github.com/jukka/tika-demo java -jar tika-app-1.2.jar © 2012 Adobe Systems Incorporated. All Rights Reserved. 10
11.
tika-app as a
command line tool $ java -jar tika-app-1.2.jar --xhtml /path/to/ document.doc $ java -jar tika-app-1.2.jar --text http://example.com/ document $ java -jar tika-app-1.2.jar --metadata < document.doc $ cat document.doc | java -jar tika-app-1.2.jar --text | grep foo $ java -jar tika-app-1.2.jar --help © 2012 Adobe Systems Incorporated. All Rights Reserved. 11
12.
Tika’s Java API
Divided in two layers The Tika facade: org.apache.tika.Tika Lower-level interfaces like Parser, Detector, etc. Use the Tika facade by default Provides simple support for most common use cases Example: new Tika().parseToString(“/path/to/document.doc”) Use the lower-level interfaces for more power or flexibility Allows fine-grained control of Tika functionality More complicated programming model Parsed content handled as XHTML SAX events Not all functionality is exposed through the Tika facade © 2012 Adobe Systems Incorporated. All Rights Reserved. 12
13.
Tika and Solr
- the ExtractingRequestHandler section 3 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
14.
ExtractingRequestHandler
aka Solr Cell http://wiki.apache.org/solr/ExtractingRequestHandler “Solr's ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.” For example: $ curl "http://localhost:8983/solr/update/extract? literal.id=document&commit=true" -F "file=@document.doc" Supports both text and metadata extraction with plenty of configurable options © 2012 Adobe Systems Incorporated. All Rights Reserved. 14
15.
ExtractingRequestHandler parameters
Helping Tika do it’s job resource.name=document.doc - Helps Tika’s automatic type detection resource.password=secret - Allows Tika to read encrypted documents passwordsFile=/path/to/password-file - Resource name to password mappings for example: .*.pdf$ = pdf-secret Capturing special content xpath=//a - Capture only content inside elements that match the specified query capture=h1 - Capture content inside specific elements to a separate field captureAttr=true - Capture attributes into separate fields named after the element Mapping field names lowernames=true - Normalize metadata field names to “content_type”, etc. © 2012 Adobe Systems Incorporated. All Rights Reserved. 15
16.
Tika and Lucene
- direct feeding of the index section 4 / 4 © 2012 Adobe Systems Incorporated. All Rights Reserved.
17.
Using the Tika
facade to feed Lucene // Index first part of the document String text = new Tika().parseToString(“/path/to/document.doc”); document.add(new TextField(“text”, text, Field.Store.NO)); // Index the full document Reader reader = new Tika().parse(“/path/to/document.doc”); document.add(new TextField(“text”, reader)); // Index also some metadata Metadata metadata = new Metadata(); Reader reader = new Tika().parse( new FileInputStream(“/path/to/document.doc”), metadata); document.add(new TextField(“text”, reader)); document.add(new StringField(“type”, metadata.get(Metadata.CONTENT_TYPE)); © 2012 Adobe Systems Incorporated. All Rights Reserved. 17
18.
Things to consider
What if the document is larger than your memory? Index only first N bytes/characters? WriteOutContentHandler supports an explicit write limit Enabled by default in the Tika facade, see get/setMaxStringLength() What if the document is malformed or intentionally broken? Could cause denial of service problems Might even crash the entire JVM due to bugs in native libraries in the JDK! SecureContentHandler monitors parsing and terminates it if things look bad Enabled by default in the Tika facade Ultimate solution: forked parsing and the Tika server Parse documents in separate, sandboxed JVM processes A document could fail to parse, but your application won’t crash Code is already there, but still a bit tricky to set up © 2012 Adobe Systems Incorporated. All Rights Reserved. 18
19.
Link extraction for
web crawlers The LinkContentHandler class can be used to extract all links from a document Works also with links in things like PDF, MS Word and email documents Use TeeContentHandler to combine with other ways of capturing content // for example LinkContentHandler lch = new LinkContentHandler(); BodyContentHandler bch = new BodyContentHandler(); new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...); System.out.println(“Content: “ + bch); for (Link link : lch.getLinks()) { System.out.println(“Link: “ + link): } © 2012 Adobe Systems Incorporated. All Rights Reserved. 19
20.
Questions?
http://tika.apache.org/ © 2012 Adobe Systems Incorporated. All Rights Reserved.
21.
© 2012 Adobe
Systems Incorporated. All Rights Reserved.
Notes de l'éditeur
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
Télécharger maintenant