SlideShare une entreprise Scribd logo
1  sur  17
Apache Tika An extensible, configurable content analysis framework toolkit
Agenda The Problem The Solution The Project The Design
The Problem PDFBox Apache Poi Apache Xerces ICU4J NekoHTML etc. Lucene index
It’s Worse Than That Licensing Dependencies Metadata extraction Structured content Encryption/Compression Package formats Streaming Processing of digital media ? ? ? ? ? ? ? ?
Agenda The Problem The Solution The Project The Design
The Solution: Technical ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Solution: Legal / Social ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Agenda The Problem The Solution The Project The Design
Project Status ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Current Features ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Project Statistics
Codebase History Lius Nutch Lius Lite Tika textmining Jackrabbit Andy Clark Jukka Zitting Rida Benjelloun Chris Mattman Jerome Charron Sami Siren Bertrand Delacretaz Keith Bennett
Agenda The Problem The Solution The Project The Design
Content Extraction PPT Type: application/vnd.ms-powerpoint Title: Apache Tika Author: Jukka Zitting new  PowerPointParser().parse(…);
Media Type Detection application/vnd.ms-powerpoint MimeTypes types = …; MimeType type = types.getMimeType(…); tika-mimetypes.xml /etc/magic mime.types ?
Combined Detection and Extraction PPT Type: application/vnd.ms-powerpoint Title: Apache Tika Author: Jukka Zitting TXT PDF XML new  AutoDetectParser().parse(…); ?
Agenda The Problem The Solution The Project The Design Thank You!

Contenu connexe

Tendances

Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
WO Community
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
Swapnil & Patil
 
Everything You Always Wanted To Know About SFX ...
Everything You Always Wanted To Know About SFX ...Everything You Always Wanted To Know About SFX ...
Everything You Always Wanted To Know About SFX ...
Louise Penn
 

Tendances (20)

Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Scientific data curation and processing with Apache Tika
Scientific data curation and processing with Apache TikaScientific data curation and processing with Apache Tika
Scientific data curation and processing with Apache Tika
 
Arakno
AraknoArakno
Arakno
 
Lucene
LuceneLucene
Lucene
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
S4
S4S4
S4
 
Fedora Commons in the CLARIN Infrastructure
Fedora Commons in the CLARIN InfrastructureFedora Commons in the CLARIN Infrastructure
Fedora Commons in the CLARIN Infrastructure
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Everything You Always Wanted To Know About SFX ...
Everything You Always Wanted To Know About SFX ...Everything You Always Wanted To Know About SFX ...
Everything You Always Wanted To Know About SFX ...
 

En vedette

Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
mteutelink
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
Manish kumar
 

En vedette (20)

Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
 
Search engine
Search engineSearch engine
Search engine
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Introducción a Solr
Introducción a SolrIntroducción a Solr
Introducción a Solr
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Conferencia 4: Queries
Conferencia 4: QueriesConferencia 4: Queries
Conferencia 4: Queries
 
Conferencia 3: solrconfig.xml
Conferencia 3: solrconfig.xmlConferencia 3: solrconfig.xml
Conferencia 3: solrconfig.xml
 
Conferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo SolrConferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo Solr
 

Similaire à Apache Tika

Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...
Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...
Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...
Zaven Hakopov
 
Fedora Overview
Fedora OverviewFedora Overview
Fedora Overview
eposthumus
 
Hypatia for dlf 2011
Hypatia for dlf 2011Hypatia for dlf 2011
Hypatia for dlf 2011
DLFCLIR
 

Similaire à Apache Tika (20)

Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Autopsy 3.0 - Open Source Digital Forensics Conference
Autopsy 3.0 - Open Source Digital Forensics ConferenceAutopsy 3.0 - Open Source Digital Forensics Conference
Autopsy 3.0 - Open Source Digital Forensics Conference
 
OSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian Carrier
OSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian CarrierOSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian Carrier
OSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian Carrier
 
Getting Ready for Project Cortex and SharePoint Syntex
Getting Ready for Project Cortex and SharePoint SyntexGetting Ready for Project Cortex and SharePoint Syntex
Getting Ready for Project Cortex and SharePoint Syntex
 
Flex vs HTML5
Flex vs HTML5Flex vs HTML5
Flex vs HTML5
 
Schema.org Update at ISWC2012
Schema.org Update at ISWC2012Schema.org Update at ISWC2012
Schema.org Update at ISWC2012
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Getting Ready for Project Cortex and SharePoint Syntex
Getting Ready for Project Cortex and SharePoint SyntexGetting Ready for Project Cortex and SharePoint Syntex
Getting Ready for Project Cortex and SharePoint Syntex
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
Microsoft SharePoint Syntex
Microsoft SharePoint SyntexMicrosoft SharePoint Syntex
Microsoft SharePoint Syntex
 
Desktop integration & ECM
Desktop integration & ECMDesktop integration & ECM
Desktop integration & ECM
 
A MEDIA SHARING PLATFORM BUILT WITH OPEN SOURCE SOFTWARE
A MEDIA SHARING PLATFORM BUILT WITH OPEN SOURCE SOFTWAREA MEDIA SHARING PLATFORM BUILT WITH OPEN SOURCE SOFTWARE
A MEDIA SHARING PLATFORM BUILT WITH OPEN SOURCE SOFTWARE
 
Multimedia system(OPEN DOCUMENT ARCHITECTURE AND INTERCHANGING FORMAT)
Multimedia system(OPEN DOCUMENT ARCHITECTURE AND INTERCHANGING FORMAT)Multimedia system(OPEN DOCUMENT ARCHITECTURE AND INTERCHANGING FORMAT)
Multimedia system(OPEN DOCUMENT ARCHITECTURE AND INTERCHANGING FORMAT)
 
Multimedia system
Multimedia systemMultimedia system
Multimedia system
 
Using DITA for Online Help
Using DITA for Online HelpUsing DITA for Online Help
Using DITA for Online Help
 
Getting Ready for Project Cortex
Getting Ready for Project CortexGetting Ready for Project Cortex
Getting Ready for Project Cortex
 
Tech WG report 2011
Tech WG report 2011Tech WG report 2011
Tech WG report 2011
 
Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...
Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...
Zaven Akopov (DESY -L-) For the INSPIRE Collaboration DESY ...
 
Fedora Overview
Fedora OverviewFedora Overview
Fedora Overview
 
Hypatia for dlf 2011
Hypatia for dlf 2011Hypatia for dlf 2011
Hypatia for dlf 2011
 

Plus de Jukka Zitting

MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
Jukka Zitting
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
Jukka Zitting
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
Jukka Zitting
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
Jukka Zitting
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
Jukka Zitting
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
Jukka Zitting
 

Plus de Jukka Zitting (16)

The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
 
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository
 
MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
 
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repository
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuning
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
 
NoSQL Oakland
NoSQL OaklandNoSQL Oakland
NoSQL Oakland
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On Steroids
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Apache Tika

  • 1. Apache Tika An extensible, configurable content analysis framework toolkit
  • 2. Agenda The Problem The Solution The Project The Design
  • 3. The Problem PDFBox Apache Poi Apache Xerces ICU4J NekoHTML etc. Lucene index
  • 4. It’s Worse Than That Licensing Dependencies Metadata extraction Structured content Encryption/Compression Package formats Streaming Processing of digital media ? ? ? ? ? ? ? ?
  • 5. Agenda The Problem The Solution The Project The Design
  • 6.
  • 7.
  • 8. Agenda The Problem The Solution The Project The Design
  • 9.
  • 10.
  • 12. Codebase History Lius Nutch Lius Lite Tika textmining Jackrabbit Andy Clark Jukka Zitting Rida Benjelloun Chris Mattman Jerome Charron Sami Siren Bertrand Delacretaz Keith Bennett
  • 13. Agenda The Problem The Solution The Project The Design
  • 14. Content Extraction PPT Type: application/vnd.ms-powerpoint Title: Apache Tika Author: Jukka Zitting new PowerPointParser().parse(…);
  • 15. Media Type Detection application/vnd.ms-powerpoint MimeTypes types = …; MimeType type = types.getMimeType(…); tika-mimetypes.xml /etc/magic mime.types ?
  • 16. Combined Detection and Extraction PPT Type: application/vnd.ms-powerpoint Title: Apache Tika Author: Jukka Zitting TXT PDF XML new AutoDetectParser().parse(…); ?
  • 17. Agenda The Problem The Solution The Project The Design Thank You!