Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
OpenSource Connections
1.
Apache Tika
Tim Allison, tallison@apache.org, @_tallison
April 24, 2019, Haystack Conference
© 2019 The MITRE Corporation. All rights reserved. Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6
2.
Overview
▪ What is Tika
▪ tika-eval
▪ Running Tika safely
▪ Coming out in 1.21 and beyond
3.
Text/Metadata Extraction
4.
Things Can Happen
▪ Tired:
– Exceptions
– Unsupported file formats
– Encrypted files
– Garbled text
– Missing text
▪ Wired:
– OOM
– Seg fault
– Infinite loops
– Multithreaded garbage collector pegging all CPU resources
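The "Wired" failures above (OOM, seg faults, infinite loops) cannot be caught as ordinary exceptions inside the calling process, which is why extraction is often isolated in a child process with a time limit. The following is a minimal sketch of that general pattern, not Tika's own API: the function name `run_isolated` and the status labels are assumptions made for illustration, and the command passed in stands for whatever extraction command you actually run.

```python
import subprocess


def run_isolated(cmd, timeout_s):
    """Run an extraction command in a child process and classify the outcome.

    `cmd` is a hypothetical argument list for the extraction command.
    Returns a (status, output) tuple; status is "ok", "timeout", or "crash".
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising,
        # so a parser stuck in an infinite loop does not linger.
        return "timeout", ""
    if proc.returncode != 0:
        # Non-zero exit covers ordinary failures; on POSIX a negative
        # return code means the child died from a signal (e.g. a seg fault).
        return "crash", proc.stderr
    return "ok", proc.stdout
```

Because the parser runs in its own process, a crash or hang costs one file rather than the whole indexing job, and the caller can record the status per file for later review.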
5.
Stands up on Soap Box
6.
Upgrade from PDFBox 1.8.6 -> 1.8.7
7.
Soap Box
If your search system can’t tell the difference between those two…
8.
Soap Box
If your search system can’t tell the difference between those two…
You don’t have a search system.
9.
Soap Box
If your search system can’t tell the difference between those two…
You don’t have a search system.
👍 You’ve got a neat, little demo! 👍
10.
Steps Off of Soap Box
11.
tika-eval
▪ Profile individual runs
▪ Compare two runs
▪ Exceptions by mime
▪ Out of vocabulary (OOV) statistics
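Two of the statistics listed above can be illustrated with a short sketch: an out-of-vocabulary ratio (the share of extracted tokens not found in a reference word list, which tends to rise when text is garbled) and a count of exceptions grouped by mime type. This is an illustration under stated assumptions, not tika-eval's implementation: the function names, the letters-only tokenizer, and the record format are all hypothetical.

```python
import re
from collections import Counter


def oov_ratio(text, vocabulary):
    """Fraction of alphabetic tokens in `text` that are absent from `vocabulary`.

    Tokenization here is a simple assumption: lowercase runs of letters.
    Returns 0.0 for text with no tokens.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


def exceptions_by_mime(records):
    """Count failed parses per mime type.

    `records` is a hypothetical iterable of (mime, had_exception) pairs.
    """
    return dict(Counter(mime for mime, failed in records if failed))
```

A high OOV ratio on a file whose language is well covered by the word list is a hint that extraction produced junk even though no exception was thrown.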
12.
tika-eval: Eating our own dog food
▪ 3 million files (~1 TB) from Common Crawl and govdocs1 hosted on a public virtual machine, provided by Rackspace
▪ Code to profile a single run or compare two runs before release
▪ Evaluation methodology co-developed with and now co-run by open source colleagues (around the world) on the MSOffice parser project and the PDF parser project
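Comparing two runs, as described above, comes down to asking how much the text extracted from the same file changed between parser versions. A minimal sketch of one way to do that follows, using a Dice-style overlap on token counts. The function names and the returned fields are assumptions for illustration and do not reproduce tika-eval's actual output.

```python
import re
from collections import Counter


def token_counts(text):
    """Count lowercase word tokens in an extract."""
    return Counter(re.findall(r"\w+", text.lower()))


def compare_extracts(text_a, text_b):
    """Compare the text extracted from one file by two runs (A and B)."""
    a, b = token_counts(text_a), token_counts(text_b)
    total_a, total_b = sum(a.values()), sum(b.values())
    shared = sum((a & b).values())  # multiset intersection of the tokens
    denominator = total_a + total_b
    # Dice-style overlap: 1.0 means identical token multisets.
    overlap = 2 * shared / denominator if denominator else 1.0
    return {
        "tokens_a": total_a,
        "tokens_b": total_b,
        "overlap": overlap,
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
    }
```

Sorting files by lowest overlap gives a short list of the documents most worth inspecting by hand before a release.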
13.
Tika 1.21 and beyond
▪ Tika 1.21
– csv/tsv detector and parser (Apache commons-csv)
– Improved zip-based (.docx, .pptx, .xlsx) file detection and parsing
▪ Beyond
– Modularize tika-eval and include stats within the extract for scalability and aggregation of stats w/in Solr/Elastic
– Increase coverage/speed of zip-based file detection; can we move entirely to streaming detection?
– Improve language coverage/lang id component w/in tika-eval
▪ Help!
– What do you need?
– How can you help us help you?
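To make the zip-based detection item concrete: .docx, .pptx, and .xlsx files are zip containers, so the extension or the zip magic bytes alone cannot tell them apart, and a detector has to look at the entries inside. The sketch below checks for the conventional main part name of each format. It is not Tika's detector; the function name and the marker table are assumptions for illustration. It also reads the zip central directory at the end of the file, so it is not the streaming detection the slide asks about.

```python
import io
import zipfile

# Conventional main part names written by Office applications (assumed here
# as markers; a strict detector would follow the package relationships).
OOXML_MARKERS = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def detect_zip_based(data):
    """Guess a mime type for `data` (bytes) by inspecting zip entry names."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "application/octet-stream"
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    for marker, mime in OOXML_MARKERS.items():
        if marker in names:
            return mime
    return "application/zip"
```

A streaming variant would instead read local file headers in order and stop at the first recognizable entry, which avoids buffering the whole file but depends on the order in which entries were written.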