Introducing Hydra – An Open Source Document Processing Framework


Presented by Joel Westberg, Findwise AB. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

This presentation describes Hydra, a document-processing framework developed by Findwise, and the problem it aims to solve. It first discusses the need for scalable document processing and argues that the open source tool chain lacks a link between the source system and the search engine. It then describes Hydra's design goals and how the implementation meets the demands for flexibility, robustness and ease of use. The session ends with some of the possibilities this new pipeline framework offers, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and a proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.


Introducing Hydra
An Open Source Document Processing Framework

Joel Westberg

© FINDWISE 2012

About Findwise

•  Founded in 2005
•  Offices in Sweden, Denmark, Norway and Poland
•  80 employees (April 2012)
•  Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value


Technology independent
Creating search-driven Findability solutions based on market-leading commercial and open source search technology platforms:

•  Autonomy IDOL
•  Microsoft (SharePoint and FAST Search products)
•  Google GSA
•  IBM ICA/OmniFind
•  LucidWorks
•  Apache Lucene/Solr
•  Elastic Search
•  and more…

Connecting source to search
Garbage in, garbage out. But what about unstructured data in?

•  Flat data is richer than it appears
•  Don't discard information too soon!

The unstructured structured data paradox

Example: News articles
Plain text that contains invaluable metadata for search, such as:

•  Title
•  Author byline
•  Lead paragraph
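To make the news-article example concrete, here is a minimal sketch of how a title, byline and lead paragraph could be recovered from plain text. The layout convention (first line is the title, an optional line starting with "By " is the byline, the next paragraph is the lead), the class name `NewsArticleFields` and the sample text are all assumptions for illustration; the presentation does not specify how such extraction is done.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NewsArticleFields {

    // Assumed layout: line 1 = title, optional "By ..." line = byline,
    // first non-empty paragraph after that = lead paragraph.
    static Map<String, String> extract(String plainText) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = plainText.strip().split("\\R");
        int i = 0;

        fields.put("title", lines[i++].strip());

        // Skip blank lines between the title and the byline, if any.
        while (i < lines.length && lines[i].isBlank()) i++;
        if (i < lines.length && lines[i].strip().startsWith("By ")) {
            fields.put("author", lines[i].strip().substring(3).strip());
            i++;
        }

        // The lead is the first block of consecutive non-blank lines.
        while (i < lines.length && lines[i].isBlank()) i++;
        StringBuilder lead = new StringBuilder();
        while (i < lines.length && !lines[i].isBlank()) {
            if (lead.length() > 0) lead.append(' ');
            lead.append(lines[i].strip());
            i++;
        }
        if (lead.length() > 0) fields.put("lead", lead.toString());
        return fields;
    }

    public static void main(String[] args) {
        String article = "Hydra released\nBy Jane Doe\n\n"
                + "Findwise has published Hydra.\nIt is open source.\n\n"
                + "More text follows.";
        extract(article).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Real articles rarely follow one fixed layout, so a production stage would need per-source rules; the point is only that "flat" text often carries recoverable structure.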

Enrichment and structuring possibilities

•  Enrich your documents with metadata, to power your search
    •  Language detection
    •  Sentiment analysis
    •  Headline extraction
    •  Regular expression matching and extraction
•  Filter out unwanted documents
•  Collect statistics
•  Export to staging environments
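As an illustration of the "regular expression matching and extraction" item above, the sketch below copies every match of a pattern from one document field into a new metadata field. Representing a document as a `Map<String, Object>`, the class name `RegexEnricher` and the field names are assumptions made for this example; they are not Hydra's document model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEnricher {
    private final Pattern pattern;
    private final String sourceField;
    private final String targetField;

    RegexEnricher(String regex, String sourceField, String targetField) {
        this.pattern = Pattern.compile(regex);
        this.sourceField = sourceField;
        this.targetField = targetField;
    }

    // Stores all matches found in sourceField as a list under targetField.
    // Documents without a string in sourceField, or without matches, are left untouched.
    void enrich(Map<String, Object> doc) {
        Object raw = doc.get(sourceField);
        if (!(raw instanceof String)) return;

        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher((String) raw);
        while (m.find()) {
            matches.add(m.group());
        }
        if (!matches.isEmpty()) {
            doc.put(targetField, matches);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("text", "Published 2012-05-09, updated 2012-05-10.");
        new RegexEnricher("\\d{4}-\\d{2}-\\d{2}", "text", "dates").enrich(doc);
        System.out.println(doc.get("dates"));
    }
}
```

The original text field is kept alongside the extracted values, in line with the earlier advice not to discard information too soon.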

Main Design Objectives

Scalability
•  Horizontally scalable central repository
•  Independent processing nodes

Failure tolerant
•  Failure of a stage affects only a single document
•  Failure of a node affects at most n documents
•  Failures can be automatically detected

Robustness
•  Independent stages

Development ease
•  Debug stages from IDE against actual data
•  Allow test-driven pipeline development
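The objective that a failing stage affects only a single document can be sketched as a runner that catches errors per document and carries on with the rest. This is a simplified, single-process illustration under assumed names (`IsolatingRunner`, its `Stage` interface, the `error` field); it does not show Hydra's central repository or its actual failure handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IsolatingRunner {

    /** Hypothetical stand-in for a processing stage; not Hydra's interface. */
    interface Stage {
        void process(Map<String, Object> doc) throws Exception;
    }

    /** Runs every stage on every document and returns the documents that failed. */
    static List<Map<String, Object>> run(List<Map<String, Object>> docs, List<Stage> stages) {
        List<Map<String, Object>> failed = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            try {
                for (Stage stage : stages) {
                    stage.process(doc);
                }
            } catch (Exception e) {
                // Record the failure on this document only, then continue with the next one.
                doc.put("error", e.getMessage());
                failed.add(doc);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (String text : new String[] {"ok", "", "fine"}) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("text", text);
            docs.add(doc);
        }
        Stage rejectEmpty = d -> {
            if (((String) d.get("text")).isEmpty()) {
                throw new IllegalStateException("empty text");
            }
        };
        Stage markSeen = d -> d.put("seen", Boolean.TRUE);

        List<Map<String, Object>> failed = run(docs, Arrays.asList(rejectEmpty, markSeen));
        System.out.println("failed documents: " + failed.size());
    }
}
```

Because each stage only sees one document at a time, the same stages can be exercised in a unit test or from an IDE with a handful of sample documents, which is what makes test-driven pipeline development practical.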

Writing a Stage - Example

    @Stage(description="This is a Simple Writer")
    public class SimpleWriter extends AbstractProcessStage {

        @Parameter(description="Name of field to write value to")
        private String field;

        @Parameter(description="Value to write")
        private Object value;

        // Writes the configured value into the configured field of the document.
        @Override
        public void process(LocalDocument doc) throws ProcessException {
            doc.putContentField(field, value);
        }

        // Rejects a configuration in which the required field name is missing.
        @Override
        public void init() throws RequiredArgumentMissingException {
            if (field == null) {
                throw new RequiredArgumentMissingException("field is missing");
            }
        }
    }
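The stage above has private, annotated fields and no setters, so something outside the class must populate them before `init()` runs. The slides do not show how Hydra does this. The sketch below is one plausible mechanism, using plain Java reflection with a made-up `@Param` annotation and a `ToyWriter` class that mirrors `SimpleWriter` without depending on any Hydra types. All names here are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

class ParamInjectionSketch {

    /** Hypothetical annotation standing in for the @Parameter used in the slide. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Param {
        String description() default "";
    }

    /** Toy stage with the same shape as SimpleWriter, built on a plain Map. */
    static class ToyWriter {
        @Param(description = "Name of field to write value to")
        private String field;

        @Param(description = "Value to write")
        private Object value;

        void init() {
            if (field == null) {
                throw new IllegalStateException("field is missing");
            }
        }

        void process(Map<String, Object> doc) {
            doc.put(field, value);
        }
    }

    /** Copies config entries into @Param-annotated fields, matched by field name. */
    static void inject(Object stage, Map<String, Object> config) {
        for (Field f : stage.getClass().getDeclaredFields()) {
            if (f.isAnnotationPresent(Param.class) && config.containsKey(f.getName())) {
                try {
                    f.setAccessible(true);
                    f.set(stage, config.get(f.getName()));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot set " + f.getName(), e);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put("field", "source");
        config.put("value", "intranet");

        ToyWriter stage = new ToyWriter();
        inject(stage, config);
        stage.init();

        Map<String, Object> doc = new HashMap<>();
        stage.process(doc);
        System.out.println(doc);
    }
}
```

Keeping configuration declarative in this way means a stage can be instantiated and checked in a unit test with nothing more than a small map of parameters.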

Hadoop/Big Data integration

Use cases for document enrichment
•  PageRank
•  Analytics

Hadoop & Map/Reduce advantages
•  Huge scalability
•  Ability to work on entire document set at once

Hadoop & Map/Reduce drawbacks
•  Batch processing
•  Time-to-index
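PageRank shows why some enrichment needs the entire document set at once: a page's score depends on the scores of every page linking to it. The sketch below is a small in-memory power iteration, not a Hadoop job. The damping factor of 0.85, the fixed iteration count and the class name `PageRankSketch` are assumptions chosen for illustration, and the code assumes every link target also appears as a key in the link map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PageRankSketch {

    // links maps each page to the pages it links to.
    // Assumption: every link target is itself a key in the map.
    static Map<String, Double> rank(Map<String, List<String>> links, double damping, int iterations) {
        int n = links.size();
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0 / n);
        }

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) {
                next.put(page, (1.0 - damping) / n);
            }
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                double current = ranks.get(e.getKey());
                List<String> out = e.getValue();
                if (out.isEmpty()) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (String page : links.keySet()) {
                        next.merge(page, damping * current / n, Double::sum);
                    }
                } else {
                    for (String target : out) {
                        next.merge(target, damping * current / out.size(), Double::sum);
                    }
                }
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Collections.singletonList("c"));
        links.put("c", Collections.singletonList("a"));
        rank(links, 0.85, 50).forEach((page, score) -> System.out.println(page + " " + score));
    }
}
```

Each iteration reads every page's current score, which is the batch-style access pattern that suits Map/Reduce and also explains the time-to-index drawback: scores are only available after the whole set has been processed.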

Hadoop/Big Data integration (diagram legend)

Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents

Open Source initiative

•  Other committers
•  The role of Findwise

For more information:

•  http://www.findwise.com/hydra
•  http://findwise.github.com/Hydra
•  Email: joel.westberg@findwise.com

Thank you!

Joel Westberg
joel.westberg@findwise.com


Slides with no extractable text beyond the title:

•  5. Generic Search Architecture
•  8. Classic Pipeline
•  9. Classic Architecture
•  10. The Hydra Architecture
•  12. The Hydra Architecture
•  16. Future Configuration UI
•  18. Questions?
