Presented by Joel Westberg, Findwise AB - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This presentation details Hydra, the document-processing framework developed by Findwise. It describes the framework and the problem it aims to solve. We first discuss the need for scalable document processing and point out that the open source chain lacks a link bridging the gap between the source system and the search engine. We then describe the design goals of Hydra and how it has been implemented to meet those demands for flexibility, robustness and ease of use. The session ends by discussing some of the possibilities this new pipeline framework offers, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and a proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.
2. About Findwise
• Founded in 2005
• Offices in Sweden, Denmark, Norway and Poland
• 80 employees (April 2012)
• Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value
4. Technology independent!
Creating search-driven Findability solutions based on market-leading commercial and open source search technology platforms:
✓ Autonomy IDOL
✓ Microsoft (SharePoint and FAST Search products)
✓ Google GSA
✓ IBM ICA/OmniFind
✓ LucidWorks
✓ Apache Lucene/Solr
✓ Elastic Search
✓ and more…
6. Connecting source to search!
Garbage in, garbage out. But what about unstructured data in?
• Flat data is richer than it appears
• Don’t discard information too soon!
The unstructured structured data paradox
Example: News articles
Plain text that contains invaluable metadata for search, such as:
• Title
• Author byline
• Lead paragraph
7. Enrichment and structuring possibilities!
• Enrich your documents with metadata, to power your search
• Language detection
• Sentiment analysis
• Headline extraction
• Regular expression matching and extraction
• Filter out unwanted documents
• Collect statistics
• Export to Staging environments
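As an illustration of regular expression matching and extraction, here is a minimal Java sketch that pulls a title, an author byline and a lead paragraph out of a plain-text news article. It is not Hydra code: the class name `ArticleFieldSketch`, the "By <name>" byline convention, the blank-line paragraph separator and the sample article are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hydra.
class ArticleFieldSketch {
    // Assumed convention: the byline is a line of the form "By <name>".
    private static final Pattern BYLINE = Pattern.compile("(?m)^By\\s+(.+)$");

    // Splits a plain-text article into title, author and lead paragraph.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Paragraphs are assumed to be separated by blank lines.
        String[] blocks = text.trim().split("\\R\\s*\\R");

        // Assumption: the first line of the first block is the title.
        fields.put("title", blocks[0].split("\\R")[0].trim());

        // Look for the byline inside the header block only.
        Matcher byline = BYLINE.matcher(blocks[0]);
        if (byline.find()) {
            fields.put("author", byline.group(1).trim());
        }

        // Assumption: the first paragraph after the header block is the lead.
        if (blocks.length > 1) {
            fields.put("lead", blocks[1].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample article, used only to exercise the sketch.
        String article = "Hydra released as open source\n"
                + "By Jane Doe\n"
                + "\n"
                + "Findwise has published its pipeline framework.\n"
                + "\n"
                + "Further details follow in the body text.";
        extract(article).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Real articles rarely follow one layout, so a production stage would probably need per-source patterns. The point is only that flat text can yield separate, searchable fields.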
11. Main Design Objectives!
Scalability
• Horizontally scalable central repository
• Independent processing nodes
Failure tolerant
• Failure of a stage affects only a single document
• Failure of a node affects at most n documents
• Failures can be automatically detected
Robustness
• Independent stages
Development ease
• Debug stages from IDE against actual data
• Allow test driven pipeline development
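The objective that a failing stage affects only a single document can be sketched in a few lines of Java. This is a simplified, in-process illustration and not how Hydra is built: Hydra's stages are independent and work against a central repository, whereas here everything runs in one loop. All names (`FailureIsolationSketch`, `Doc`, `Stage`, `UppercaseTitle`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not Hydra classes.
class FailureIsolationSketch {

    static class Doc {
        final Map<String, Object> fields = new HashMap<>();
        String failedStage; // stays null while the document is healthy
    }

    interface Stage {
        String name();
        void process(Doc doc) throws Exception;
    }

    // Example stage: upper-cases the title and fails if there is none.
    static class UppercaseTitle implements Stage {
        public String name() { return "uppercase-title"; }

        public void process(Doc doc) throws Exception {
            Object title = doc.fields.get("title");
            if (title == null) {
                throw new Exception("title is missing");
            }
            doc.fields.put("title", title.toString().toUpperCase(Locale.ROOT));
        }
    }

    // Runs each document through all stages. An exception marks only the
    // current document as failed; the remaining documents are still processed.
    static List<Doc> run(List<Stage> stages, List<Doc> docs) {
        List<Doc> failed = new ArrayList<>();
        for (Doc doc : docs) {
            for (Stage stage : stages) {
                try {
                    stage.process(doc);
                } catch (Exception e) {
                    doc.failedStage = stage.name();
                    failed.add(doc);
                    break; // skip the remaining stages for this document only
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Doc good = new Doc();
        good.fields.put("title", "hydra");
        Doc bad = new Doc(); // no title, so the stage will throw
        Doc alsoGood = new Doc();
        alsoGood.fields.put("title", "solr");

        List<Stage> stages = new ArrayList<>();
        stages.add(new UppercaseTitle());
        List<Doc> failed = run(stages, Arrays.asList(good, bad, alsoGood));

        System.out.println("failed documents: " + failed.size());
        System.out.println("title after the failure: " + alsoGood.fields.get("title"));
    }
}
```

Recording the failing stage on the document itself is one way to make failures detectable automatically, since a monitor only has to look for documents with that marker set.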
13. Writing a Stage - Example!
@Stage(description="This is a Simple Writer")
public class SimpleWriter extends AbstractProcessStage {

    @Parameter(description="Name of field to write value to")
    private String field;

    @Parameter(description="Value to write")
    private Object value;

    @Override
    public void process(LocalDocument doc) throws ProcessException {
        doc.putContentField(field, value);
    }

    @Override
    public void init() throws RequiredArgumentMissingException {
        if (field == null) {
            throw new RequiredArgumentMissingException("field is missing");
        }
    }
}
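The slide does not show how values reach the annotated fields before `init()` is called. One plausible mechanism is reflection-based injection from a configuration map, sketched below. This is an assumption for illustration and Hydra's actual mechanism may differ: the annotation `Param`, the class `Writer` and the method `inject` are hypothetical stand-ins and not Hydra's API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types belong to Hydra.
class ParameterInjectionSketch {

    // Stand-in for the @Parameter annotation shown on the slide.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Param {
        String description() default "";
    }

    // Stand-in for SimpleWriter, writing into a plain map instead of a document.
    static class Writer {
        @Param(description = "Name of field to write value to")
        private String field;

        @Param(description = "Value to write")
        private Object value;

        void process(Map<String, Object> doc) {
            doc.put(field, value);
        }
    }

    // Copies config entries into fields annotated with @Param,
    // matching config keys against field names.
    static void inject(Object stage, Map<String, Object> config) {
        for (Field f : stage.getClass().getDeclaredFields()) {
            if (f.isAnnotationPresent(Param.class) && config.containsKey(f.getName())) {
                try {
                    f.setAccessible(true);
                    f.set(stage, config.get(f.getName()));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot set " + f.getName(), e);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        config.put("field", "source");
        config.put("value", "intranet");

        Writer writer = new Writer();
        inject(writer, config);

        Map<String, Object> doc = new HashMap<>();
        writer.process(doc);
        System.out.println(doc);
    }
}
```

Keeping configuration in annotated fields means a stage can be instantiated and configured directly in a unit test or from an IDE, which fits the stated goal of test driven pipeline development.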
14. Hadoop/Big Data integration!
Use cases for document enrichment
• PageRank
• Analytics
Hadoop & Map/Reduce advantages
• Huge scalability
• Ability to work on entire document set at once
Hadoop & Map/Reduce drawbacks
• Batch processing
• Time-to-index
15. Hadoop/Big Data integration!
Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents
17. Open Source initiative!
• Other committers
• The role of Findwise
For more information:
• http://www.findwise.com/hydra
• http://findwise.github.com/Hydra
• Email: joel.westberg@findwise.com