SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr

Yann Yu
Systems Engineer @ Lucidworks
Who am I?

Lucidworks is Search.
Technology, Retail, Financial Services, Industrial, Healthcare

Why would you integrate Hadoop and Solr?
(and how would you do that?)

Hadoop
• Open-source
• Enterprise support
• Cheap, scalable storage
• Distributed computation
• Farm animals for extensibility
Solr
• Open-source, Lucene-based
• Enterprise support
• Real-time queries
• Full-text search
• NoSQL capabilities
• Repeatedly proven in production environments at massive scale

I have Hadoop, why do I need Solr?
• NoSQL front-end to Hadoop: Enable fast, ad-hoc search across
structured and unstructured big data
• Empower users of all technical ability to interact with, and derive
value from, big data — all using a natural language search interface
(no MapReduce, Pig, SQL, etc.)
• Preliminary data exploration and analysis
• Near real-time indexing and querying
• Thousands of simultaneous, parallel requests
• Share machine-learning insights created on Hadoop with a broad
audience through an interactive medium
Hadoop excels at storing and working with large amounts of data,
but has difficulty with frequent, random access to it

I have Solr, why do I need Hadoop?
• Least expensive storage solution on the market
• Leverage Hadoop processing power (MapReduce) to build
indexes or send document updates to Solr
• Store Solr indexes and transaction logs within HDFS
• Augment Solr data by storing additional information in Hadoop
for last-second retrieval
As Solr indexes grow in size, the size and number of the machines hosting Solr
must also grow, increasing index time and complexity

So what does this actually look like?

The enterprise storage situation today

Enterprise data deployment
Enterprise documents are stored in HDFS
The Lucidworks HDFS connector processes documents and sends them to SolrCloud
Users make ad-hoc, full-text queries across the full content of all documents in Solr
Users retrieve source files directly from HDFS as necessary
Standard document storage and search

Sink documents into HDFS
• Documents can be migrated from other file
storage systems via Flume or custom scripts
• MapReduce allows for batch processing of
documents (e.g. OCR, NER, clustering)

Index document contents into Solr
• The Lucidworks Hadoop connector parses content from
files using many different tools
  • Tika, GrokIngest, CSV mapping, Pig, etc.
• Content and data are added to fields in a Solr document
• The resulting document is sent to Solr for indexing
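
The last two steps above (filling fields, then sending the document to Solr) can be sketched in a few lines of Python. This is only an illustration, not the Lucidworks connector: the host, the collection name `documents`, and the field names are hypothetical, and the request is built but not sent. It assumes Solr's JSON update handler, which accepts an array of documents posted to `/update`.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: host and collection name are illustrative only.
SOLR_UPDATE_URL = "http://localhost:8983/solr/documents/update?commit=true"


def build_solr_document(path, text, metadata):
    """Map parsed file content and metadata onto Solr fields."""
    doc = {"id": path, "content_txt": text}
    doc.update(metadata)
    return doc


def build_update_request(docs):
    """Wrap a list of documents in a JSON update request (not sent here)."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names with _s / _txt suffixes assume dynamic fields in the schema.
doc = build_solr_document(
    "hdfs:///corp/docs/report.pdf",
    "Quarterly results ...",
    {"author_s": "jdoe", "content_type_s": "application/pdf"},
)
request = build_update_request([doc])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would submit the batch to a running Solr instance.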

Enable users to search and access content
• Users are empowered with ad-hoc,
full-text search in Solr
• Provides standard search tools such as autocomplete,
more-like-this, spellchecking, faceting, etc.
• Users only access HDFS as needed
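
As a rough sketch of what an ad-hoc, faceted search looks like on the wire, the snippet below builds a query URL for Solr's `/select` handler. The host, collection, and field names are hypothetical; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host and collection name are illustrative only.
SOLR_SELECT_URL = "http://localhost:8983/solr/documents/select"


def build_search_url(user_query, facet_fields, rows=10):
    """Build a full-text query URL that also requests facet counts."""
    params = [
        ("q", user_query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field may be repeated, once per field to facet on.
    params.extend(("facet.field", field) for field in facet_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url("quarterly results", ["author_s", "content_type_s"])
print(url)
```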

Log record search
Machine-generated log records are sent to Flume
Flume forwards each raw log record to Hadoop for archiving
Flume simultaneously parses the data in each record into a Solr document
and forwards the resulting document to Solr
Lucidworks SiLK exposes real-time statistics and analytics to end users,
as well as full-text search
High volume indexing of many small records

Flume archives data in HDFS
• Flume performs minimal work on log
files and sends them directly into
HDFS for archival
• Under optimal circumstances, the log
files are sized to the block size of
HDFS
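
A back-of-the-envelope sketch of why file sizing matters: every file occupies at least one block entry in HDFS metadata, so many small log files cost far more entries than the same data rolled into block-sized files. The 128 MiB block size below is an assumption (it is set by `dfs.blocksize` and varies by cluster), and the record and file sizes are made up for illustration.

```python
# Assumed HDFS block size; the real value comes from dfs.blocksize.
BLOCK_SIZE = 128 * 1024 * 1024


def blocks_used(file_size):
    """Number of HDFS blocks needed for a file of file_size bytes."""
    return -(-file_size // BLOCK_SIZE)  # integer ceiling division


def records_per_file(avg_record_size):
    """How many records to batch so a file roughly fills one block."""
    return BLOCK_SIZE // avg_record_size


# 10,000 log files of 1 MiB each versus the same data in block-sized files.
small_files = 10_000
total_bytes = small_files * 1024 * 1024
print(small_files * blocks_used(1024 * 1024))  # blocks for the small files
print(blocks_used(total_bytes))                # blocks for the rolled-up data
print(records_per_file(512))                   # ~512-byte records per file
```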

Flume submits records to Solr
• Flume processes records, extracting
strings, ints, dates, times, and other
information into Solr fields
• Once the Solr document is created, it
is submitted to Solr for indexing
• This process happens in real-time,
allowing for near real-time search
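
The extraction step can be illustrated outside of Flume with a small Python sketch. This is not Flume configuration; it only shows the kind of parsing described above. The log format, the field names, and the assumption that timestamps are already in UTC are all hypothetical.

```python
import re
from datetime import datetime

# Hypothetical log format: "<timestamp> <LEVEL> <host> <latency>ms <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<host>\S+) "
    r"(?P<latency>\d+)ms (?P<message>.*)"
)


def log_line_to_solr_doc(line, line_id):
    """Parse one log line into a dict of typed Solr fields, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # Assumes the log timestamp is already UTC; Solr dates end in "Z".
    parsed = datetime.strptime(match.group("timestamp"), "%Y-%m-%dT%H:%M:%S")
    return {
        "id": line_id,
        "timestamp_dt": parsed.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level_s": match.group("level"),
        "host_s": match.group("host"),
        "latency_i": int(match.group("latency")),
        "message_txt": match.group("message"),
    }


doc = log_line_to_solr_doc(
    "2014-07-15T18:30:02 WARN web01 245ms upstream timed out",
    "web01-000001",
)
print(doc)
```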

Real-time analytics dashboard
• Lucidworks SiLK allows users to create
simple dashboards through a GUI
• The Banana dashboard will issue queries
to Solr, rendering the received data in
tables, graphs, and other plots
• Users can also perform full-text search
across the data, allowing for extremely
fine granularity
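
A time-series panel can be backed by a Solr range-facet query, sketched below. This is not taken from Banana's source; it only shows the kind of query a dashboard might issue. The field names `timestamp_dt` and `message_txt` are hypothetical; the `facet.range` parameters and the date-math strings (`NOW-1HOUR`, `+1MINUTE`) are standard Solr syntax.

```python
from urllib.parse import urlencode


def build_histogram_params(field, start, end, gap, text_filter="*:*"):
    """Parameters for counting matching documents per time bucket."""
    return {
        "q": text_filter,  # a full-text filter narrows what is counted
        "rows": 0,         # only the facet counts are needed, not documents
        "wt": "json",
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    }


params = build_histogram_params(
    "timestamp_dt", "NOW-1HOUR", "NOW", "+1MINUTE",
    text_filter="message_txt:timeout",
)
query_string = urlencode(params)
print(query_string)
```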

End
Any questions?
Find me at:
yann.yu@lucidworks.com
@yawnyou
