2013 11-07 lsr-dublin_m_hausenblas_when solr is best

This document discusses when Solr is a good tool to use versus other options. It provides an overview of Solr in the big data ecosystem and the concept of polyglot persistence, where different data stores are used for different needs. Common use cases for Solr, such as search-based recommendations and log analysis, are described. A checklist is presented for determining whether Solr is a good fit based on factors like data volume, query characteristics, throughput needs, and data type. The document concludes by listing red flags where Solr may not be suitable, such as when strong consistency, transactions, or graphs are required.

USE CASE DIAGNOSIS: WHEN IS SOLR REALLY THE BEST TOOL?

Michael Hausenblas
Twitter: @mhausenblas

Chief Data Engineer EMEA, MapR Technologies

Agenda
• Solr in the Big Data ecosystem
• Polyglot Persistence
• Common (Big Data) use cases
• A checklist
• When not to use Solr …

[Figure: Big Data ecosystem diagram with processing and storage layers; recoverable labels: Apache Pig, Apache ZooKeeper]

$ ls -al

$ tail -f some.log
$ nc localhost 80

# print the third comma-separated field of every line containing a 2013 date
$ awk 'BEGIN { FS = "," }
/2013-[[:digit:]]+-[[:digit:]]+/ { print $3 }' sample.csv

tool box

one-size-fits-all

Polyglot Persistence: Backdrop
• Michael Stonebraker and Ugur Çetintemel (2005)
  "One Size Fits All": An Idea Whose Time Has Come and Gone
• Martin Fowler (2011)
  Polyglot Persistence [1]
• Eric Brewer (2012)
  Ricon Keynote: Advancing Distributed Systems [2]

[1] http://martinfowler.com/bliki/PolyglotPersistence.html
[2] https://speakerdeck.com/eric_brewer/ricon-2012-keynote

Polyglot Persistence: Key Points
• Use different datastores for different needs
• Can apply within an application or cross-enterprise
• Encapsulating data access yields loosely coupled components
• Find sweet spot between dev/op complexity and flexibility
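
The point about encapsulating data access can be sketched as follows. This is a minimal illustration, not code from the talk: `ProductStore`, `InMemoryKeyValueStore`, `InMemorySearchIndex`, and `Catalog` are hypothetical names, and the in-memory classes merely stand in for a real key-value store and a real search engine such as Solr.

```python
from typing import Dict, List, Protocol, Set


class ProductStore(Protocol):
    """Hypothetical write interface shared by every backing store."""

    def put(self, product_id: str, description: str) -> None: ...


class InMemoryKeyValueStore:
    """Stand-in for a lookup-oriented datastore (exact-key access)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def put(self, product_id: str, description: str) -> None:
        self._rows[product_id] = description

    def get(self, product_id: str) -> str:
        return self._rows[product_id]


class InMemorySearchIndex:
    """Stand-in for a search engine: a tiny inverted index over words."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = {}

    def put(self, product_id: str, description: str) -> None:
        for word in description.lower().split():
            self._postings.setdefault(word, set()).add(product_id)

    def search(self, word: str) -> List[str]:
        return sorted(self._postings.get(word.lower(), set()))


class Catalog:
    """Application-facing facade; callers never touch the stores directly."""

    def __init__(self, kv: InMemoryKeyValueStore, index: InMemorySearchIndex) -> None:
        self._kv = kv
        self._index = index

    def add(self, product_id: str, description: str) -> None:
        # Write to every store through the shared interface.
        stores: List[ProductStore] = [self._kv, self._index]
        for store in stores:
            store.put(product_id, description)

    def lookup(self, product_id: str) -> str:
        return self._kv.get(product_id)   # exact look-up: key-value store

    def find(self, word: str) -> List[str]:
        return self._index.search(word)   # keyword search: search index


catalog = Catalog(InMemoryKeyValueStore(), InMemorySearchIndex())
catalog.add("p1", "red running shoes")
catalog.add("p2", "blue running jacket")
print(catalog.lookup("p1"))
print(catalog.find("running"))
```

Because application code only sees `Catalog`, either stand-in could be swapped for a different datastore without touching the callers, which is the loose coupling the slide refers to.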

Where are we coming from?
• Keyword search
• Spellcheck & autosuggest
• Ranking
• Faceted search
• Spatial search
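
Several of these classic features map onto request parameters of Solr's HTTP search handler. The sketch below only builds a request URL and sends nothing. The host, the collection name `products`, and the field names `name` and `category` are assumptions made for illustration; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL of a Solr collection; adjust to your deployment.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def build_faceted_query(keywords: str, facet_field: str, rows: int = 10) -> str:
    """Return a select URL for a keyword search with one facet field."""
    params = [
        ("q", f"name:({keywords})"),   # keyword search on a hypothetical field
        ("rows", str(rows)),           # size of the returned result page
        ("wt", "json"),                # response format
        ("facet", "true"),             # switch faceting on
        ("facet.field", facet_field),  # count matches per value of this field
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = build_faceted_query("running shoes", "category")
print(url)

# Round-trip check: decode the query string back into its parameters.
decoded = parse_qs(urlsplit(url).query)
print(decoded["facet.field"])
```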

Search-based recommendation (credit card issuer)
• Given
  – customer purchase history
  – merchant designations
  – merchant special offers
• Goal
  – Improve existing recommender system
  – Throughput important

Analyze with MapReduce
[Figure: analysis pipeline; recoverable labels: complete history, co-occurrence (Mahout), item meta-data, Solr indexers, indexing, index shards]
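
The analysis step can be sketched in a few lines. This is a single-machine illustration with made-up data. The slide names Mahout on MapReduce for this step, and plain co-occurrence counting with a fixed threshold stands in here for whatever scoring the production job used. The `min_count` parameter and the output document layout (`id`, `indicators`) are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set


def cooccurrence_indicators(
    histories: Iterable[Set[str]], min_count: int = 2
) -> Dict[str, List[str]]:
    """For each item, list the items seen with it in at least min_count histories."""
    pair_counts: Counter = Counter()
    for items in histories:
        # Count each unordered pair once per customer history.
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1

    indicators: Dict[str, List[str]] = defaultdict(list)
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            indicators[a].append(b)
            indicators[b].append(a)
    return {item: sorted(others) for item, others in indicators.items()}


# Made-up purchase histories: one set of merchants per customer.
histories = [
    {"cafe", "bookshop", "cinema"},
    {"cafe", "bookshop"},
    {"cafe", "cinema", "gym"},
]
indicators = cooccurrence_indicators(histories)

# One document per item, ready to be handed to an indexer.
docs = [{"id": item, "indicators": others} for item, others in sorted(indicators.items())]
print(docs)
```

Each resulting document carries the item's co-occurring items as an ordinary multi-valued field, so the recommendation data sits in the same index as the item meta-data.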

Deploy with search system
[Figure: deployment diagram; recoverable labels: user history, Web tier, item meta-data, Solr indexers, search, index shards]
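
At serving time, the recommendation becomes an ordinary search: the user's recent history is turned into a query against the indicator field, and the ranked hits are the recommendations. The sketch below only builds the query string. The field name `indicators` is a hypothetical choice, and quoting each term as a phrase is a simplification rather than full Lucene query escaping.

```python
from typing import Iterable


def history_to_query(recent_items: Iterable[str], field: str = "indicators") -> str:
    """Build a Lucene-style OR query over the items in a user's history."""
    # Quote each item so multi-word names are treated as phrases.
    terms = ['"' + item.replace('"', '\\"') + '"' for item in recent_items]
    if not terms:
        return "*:*"  # no history: fall back to the match-all query
    return f"{field}:({' OR '.join(terms)})"


query = history_to_query(["cafe", "book shop"])
print(query)
```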

Log analysis
• Given
  – Receive 200,000+ log lines per second
• Goal
  – Want to do multi-field search
  – Want log lines to be searchable within 30 seconds of arrival
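
A quick back-of-the-envelope calculation shows what these two numbers imply. Only the rate and the delay come from the slide; treating 200,000 lines per second as a rate sustained around the clock is an assumption.

```python
LINES_PER_SECOND = 200_000      # from the slide (a lower bound: "200,000+")
MAX_DELAY_SECONDS = 30          # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Lines that can arrive before the oldest of them must be searchable.
lines_in_delay_window = LINES_PER_SECOND * MAX_DELAY_SECONDS

# Lines per day if the rate were sustained around the clock (assumption).
lines_per_day = LINES_PER_SECOND * SECONDS_PER_DAY

print(f"{lines_in_delay_window:,} lines per 30-second window")
print(f"{lines_per_day:,} lines per day")
```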

Data Ingestion and Indexing
[Figure: ingestion diagram; recoverable labels: incoming data, Kafka, Solr indexers, text analysis, real-time Solr indexer, raw documents, older index shards, live index shard, time-sharded Solr indexes]
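
Time-sharding can be illustrated with a small routing function: each log line goes to the shard that covers its timestamp, and only the newest shard receives writes. The one-hour shard width and the `logs_YYYYMMDD_HH` naming scheme are assumptions for this sketch; the slides state neither.

```python
from datetime import datetime, timezone


def shard_start(ts: datetime) -> datetime:
    """Round a UTC timestamp down to the start of its hourly shard."""
    return ts.replace(minute=0, second=0, microsecond=0)


def shard_name(ts: datetime) -> str:
    """Hypothetical naming scheme: one index shard per UTC hour."""
    return shard_start(ts.astimezone(timezone.utc)).strftime("logs_%Y%m%d_%H")


def is_live(ts: datetime, now: datetime) -> bool:
    """A line belongs to the live shard if its shard covers the current time."""
    return shard_name(ts) == shard_name(now)


now = datetime(2013, 11, 7, 14, 5, tzinfo=timezone.utc)
line_ts = datetime(2013, 11, 7, 13, 59, 30, tzinfo=timezone.utc)
print(shard_name(line_ts), is_live(line_ts, now))
print(shard_name(now), is_live(now, now))
```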

Search
[Figure: search diagram; recoverable labels: query, Solr search, Web tier, Solr indexers, raw documents, older index shards, live index shard]
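
On the query side, a time range determines which shards need to be searched. The sketch below assumes hourly shards named `logs_YYYYMMDD_HH` and builds the value of Solr's `shards` request parameter for a distributed query. The shard width, the naming scheme, and the host name `solr1:8983` are all placeholders rather than details from the talk.

```python
from datetime import datetime, timedelta, timezone
from typing import List

SHARD_WIDTH = timedelta(hours=1)  # assumed shard granularity


def shards_for_range(start: datetime, end: datetime) -> List[str]:
    """Names of all hourly shards overlapping the interval [start, end]."""
    cursor = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end_utc = end.astimezone(timezone.utc)
    names = []
    while cursor <= end_utc:
        names.append(cursor.strftime("logs_%Y%m%d_%H"))
        cursor += SHARD_WIDTH
    return names


def shards_param(names: List[str], host: str = "solr1:8983") -> str:
    """Comma-separated shard list, as expected by the `shards` parameter."""
    return ",".join(f"{host}/solr/{name}" for name in names)


start = datetime(2013, 11, 7, 12, 30, tzinfo=timezone.utc)
end = datetime(2013, 11, 7, 14, 5, tzinfo=timezone.utc)
names = shards_for_range(start, end)
print(names)
print(shards_param(names))
```

Queries limited to recent data touch only the live shard and a few older ones, which keeps the fan-out small even as the total number of shards grows.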

Questions you may want to ask …
• What is the volume of your data* (few GB? up to PB?)
• What are your query characteristics?
  – full scans
  – look-ups
  – multiple passes over large parts
  – continuous queries
• What’s (more) important: throughput or latency?

*) Note: as long as Moore's law still holds, these figures obviously change on a yearly if not monthly basis.

Key qualifiers
• Want exploratory interface rather than aggregates in a dashboard
• Data are sparse symbol sets like words or recommendation indicators
• Small-ish return sets are OK, especially if facets are good enough
• Near-real-time is good enough

Red Flags
• You need strong consistency?
• JOINS, anyone?
• Want (complex) transactions?
• OLTP, streaming (but: near-real-time)
• Graphs?

Remember: one size does not fit all. Tool belt approach!
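
The red flags, together with two of the key qualifiers, can be turned into a simple screening function. This is only a sketch of the talk's reasoning, not a decision procedure: the `Requirements` fields and the wording of the messages are assumptions derived from the slide bullets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    """Hypothetical description of a workload, one flag per slide bullet."""
    strong_consistency: bool = False
    joins: bool = False
    complex_transactions: bool = False
    oltp_or_true_streaming: bool = False
    graph_queries: bool = False
    near_real_time_ok: bool = True
    small_result_sets_ok: bool = True


def red_flags(req: Requirements) -> List[str]:
    """Return the red flags a workload raises; an empty list means none."""
    checks = [
        (req.strong_consistency, "needs strong consistency"),
        (req.joins, "needs joins"),
        (req.complex_transactions, "needs (complex) transactions"),
        (req.oltp_or_true_streaming, "is OLTP or true streaming"),
        (req.graph_queries, "needs graph queries"),
        (not req.near_real_time_ok, "near-real-time is not good enough"),
        (not req.small_result_sets_ok, "needs large result sets"),
    ]
    return [message for raised, message in checks if raised]


log_search = Requirements()
ledger = Requirements(strong_consistency=True, complex_transactions=True)
print(red_flags(log_search))
print(red_flags(ledger))
```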

Let’s stay in touch …
• Twitter: @mhausenblas, @MapR
• We’re hiring!

[Figure: world map; recoverable labels: MapR HQ (San Jose, US), MapR UK, MapR Nordics, MapR DACH, MapR SE & Benelux, MapR Hyderabad, MapR Japan, MapR Korea]
