Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damiano Braga & Praneet Mhatre, Trulia

Thoth
Real-time Solr Monitor
Search Analysis Engine
Damiano Braga
Sr. Software Engineer
dbraga@trulia.com
Praneet Mhatre
Data Mining Engineer
pmhatre@trulia.com

Overview
- What is Thoth?
- Data Collection and Thoth Core Indexing
- Thoth API & Thoth Dashboard
- Thoth Monitor
- Thoth ML: Prediction and Topic Modeling
- Special Thanks & Q/A
Demo

What is Thoth?
- Innovation project at Trulia
- Understand our search infrastructure without touching logs
- Troubleshoot search performance issues
- Designed as a modular system
- Set of tools that help gather information about, monitor, and understand a search infrastructure
- Open source project:
  - thoth
  - thoth-ml
  - thoth-api
  - thoth-dashboard
  - thoth-monitor
  - thoth-demo

Problem: Know Your Search Infrastructure
- Solr logs are a good source, but they sometimes hold only partial information
- Decentralized data (at least 1 log per search server)
- Log rotation
- Not searchable
If we could index all the information... Let’s use Solr!
- We can search on it
- We have some handy features for free: facets, stats etc
- It’s scalable

Thoth Document
1 Solr Request = 1 Thoth (Solr) Document
- Server Info: hostname, port number, core name, pool name
- Query Info: timestamp, actual query, QTime, hits, exception (if any)
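
The one-request-to-one-document mapping can be sketched as a small record type. This is only an illustration: the field names, the `search01` hostname, and the client-side UUID are assumptions, not the actual Thoth schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import uuid


@dataclass
class ThothDocument:
    """One Solr request captured as one document (field names are hypothetical)."""
    # Server info
    hostname: str
    port: int
    core_name: str
    pool_name: str
    # Query info
    timestamp: str                    # ISO-8601, e.g. "2014-09-16T18:00:02Z"
    query: str                        # the actual query string
    qtime_ms: int
    hits: int
    exception: Optional[str] = None   # None when the request succeeded

    def to_solr_fields(self) -> dict:
        """Flatten to a dict and attach a UUID key (generated here only for illustration)."""
        fields = asdict(self)
        fields["id"] = str(uuid.uuid4())
        return fields


doc = ThothDocument("search01", 8983, "collection", "default",
                    "2014-09-16T18:00:02Z", "q=*:*&rows=10", 42, 95)
fields = doc.to_solr_fields()
```

Keeping server info and query info in the same flat document is what makes it possible to facet and compute stats per server, core, or pool with ordinary Solr queries.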

Data Collection (1/2)
- Should be smooth: collection must not slow down search traffic
- We care about near real-time data
- We care about historical data
- Dataset is growing fast
- Interceptor on each search server
- We use a SolrComponent attached to a Request Handler
- Queue system (e.g. ActiveMQ) to facilitate and temporarily store messages
- Each search server has a manifest in the solrconfig.xml

Data Collection (2/2)
<requestHandler name="select" class="com.solr2activemq.SolrToActiveMQHandler">
  <arr name="last-components">
    <str>solr2activemq</str>
  </arr>
</requestHandler>
<searchComponent name="solr2activemq" class="com.solr2activemq.SolrToActiveMQComponent">
  <str name="activemq-broker-uri">localhost</str>
  <int name="activemq-broker-port">61616</int>
  <str name="activemq-broker-destination-type">queue</str>
  <str name="activemq-broker-destination-name">test-queue</str>
  <str name="solr-hostname">localhost</str>
  <int name="solr-port">8983</int>
  <str name="solr-poolname">default</str>
  <str name="solr-corename">collection</str>
  <int name="solr2activemq-buffer-size">1000</int>
  <int name="solr2activemq-dequeuing-buffer-polling">500</int>
  <int name="solr2activemq-check-activemq-polling">5000</int>
</searchComponent>
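
One way to read the `solr2activemq-buffer-size` and `solr2activemq-dequeuing-buffer-polling` settings is as a bounded in-memory buffer drained by a background thread, so the request path never waits on the broker. The sketch below illustrates that pattern only. It is not the component's actual implementation, the polling value is assumed to be in milliseconds, and the `send` callable is a hypothetical stand-in for an ActiveMQ producer.

```python
import queue
import threading


class BufferedSender:
    """Bounded in-memory buffer drained by a background thread (illustrative)."""

    def __init__(self, send, buffer_size=1000, poll_interval_s=0.5):
        self._send = send                               # hypothetical: ships one message
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._poll_interval_s = poll_interval_s
        self.dropped = 0
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def offer(self, message) -> bool:
        """Called on the request path: never blocks, drops the message when full."""
        try:
            self._buffer.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        # Keep draining until asked to stop AND the buffer is empty.
        while not self._stop.is_set() or not self._buffer.empty():
            try:
                message = self._buffer.get(timeout=self._poll_interval_s)
            except queue.Empty:
                continue
            self._send(message)

    def close(self):
        self._stop.set()
        self._worker.join()


sent = []
sender = BufferedSender(sent.append, buffer_size=1000, poll_interval_s=0.01)
for i in range(5):
    sender.offer({"qtime_ms": i})
sender.close()
```

Dropping messages when the buffer is full trades completeness of the monitoring data for a guarantee that search traffic is not slowed down.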

Sizing of Data
- Need for granular information for near real-time data
- Less granularity for historical data
Too much data = slow search, space problem
- Shrinking feature:
  - Create shrank document
  - Real-time core cleanup
- Shrinking time is configurable
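
The shrinking idea can be sketched as rolling granular per-request documents up into one summary per server, core, and time bucket. The slides do not say which aggregates Thoth keeps, so the 15-minute bucket, the field names, and the chosen metrics below are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def bucket_start(timestamp: str, bucket_minutes: int) -> str:
    """Round a UTC timestamp down to the start of its bucket (bucket_minutes <= 60)."""
    ts = datetime.strptime(timestamp, TS_FORMAT).replace(tzinfo=timezone.utc)
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    return ts.replace(minute=minute, second=0).strftime(TS_FORMAT)


def shrink(docs, bucket_minutes=15):
    """Aggregate per-request documents into one summary document per bucket."""
    groups = defaultdict(list)
    for d in docs:
        key = (d["hostname"], d["core_name"],
               bucket_start(d["timestamp"], bucket_minutes))
        groups[key].append(d)

    shrunk = []
    for (hostname, core, start), items in sorted(groups.items()):
        qtimes = [d["qtime_ms"] for d in items]
        shrunk.append({
            "hostname": hostname,
            "core_name": core,
            "bucket_start": start,
            "nqueries": len(items),
            "avg_qtime_ms": sum(qtimes) / len(qtimes),
            "max_qtime_ms": max(qtimes),
            "zero_hits": sum(1 for d in items if d["hits"] == 0),
            "exceptions": sum(1 for d in items if d.get("exception")),
        })
    return shrunk


docs = [
    {"hostname": "search01", "core_name": "collection",
     "timestamp": "2014-09-16T18:00:02Z", "qtime_ms": 40, "hits": 10},
    {"hostname": "search01", "core_name": "collection",
     "timestamp": "2014-09-16T18:07:30Z", "qtime_ms": 60, "hits": 0},
    {"hostname": "search01", "core_name": "collection",
     "timestamp": "2014-09-16T18:16:00Z", "qtime_ms": 120, "hits": 3},
]
summary = shrink(docs)
```

Once the summaries are indexed, the granular documents older than the configured shrinking time can be deleted from the real-time core.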

Thoth Index
- Solr 4.7
- Soft commit for near real-time search
- Soft commit maxTime set to 1s
- Auto commit set to 15s
- Update chain set to enforce UUID as PkID
- Use of SolrJ to index data and query

Thoth API
- Abstraction for Thoth index and Thoth data
- Read only REST-like API
- JSON response
- Written in Node.js to accommodate socket.io
Example:
thoth:3001/api/server/foo/core/bar/port/portbar/start/NOW-1DAY/end/NOW/count/nqueries
{"numFound":95,"values":[{"timestamp":"2014-09-16T18:00:02Z","value":45337},
{"timestamp":"2014-09-16T18:15:02Z","value":77325},
{"timestamp":"2014-09-16T19:00:02Z","value":115334}
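
The example URL reads as alternating key/value path segments, and the response as a list of timestamped values. A minimal client-side sketch under that reading follows. How thoth-api actually routes requests is not shown in the slides, so both helper functions are hypothetical, and the sample response is the one above trimmed to its three visible points and closed so that it parses.

```python
import json


def parse_api_path(path: str) -> dict:
    """Split '/api/k1/v1/k2/v2/...' into {'k1': 'v1', 'k2': 'v2', ...}."""
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != "api":
        raise ValueError("expected a path starting with /api")
    pairs = parts[1:]
    if len(pairs) % 2 != 0:
        raise ValueError("expected key/value pairs after /api")
    return dict(zip(pairs[0::2], pairs[1::2]))


def to_series(response_text: str) -> list:
    """Turn the JSON response into a list of (timestamp, value) tuples."""
    payload = json.loads(response_text)
    return [(point["timestamp"], point["value"]) for point in payload["values"]]


params = parse_api_path(
    "/api/server/foo/core/bar/port/portbar/start/NOW-1DAY/end/NOW/count/nqueries")

sample = ('{"numFound":95,"values":['
          '{"timestamp":"2014-09-16T18:00:02Z","value":45337},'
          '{"timestamp":"2014-09-16T18:15:02Z","value":77325},'
          '{"timestamp":"2014-09-16T19:00:02Z","value":115334}]}')
series = to_series(sample)
```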

Thoth Dashboard (1/5)
- Visual insight on Thoth data
- Useful graphs divided by server or pool
- Handy list of slow queries and exceptions
- Real-time view for server
- Selecting data based on time
- Shareable URLs (for the Ops, QA, and Release Engineering teams)

Thoth Monitor
- Continuously monitoring for metrics
- Stateless
- Alerting through email or Nagios
- Examples: QTime, number of zero hits, predictor model health
- Possibility to implement custom monitors
- Reuse StatsComponent [http://wiki.apache.org/solr/StatsComponent] if possible
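
A stateless QTime monitor built on StatsComponent might look like the sketch below: each run builds a fresh stats query and compares the returned mean to a threshold, keeping nothing between runs. The `stats` and `stats.field` parameters and the `mean` key are standard StatsComponent features. The `qtime` and `timestamp` field names, the threshold, and the alert text are assumptions.

```python
from urllib.parse import urlencode, parse_qs


def stats_query_params(field: str, start: str = "NOW-1DAY", end: str = "NOW") -> str:
    """Build the query string for a StatsComponent request over a time window."""
    return urlencode({
        "q": "*:*",
        "rows": 0,                                   # only the stats are needed
        "fq": f"timestamp:[{start} TO {end}]",       # hypothetical field name
        "stats": "true",
        "stats.field": field,
    })


def check_mean_qtime(stats: dict, threshold_ms: float):
    """Return an alert message if the mean exceeds the threshold, else None."""
    mean = stats["mean"]
    if mean > threshold_ms:
        return f"ALERT: mean QTime {mean:.1f} ms exceeds {threshold_ms:.1f} ms"
    return None


query_params = parse_qs(stats_query_params("qtime"))
alert = check_mean_qtime({"mean": 120.0, "count": 95}, threshold_ms=100)
quiet = check_mean_qtime({"mean": 50.0, "count": 95}, threshold_ms=100)
```

The returned message could then be handed to whichever alerting channel is configured, such as email or Nagios.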

Thoth ML
What can we do with all this data?
• Rich source of information
• Can we turn it into knowledge?
• How about machine learning?
1. Query time prediction
2. Query pattern recognition
3. Server sizing and resource allocation

1. Query Time Prediction (1/4)
• Goal: route each query to the appropriate pool (slow or fast)
• Look at query attributes
  • Query text
  • Start parameter
  • Facets, range queries, geospatial searches, etc.
• Train a supervised learning model
• Use the learned model to predict whether a query will be slow or fast
• H2O Machine Learning Library
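
The query attributes listed above can be turned into model features roughly as follows. This is a sketch under assumptions: the feature set actually used by Thoth ML is not given in the slides, and the example field names (`city`, `beds`, `price`) are invented. `facet`, `start`, `fq`, `{!geofilt}` and `{!bbox}` are standard Solr request syntax.

```python
from urllib.parse import parse_qs


def extract_features(query_string: str) -> dict:
    """Derive simple boolean/numeric features from a Solr query string."""
    params = parse_qs(query_string)
    q = " ".join(params.get("q", []))
    clauses = [q] + params.get("fq", [])
    return {
        "query_length": len(q),
        "start": int(params.get("start", ["0"])[0]),
        "has_facets": params.get("facet", ["false"])[0].lower() == "true",
        "has_range_query": any(" TO " in c for c in clauses),
        "has_geo": any("{!geofilt" in c or "{!bbox" in c for c in clauses),
    }


features = extract_features(
    "q=city%3Aseattle&start=200&facet=true&facet.field=beds"
    "&fq=price%3A%5B100000+TO+500000%5D")
```

Features of this kind are mostly boolean or categorical, which suits tree-based models.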

Challenges
• Imbalanced dataset
• Frequency of model training
• Type of model
• Minimal delay requirement

Challenges Addressed
• Imbalanced dataset
  • Stratified sampling
• Frequency of model training
  • Auto-identify relearning frequency
• Type of model
  • Boolean, categorical features -> tree based
  • High accuracy
  • Gradient Boosted Machine
• Minimal delay requirement
  • User pool queries: 45-50 ms
  • Prediction: 1-3 ms
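
Stratified sampling, named above as the answer to the imbalanced dataset, draws from each class separately so that the rare slow queries keep their share of the sample. The exact scheme used is not described in the slides. The sketch below shows the basic proportional variant on made-up data, labelling a query as slow above 100 ms.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample the same fraction from every class, keeping at least one row per class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def is_slow(row):
    return row["qtime_ms"] > 100


# Made-up data: 950 fast queries and 50 slow ones.
rows = ([{"qtime_ms": 20} for _ in range(950)] +
        [{"qtime_ms": 250} for _ in range(50)])
sample = stratified_sample(rows, is_slow, fraction=0.1)
slow_in_sample = sum(1 for row in sample if is_slow(row))
```

A plain random sample of the same size could by chance contain very few slow queries. Sampling per class removes that variance.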

• 1000 Gradient Boosted Trees
• Slow queries: > 100 ms (configurable)
• Experimental results
  • Training on ~3.1 million
  • Test on ~1.4 million
  • AUC: 0.94542
  • Accuracy: 0.9202223
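
For readers unfamiliar with the two metrics, the sketch below computes accuracy and AUC on a tiny made-up example. The scores, the 0.5 decision cut-off, and the query times are invented for illustration and are unrelated to the reported results. Only the 100 ms slow threshold comes from the slide.

```python
def accuracy(labels, predictions):
    """Fraction of queries whose predicted class matches the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random slow query outscores a random fast one (ties count half).

    Quadratic in the number of rows: fine for illustration, not for millions of queries.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


qtimes_ms = [20, 150, 80, 300, 95, 40]
labels = [q > 100 for q in qtimes_ms]            # True means slow
scores = [0.1, 0.8, 0.4, 0.35, 0.2, 0.05]        # made-up model outputs
predictions = [s >= 0.5 for s in scores]

acc = accuracy(labels, predictions)
area = auc(labels, scores)
```

AUC is the more informative of the two on an imbalanced dataset, because a model that always predicts "fast" can still reach a high accuracy.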

Query Time Prediction in Action (1/2)
Performance on real-time traffic at Trulia

Query Time Prediction in Action (2/2)
Performance on real-time traffic at Trulia

2. Query Pattern Recognition
• Exceptions, zero hit queries
• Analyze and find out why
• Probabilistic Topic Modeling
• Using MALLET open source toolkit

Future Direction
- Thoth ML improvements:
  • Predicting query time buckets
  • Regression vs. classification
  • Exceptions and zero hit query analysis
  • Sizing and resource allocation
- SolrCloud integration
- Dashboard integration with SolrCloud
- More standard metrics on Thoth Monitor
- More data collection (load, GC)

Contributors and Special Thanks
Damiano: dbraga@trulia.com
Praneet: pmhatre@trulia.com
Fork us on Github!
github.com/trulia/thoth
JD Cantrell (API, Dashboard)
Giulio Grillanda (API, Dashboard)
Rajendra Shioramwar (Core)
Ying Wang (Design)
Girish Gudla (Monitor)
Alexander Kanarsky
Alex Burmester
