Thoth 
Real-time Solr Monitor 
Search Analysis Engine 
Damiano Braga 
Sr. Software Engineer 
dbraga@trulia.com 
Praneet Mhatre 
Data Mining Engineer 
pmhatre@trulia.com
Overview 
- What is Thoth ? 
- Data Collection and Thoth Core Indexing 
- Thoth API & Thoth Dashboard 
- Thoth Monitor 
- Thoth ML : Prediction and Topic Modeling 
- Special Thanks & Q/A 
Demo
What is Thoth? 
- Innovation project at Trulia 
- Understand our search infrastructure without touching logs 
- Troubleshoot search performance issues 
- Designed as a modular system 
- Set of tools that can help gather info, monitor, understand a search infrastructure 
- Open source project : 
thoth 
thoth-ml 
thoth-api 
thoth-dashboard 
thoth-monitor 
thoth-demo
Problem: Know Your Search Infrastructure 
- Solr logs are a good source, but sometimes contain only partial information 
- Decentralized data (at least 1 log per search server) 
- Log rotation 
- Not searchable 
If only we could index all this information... Let’s use Solr! 
- We can search on it 
- We have some handy features for free: facets, stats etc 
- It’s scalable
Thoth Document 
1 Solr Request = 1 Thoth (Solr) Document 
Server Info 
hostname, port number, core name, pool name 
Query Info 
timestamp, actual query, qtime, hits, exception?
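The one-request-to-one-document mapping above can be sketched as a small record type. This is only an illustration: the field names and types are assumptions, not the actual Thoth schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ThothDocument:
    """One Solr request captured as one document (hypothetical field names)."""
    # Server info
    hostname: str
    port: int
    core_name: str
    pool_name: str
    # Query info
    timestamp: str                   # ISO 8601, e.g. "2014-09-16T18:00:02Z"
    query: str                       # the actual query as received by Solr
    qtime_ms: int                    # QTime reported by Solr
    hits: int
    exception: Optional[str] = None  # None when the request succeeded

    def is_exception(self) -> bool:
        return self.exception is not None


doc = ThothDocument(
    hostname="search01", port=8983, core_name="collection", pool_name="default",
    timestamp="2014-09-16T18:00:02Z", query="q=*:*&rows=10", qtime_ms=12, hits=95,
)
print(asdict(doc))
```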
Data Collection (1/2) 
- Should be smooth: no slowing down of search traffic 
- We care about near real-time data 
- We care about historical data 
- Dataset is growing fast 
- Interceptor on each search server 
- We use a Solr SearchComponent attached to a Request Handler 
- Queue System (e.g. ActiveMQ) to relay and temporarily store messages 
- Each search server has a manifest in the solrconfig.xml
Data Collection (2/2) 
<requestHandler name="select" class="com.solr2activemq.SolrToActiveMQHandler"> 
  <arr name="last-components"> 
    <str>solr2activemq</str> 
  </arr> 
</requestHandler> 
<searchComponent name="solr2activemq" class="com.solr2activemq.SolrToActiveMQComponent"> 
  <str name="activemq-broker-uri">localhost</str> 
  <int name="activemq-broker-port">61616</int> 
  <str name="activemq-broker-destination-type">queue</str> 
  <str name="activemq-broker-destination-name">test-queue</str> 
  <str name="solr-hostname">localhost</str> 
  <int name="solr-port">8983</int> 
  <str name="solr-poolname">default</str> 
  <str name="solr-corename">collection</str> 
  <int name="solr2activemq-buffer-size">1000</int> 
  <int name="solr2activemq-dequeuing-buffer-polling">500</int> 
  <int name="solr2activemq-check-activemq-polling">5000</int> 
</searchComponent>
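The "no traffic slowing down" goal and the `solr2activemq-buffer-size` setting suggest a bounded buffer between the request path and the broker. Below is a minimal sketch of that idea, not the real component: the class name, the drop-when-full policy, and the `send` callback are all assumptions.

```python
import queue


class RequestBuffer:
    """Bounded in-memory buffer between the request path and the message broker."""

    def __init__(self, buffer_size=1000):
        self._queue = queue.Queue(maxsize=buffer_size)
        self.dropped = 0

    def offer(self, message):
        """Called on the request path: never blocks, drops the message when full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, send):
        """Called by a background worker: forward everything buffered via send()."""
        sent = 0
        while True:
            try:
                message = self._queue.get_nowait()
            except queue.Empty:
                return sent
            send(message)
            sent += 1


buf = RequestBuffer(buffer_size=2)
accepted = [buf.offer(m) for m in ("a", "b", "c")]  # third offer finds the buffer full
out = []
sent = buf.drain(out.append)  # a worker would call this on a polling interval
print(accepted, out, buf.dropped)
```

Dropping instead of blocking trades completeness of the monitoring data for a guarantee that search latency is not affected when the broker is slow or down.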
Sizing of Data 
- Need for granular information for near real-time data 
- Less granularity for historical data 
Too much data = slow search, space problem 
- Shrinking feature: 
  - Create Shrank Document 
  - Real-time Core cleanup 
- Shrinking time is configurable
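One way to picture the shrinking step is as a roll-up of granular real-time documents into one summary document per server, core, and time bucket. The sketch below assumes a 15-minute bucket and hypothetical field names; the slides do not specify how Thoth builds a shrank document.

```python
from collections import defaultdict
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def shrink(docs, bucket_seconds=900):
    """Aggregate per-request documents into one summary per (host, core, bucket)."""
    buckets = defaultdict(list)
    for d in docs:
        ts = datetime.strptime(d["timestamp"], TS_FORMAT).replace(tzinfo=timezone.utc)
        epoch = int(ts.timestamp())
        bucket_start = epoch - epoch % bucket_seconds
        buckets[(d["hostname"], d["core"], bucket_start)].append(d)

    shrunk = []
    for (hostname, core, bucket_start), group in sorted(buckets.items()):
        start = datetime.fromtimestamp(bucket_start, tz=timezone.utc)
        shrunk.append({
            "hostname": hostname,
            "core": core,
            "bucket_start": start.strftime(TS_FORMAT),
            "nqueries": len(group),
            "avg_qtime": sum(g["qtime"] for g in group) / len(group),
            "zero_hits": sum(1 for g in group if g["hits"] == 0),
        })
    return shrunk


docs = [
    {"hostname": "s1", "core": "c", "timestamp": "2014-09-16T18:00:02Z", "qtime": 10, "hits": 5},
    {"hostname": "s1", "core": "c", "timestamp": "2014-09-16T18:05:00Z", "qtime": 30, "hits": 0},
    {"hostname": "s1", "core": "c", "timestamp": "2014-09-16T18:16:00Z", "qtime": 50, "hits": 2},
]
summary = shrink(docs)
print(summary)
```

After the summaries are indexed, the granular documents older than the configured shrinking time could be deleted from the real-time core.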
Thoth Index 
- Solr 4.7 
- Soft commit for near real-time search 
- Soft commit maxTime set to 1s 
- Auto commit set to 15s 
- Update chain set to enforce UUID as PkID 
- Use of Solrj to index data and query
Thoth API 
- Abstraction for Thoth index and Thoth data 
- Read only REST-like API 
- JSON response 
- Written in Node.js to accommodate socket.io 
Example: 
thoth:3001/api/server/foo/core/bar/port/portbar/start/NOW-1DAY/end/NOW/count/nqueries 
{"numFound":95,"values":[{"timestamp":"2014-09-16T18:00:02Z","value":45337}, 
{"timestamp":"2014-09-16T18:15:02Z","value":77325}, 
{"timestamp":"2014-09-16T18:30:02Z","value":109523}, 
{"timestamp":"2014-09-16T18:45:02Z","value":112279}, 
{"timestamp":"2014-09-16T19:00:02Z","value":115334}
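A client for this endpoint could look roughly like the sketch below. It only builds the path following the pattern in the example and parses a response of the shape shown; the helper names are hypothetical, and the sample body is a trimmed, well-formed stand-in using two of the values above.

```python
import json


def count_url(base, server, core, port, start, end, attribute):
    """Build a count endpoint path following the pattern in the example above."""
    return (f"{base}/api/server/{server}/core/{core}/port/{port}"
            f"/start/{start}/end/{end}/count/{attribute}")


def parse_series(body):
    """Turn a JSON response body into (timestamp, value) pairs."""
    payload = json.loads(body)
    return [(point["timestamp"], point["value"]) for point in payload["values"]]


# Trimmed sample for illustration only, not a full API response.
SAMPLE_BODY = (
    '{"numFound": 2, "values": ['
    '{"timestamp": "2014-09-16T18:00:02Z", "value": 45337}, '
    '{"timestamp": "2014-09-16T18:15:02Z", "value": 77325}]}'
)

url = count_url("thoth:3001", "foo", "bar", "portbar", "NOW-1DAY", "NOW", "nqueries")
series = parse_series(SAMPLE_BODY)
print(url)
print(series)
```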
Thoth Dashboard (1/5) 
- Visual insight on Thoth data 
- Useful graphs divided by server or pool 
- Handy list of slow queries and exceptions 
- Real-time view for server 
- Selecting data based on time 
- Shareable URLs (for Ops, QA, and Release Engineering teams)
Thoth Dashboard (2/5)
Thoth Dashboard (3/5)
Thoth Dashboard (4/5)
Thoth Dashboard (5/5)
Thoth Monitor 
- Continuously monitoring for metrics 
- Stateless 
- Alerting through email or Nagios 
- Examples: QTime, Number of Zero hits, Predictor Model Health 
- Possibility to implement custom monitors 
- Reuse StatsComponent [http://wiki.apache.org/solr/StatsComponent] if possible
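A stateless QTime monitor built on the StatsComponent might look like the sketch below. The request parameters (`stats=true`, `stats.field`) and the `stats.stats_fields` response layout are standard Solr; the field names `qtime_i` and `timestamp_dt`, the time window, the threshold, and the alert text are assumptions.

```python
def stats_params(field, time_field="timestamp_dt", start="NOW-15MINUTES", end="NOW"):
    """Solr query parameters asking the StatsComponent for stats on one field."""
    return {
        "q": "*:*",
        "rows": 0,  # only the stats are needed, not the documents
        "fq": f"{time_field}:[{start} TO {end}]",
        "stats": "true",
        "stats.field": field,
        "wt": "json",
    }


def check_mean_qtime(solr_response, field, threshold_ms):
    """Return an alert message if the mean exceeds the threshold, else None."""
    stats = solr_response["stats"]["stats_fields"].get(field)
    if not stats or not stats.get("count"):
        return None  # no traffic in the window, nothing to evaluate
    mean = stats["mean"]
    if mean > threshold_ms:
        return (f"ALERT: mean {field} is {mean:.1f} ms "
                f"(threshold {threshold_ms} ms, {stats['count']} queries)")
    return None


# Hand-written response fragment for illustration.
response = {"stats": {"stats_fields": {"qtime_i": {"count": 1200, "mean": 250.0}}}}
alert = check_mean_qtime(response, "qtime_i", threshold_ms=100)
print(stats_params("qtime_i"))
print(alert)
```

Because every check recomputes its result from the index, the monitor keeps no state between runs and can be restarted or run from any host.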
Thoth ML 
What can we do with all this data? 
• Rich source of information 
• Can we turn it into knowledge? 
• How about machine learning? 
1. Query time prediction 
2. Query pattern recognition 
3. Server sizing and resource allocation
1. Query Time Prediction (1/4) 
• Goal: route each query to the appropriate pool (slow or fast) 
• Look at query attributes 
  • Query text 
  • Start parameter 
  • Facets, range queries, geospatial searches etc. 
• Train a supervised learning model 
• Use the learned model to predict whether a query will be slow or fast 
• H2O Machine Learning Library
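The query attributes listed above could be turned into model features along these lines. This is a sketch, not the feature set Thoth uses: the feature names are hypothetical, and the range and geospatial checks are simple string heuristics based on standard Solr syntax (`[a TO b]`, `{!geofilt}`, `{!bbox}`).

```python
from urllib.parse import parse_qs


def extract_features(query_string):
    """Derive boolean and numeric features from a Solr request's parameters."""
    params = parse_qs(query_string)
    q = params.get("q", [""])[0]
    clauses = [q] + params.get("fq", [])
    return {
        "query_length": len(q),
        "start": int(params.get("start", ["0"])[0]),
        "has_facet": params.get("facet", ["false"])[0].lower() == "true",
        "has_range_query": any(" TO " in c for c in clauses),
        "has_geospatial": any("{!geofilt" in c or "{!bbox" in c for c in clauses),
    }


features = extract_features(
    "q=city:seattle&start=20&facet=true"
    "&fq=price:[100000 TO 500000]"
    "&fq={!geofilt sfield=loc pt=47.6,-122.3 d=5}"
)
print(features)
```

Boolean and small categorical features like these are a natural fit for the tree-based models discussed in the following slides.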
1. Query Time Prediction (2/4) 
Challenges 
• Imbalanced dataset 
• Frequency of model training 
• Type of model 
• Minimal delay requirement
1. Query Time Prediction (3/4) 
Challenges Addressed 
• Imbalanced dataset 
  • Stratified sampling 
• Frequency of model training 
  • Auto identify relearning frequency 
• Type of model 
  • Boolean, categorical features -> Tree based 
  • High accuracy 
  • Gradient Boosted Machine 
• Minimal delay requirement 
  • User pool queries: 45-50 ms 
  • Prediction: 1-3 ms
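Stratified sampling for the imbalanced dataset can be sketched as follows. The slides do not say how the strata are allocated; this version draws the same number of rows from each class (equal allocation), which is an assumption, as are the function and parameter names.

```python
import random


def stratified_sample(rows, label_of, per_class, seed=0):
    """Sample up to per_class rows from each label group, then shuffle."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(sample)
    return sample


# Toy data: 95 fast queries and 5 slow ones (slow means qtime > 100 ms).
rows = [{"qtime": 10} for _ in range(95)] + [{"qtime": 500} for _ in range(5)]
is_slow = lambda row: row["qtime"] > 100
balanced = stratified_sample(rows, is_slow, per_class=5)
print(len(balanced), sum(is_slow(r) for r in balanced))
```

Without this step a model trained on the raw data could reach high accuracy by always predicting "fast", since slow queries are rare.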
1. Query Time Prediction (4/4) 
• 1000 Gradient Boosted Trees 
• Slow queries = QTime > 100 ms (configurable) 
• Experimental Results 
  • Training on ~3.1 million 
  • Test on ~1.4 million 
  • AUC: 0.94542 
  • Accuracy: 0.9202223
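For readers unfamiliar with the two metrics reported above, the sketch below labels queries with the 100 ms threshold and computes accuracy and AUC in plain Python. The data is a five-row toy example, not Thoth's, and the 0.5 decision cutoff is an assumption.

```python
def label_slow(qtime_ms, threshold_ms=100):
    """A query is 'slow' when its QTime exceeds the configurable threshold."""
    return qtime_ms > threshold_ms


def accuracy(labels, scores, cutoff=0.5):
    """Fraction of queries whose thresholded score matches the true label."""
    correct = sum((s >= cutoff) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random slow query scores above a random fast one."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


qtimes = [20, 45, 150, 80, 300]        # observed QTime in ms (toy values)
scores = [0.1, 0.4, 0.35, 0.2, 0.9]    # model's predicted probability of "slow"
labels = [label_slow(q) for q in qtimes]
print(accuracy(labels, scores), auc(labels, scores))
```

AUC is independent of the decision cutoff, which makes it a useful companion to accuracy on imbalanced data.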
Query Time Prediction in Action (1/2) 
Performance on real time traffic at Trulia
Query Time Prediction in Action (2/2) 
Performance on real time traffic at Trulia
2. Query Pattern Recognition 
• Exceptions, zero hit queries 
• Analyze and find out why 
• Probabilistic Topic Modeling 
• Using MALLET open source toolkit
Topic Modeling Flow
Topics With Keywords
Future Direction 
- Thoth ML improvements: 
  • Predicting query time buckets 
  • Regression vs. classification 
  • Exceptions and zero hit query analysis 
  • Sizing and resource allocation 
- Solr Cloud integration 
- Dashboard integration with Solr cloud 
- More standard metrics on Thoth Monitor 
- More data collection (load, GC)
Contributors and Special Thanks 
Damiano : dbraga@trulia.com 
Praneet: pmhatre@trulia.com 
Fork us on Github! 
github.com/trulia/thoth 
JD Cantrell (API, Dashboard) 
Giulio Grillanda (API, Dashboard) 
Rajendra Shioramwar (Core) 
Ying Wang (Design) 
Girish Gudla (Monitor) 
Alexander Kanarsky 
Alex Burmester

Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damiano Braga & Praneet Mhatre, Trulia

  • 1.
  • 2. Thoth Real-time Solr Monitor Search Analysis Engine Damiano Braga Sr. Software Engineer dbraga@trulia.com Praneet Mhatre Data Mining Engineer pmhatre@trulia.com
  • 3. Overview - What is Thoth ? - Data Collection and Thoth Core Indexing - Thoth API & Thoth Dashboard - Thoth Monitor - Thoth ML : Prediction and Topic Modeling - Special Thanks & Q/A Demo
  • 4. What is Thoth? - Innovation project at Trulia - Understand our search infrastructure without touching logs - Troubleshoot search performance issues - Designed as a modular system - Set of tools that can help gather info, monitor, understand a search infrastructure - Open source project : thoth thoth-ml thoth-api thoth-dashboard thoth-monitor thoth-demo
  • 5. Problem: Know Your Search Infrastructure - Solr logs are a good source. Sometimes partial information - Decentralized data (at least 1 log per search server) - Log rotation - Not searchable If we could index all the information .. Let’s use Solr ! - We can search on it - We have some handy features for free: facets, stats etc - It’s scalable
  • 6. Thoth Document 1 Solr Request = 1 Thoth (Solr) Document Server Info hostname, port number, core name, pool name Query Info timestamp, actual query, qtime, hits, exception?
  • 7. Data Collection (1/2) - Should be smooth: no slowing down of traffic. - We care about near real-time data - We care about historical data - Dataset is growing fast - Interceptor on each search server - We use a SolrComponent attached to a Request Handler - Queue system (e.g. ActiveMQ) to facilitate and temporarily store messages - Each search server has a manifest in the solrconfig.xml
  • 8. Data Collection (2/2)
    <requestHandler name="select" class="com.solr2activemq.SolrToActiveMQHandler">
      <arr name="last-components">
        <str>solr2activemq</str>
      </arr>
    </requestHandler>
    <searchComponent name="solr2activemq" class="com.solr2activemq.SolrToActiveMQComponent">
      <str name="activemq-broker-uri">localhost</str>
      <int name="activemq-broker-port">61616</int>
      <str name="activemq-broker-destination-type">queue</str>
      <str name="activemq-broker-destination-name">test-queue</str>
      <str name="solr-hostname">localhost</str>
      <int name="solr-port">8983</int>
      <str name="solr-poolname">default</str>
      <str name="solr-corename">collection</str>
      <int name="solr2activemq-buffer-size">1000</int>
      <int name="solr2activemq-dequeuing-buffer-polling">500</int>
      <int name="solr2activemq-check-activemq-polling">5000</int>
    </searchComponent>
  • 9. Sizing of Data - Need for granular information for near real-time data - Less granularity for historical data - Too much data = slow search, space problem - Shrinking feature: Create Shrank Document, Real-time Core cleanup - Shrinking time is configurable
  • 10. Thoth Index - Solr 4.7 - Soft commit for near real-time search - Soft commit maxTime set to 1s - Auto commit set to 15s - Update chain set to enforce UUID as PkID - Use of Solrj to index data and query
  • 11. Thoth API - Abstraction for Thoth index and Thoth data - Read-only REST-like API - JSON response - Written in Node.js to accommodate socket.io Example: thoth:3001/api/server/foo/core/bar/port/portbar/start/NOW-1DAY/end/NOW/count/nqueries {"numFound":95,"values":[{"timestamp":"2014-09-16T18:00:02Z","value":45337}, {"timestamp":"2014-09-16T18:15:02Z","value":77325}, {"timestamp":"2014-09-16T18:30:02Z","value":109523}, {"timestamp":"2014-09-16T18:45:02Z","value":112279}, {"timestamp":"2014-09-16T19:00:02Z","value":115334}
  • 12. Thoth Dashboard (1/5) - Visual insight on Thoth data - Useful graphs divided by server or pool - Handy list of slow queries and exceptions - Real-time view per server - Selecting data based on time - Shareable URLs (for the Ops, QA, and Release Engineering teams)
  • 17. Thoth Monitor - Continuously monitoring for metrics - Stateless - Alerting through email or Nagios - Examples: QTime, Number of Zero hits, Predictor Model Health - Possibility to implement custom monitors - Reuse StatsComponent [http://wiki.apache.org/solr/StatsComponent] if possible
  • 18. Thoth ML What can we do with all this data? • Rich source of information • Can we turn it into knowledge? • How about machine learning? 1. Query time prediction 2. Query pattern recognition 3. Server sizing and resource allocation
  • 19. 1. Query Time Prediction (1/4) • Goal: appropriately route queries to slow/fast pool • Look at query attributes: query text; start parameter; facets, range queries, geospatial searches, etc. • Train a supervised learning model • Use learned model to predict if a query will be slow vs. fast • H2O Machine Learning Library
  • 20. 1. Query Time Prediction (2/4) Challenges • Imbalanced dataset • Frequency of model training • Type of model • Minimal delay requirement
  • 21. 1. Query Time Prediction (3/4) Challenges Addressed • Imbalanced dataset: stratified sampling • Frequency of model training: auto-identify relearning frequency • Type of model: Boolean, categorical features -> tree based; high accuracy; Gradient Boosted Machine • Minimal delay requirement: user pool queries 45-50 ms; prediction 1-3 ms
  • 22. 1. Query Time Prediction (4/4) • 1000 Gradient Boosted Trees • Slow queries: > 100 ms (configurable) • Experimental Results • Training on ~3.1 million • Test on ~1.4 million • AUC: 0.94542 • Accuracy: 0.9202223
  • 23. Query Time Prediction in Action (1/2) Performance on real time traffic at Trulia
  • 24. Query Time Prediction in Action (2/2) Performance on real time traffic at Trulia
  • 25. 2. Query Pattern Recognition • Exceptions, zero hit queries • Analyze and find out why • Probabilistic Topic Modeling • Using MALLET open source toolkit
  • 28. Future Direction - Thoth ML improvements: • Predicting query time buckets • Regression vs. classification • Exceptions and zero hit query analysis • Sizing and resource allocation - Solr Cloud integration - Dashboard integration with Solr Cloud - More standard metrics on Thoth Monitor - More data collection (load, GC)
  • 29. Contributors and Special Thanks Damiano : dbraga@trulia.com Praneet: pmhatre@trulia.com Fork us on Github! github.com/trulia/thoth JD Cantrell ( API, Dashboard) Giulio Grillanda (API, Dashboard) Rajendra Shioramwar (Core) Ying Wang (Design) Girish Gudla (Monitor) Alexander Kanarsky Alex Burmester
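
Slide 6 says that one Solr request becomes one Thoth document carrying server info and query info. A minimal Python sketch of such a record follows. The field names, the types, and the `to_solr_fields` helper are assumptions made for illustration; the slides list the concepts (hostname, port number, core name, pool name, timestamp, actual query, qtime, hits, exception) but not the actual schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ThothDocument:
    """One record per Solr request (hypothetical field names)."""
    # Server info
    hostname: str
    port: int
    core_name: str
    pool_name: str
    # Query info
    timestamp: str                   # ISO 8601, e.g. "2014-09-16T18:00:02Z"
    query: str                       # the actual query as received
    qtime_ms: int                    # query time in milliseconds
    hits: int                        # number of matching documents
    exception: Optional[str] = None  # None when the request succeeded


def to_solr_fields(doc: ThothDocument) -> dict:
    """Flatten the record into a dict and add a boolean exception flag."""
    fields = asdict(doc)
    fields["has_exception"] = doc.exception is not None
    return fields


example = ThothDocument("search01", 8983, "collection", "default",
                        "2014-09-16T18:00:02Z", "q=*:*", 42, 95)
example_fields = to_solr_fields(example)
```

Keeping the exception as an optional string plus a derived boolean is one way to make "exception?" easy to facet on; the real index may model it differently.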
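
The example on slide 11 suggests a path-style URL and a JSON body with `numFound` and a list of `values`. The sketch below builds such a URL and parses a response of that shape. The segment order is copied from the slide's example; the function names are hypothetical, the `http://` scheme is assumed, and the sample payload is shortened to the first two points shown on the slide.

```python
import json
from datetime import datetime, timezone
from urllib.parse import quote


def build_count_url(base_url, server, core, port, start, end, metric="nqueries"):
    """Assemble a URL following the path layout shown in the slide's example."""
    segments = ["api", "server", server, "core", core, "port", str(port),
                "start", start, "end", end, "count", metric]
    path = "/".join(quote(segment, safe="") for segment in segments)
    return base_url.rstrip("/") + "/" + path


def parse_values(payload):
    """Turn the JSON body into a list of (UTC datetime, integer value) pairs."""
    body = json.loads(payload)
    points = []
    for item in body["values"]:
        stamp = datetime.strptime(item["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        points.append((stamp.replace(tzinfo=timezone.utc), int(item["value"])))
    return points


# Shortened sample: the slide reports numFound=95 but shows only a few values.
SAMPLE_PAYLOAD = (
    '{"numFound":95,"values":['
    '{"timestamp":"2014-09-16T18:00:02Z","value":45337},'
    '{"timestamp":"2014-09-16T18:15:02Z","value":77325}]}'
)

url = build_count_url("http://thoth:3001", "foo", "bar", "portbar",
                      "NOW-1DAY", "NOW")
points = parse_values(SAMPLE_PAYLOAD)
```

Because the API is described as read-only, a client like this only ever needs GET requests; the HTTP call itself is left out so the sketch runs without a Thoth instance.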
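
Slide 17 describes Thoth Monitor as stateless, with QTime and the number of zero-hit queries among its example metrics. One way to picture a stateless check is a pure function over a window of documents, as sketched below. The threshold values, the `qtime` and `hits` keys, and the alert wording are assumptions; the slides do not say how the real monitors are configured, and the actual alert delivery (email or Nagios) is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical alert limits for one monitoring window."""
    max_mean_qtime_ms: float = 100.0
    max_zero_hit_ratio: float = 0.05


def check_window(docs: Iterable[Mapping], thresholds: Thresholds = Thresholds()) -> List[str]:
    """Return alert messages for one window; keeps no state between calls."""
    docs = list(docs)
    if not docs:
        return []  # nothing to evaluate in an empty window
    mean_qtime = sum(d["qtime"] for d in docs) / len(docs)
    zero_hit_ratio = sum(1 for d in docs if d["hits"] == 0) / len(docs)
    alerts = []
    if mean_qtime > thresholds.max_mean_qtime_ms:
        alerts.append(f"mean QTime {mean_qtime:.1f} ms exceeds "
                      f"{thresholds.max_mean_qtime_ms:.1f} ms")
    if zero_hit_ratio > thresholds.max_zero_hit_ratio:
        alerts.append(f"zero-hit ratio {zero_hit_ratio:.2%} exceeds "
                      f"{thresholds.max_zero_hit_ratio:.2%}")
    return alerts


window = [{"qtime": 50, "hits": 10}, {"qtime": 250, "hits": 0}]
alerts = check_window(window)
```

Since the result depends only on the window passed in, the same check can be rerun at any time or on any host, which is the practical benefit of a stateless monitor. The slide also suggests reusing Solr's StatsComponent where possible, which would move the mean computation into Solr itself.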
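
Slides 19 and 22 mention the query attributes fed to the model (query text, start parameter, facets, range queries, geospatial searches) and the default definition of a slow query (more than 100 ms, configurable). The sketch below shows how such attributes could be turned into Boolean and numeric features with a label. The specific features and the string checks are illustrative assumptions, not the feature set Thoth ML uses, and no H2O calls are shown.

```python
from urllib.parse import parse_qs

SLOW_THRESHOLD_MS = 100  # default from the slides; described there as configurable


def extract_features(query_string):
    """Derive simple features from a raw Solr query string (hypothetical set)."""
    params = parse_qs(query_string, keep_blank_values=True)
    q = params.get("q", [""])[0]
    fq = " ".join(params.get("fq", []))
    return {
        "q_length": len(q),
        "start": int(params.get("start", ["0"])[0]),
        "has_facet": params.get("facet", ["false"])[0].lower() == "true",
        # Solr range syntax looks like field:[low TO high]
        "has_range": " TO " in q or " TO " in fq,
        # geofilt and bbox are Solr spatial query parsers
        "has_geo": "{!geofilt" in fq or "{!bbox" in fq,
    }


def label_slow(qtime_ms, threshold_ms=SLOW_THRESHOLD_MS):
    """A query is labeled slow when its QTime exceeds the threshold."""
    return qtime_ms > threshold_ms


features = extract_features(
    "q=price:[100000 TO 200000]&start=20&facet=true"
    "&fq={!geofilt sfield=loc pt=37.7,-122.4 d=5}"
)
```

Features of this kind are mostly Boolean or categorical, which is consistent with the slides' stated reason for choosing a tree-based model.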
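
Slide 21 names stratified sampling as the answer to the imbalanced dataset (slow queries are much rarer than fast ones). The slides do not say exactly how it is applied, so the following is only a generic sketch of the idea: split each class separately so the train and test sets keep the same slow/fast proportions. The function name, the 0.3 test fraction, and the row layout are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.3, seed=0):
    """Split rows into (train, test) while preserving each label's share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label, key=repr):  # fixed order for reproducibility
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# 90 fast queries and 10 slow ones, labeled with the >100 ms rule from slide 22.
rows = [{"id": i, "qtime": 20} for i in range(90)]
rows += [{"id": 90 + i, "qtime": 300} for i in range(10)]
train, test = stratified_split(rows, lambda r: r["qtime"] > 100)
```

With a plain random split, a small test set could end up with very few slow queries; splitting per class avoids that.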