Elasticsearch at Automattic

Tuesday, February 25, 14
Greg Ichneumon Brown
Data Wrangler at Automattic
http://gibrown.wordpress.com
@gregibrown
greg@automattic.com

1 Billion Monthly Uniques

Elasticsearch Deployments
Internal Search
- 216 Internal Blogs - 750k docs [3 GB]
Support Documents
- KNN Link Prediction - 1.7m docs [14 GB]
Polldaddy
- Word Clouds/Freq Response - 39m docs [9 GB]
WordPress.com VIP Search
- KFF.org - 18m docs [99 MB]
- NY Post - 600k docs [2.3 GB]
WordPress.com - ~800m docs [4 TB]
- Related Posts - 48 mil reqs/day
- search.wordpress.com - 3 mil reqs/day

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

Related Posts

Search within just the one blog

WordPress.com
Total Elasticsearch Operations

Operation          Ops/Day
Routed Queries     23 mil
Global Queries     2 mil
Docs Indexed       13 mil
Docs Updated       10 mil
Docs Deleted       2.5 mil
Delete By Query    250k

Global Cluster
DC1: 1 Master, 14 Data
DC2: 1 Master, 14 Data
DC3: 1 Master, 14 Data

Our Secret To Scaling
Routed Queries
All Posts for each Blog are on the same Shard
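A minimal sketch of why routing keeps a blog's posts together, assuming the shard is chosen by hashing a routing key modulo the shard count. The function `shard_for`, the use of `zlib.crc32`, and the choice of the blog ID as routing key are illustrative assumptions; Elasticsearch uses its own hash function internally.

```python
import zlib

NUM_SHARDS = 25  # shards per index, as stated in the deck


def shard_for(routing_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing key to a shard number (stand-in hash, not Elasticsearch's)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


# Hypothetical posts as (blog_id, post_id). We route on blog_id, not post_id.
posts = [(42, 1), (42, 2), (42, 3), (7, 1), (7, 2)]

shards_by_blog = {}
for blog_id, post_id in posts:
    shards_by_blog.setdefault(blog_id, set()).add(shard_for(str(blog_id)))

# Each blog maps to exactly one shard, so a routed query only touches that shard.
for blog_id, shards in sorted(shards_by_blog.items()):
    print(blog_id, sorted(shards))
```

Because the routing key is the same for every post of a blog, a Related Posts query for that blog can be sent to a single shard instead of all of them.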

Global Index

7 Indices
10 mil Blogs per Index
25 Shards per Index
175 Shards Total
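The arithmetic on this slide can be checked with a short sketch. It assumes blogs are assigned to indices by contiguous ID range, which the slide does not state, and the index name `global-N` is hypothetical.

```python
BLOGS_PER_INDEX = 10_000_000  # "10 mil Blogs per Index"
NUM_INDICES = 7
SHARDS_PER_INDEX = 25


def index_for_blog(blog_id: int) -> str:
    """Pick the index for a blog, assuming contiguous ID ranges per index."""
    slot = blog_id // BLOGS_PER_INDEX
    if not 0 <= slot < NUM_INDICES:
        raise ValueError(f"blog_id {blog_id} is outside the {NUM_INDICES} indices")
    return f"global-{slot}"  # hypothetical index name


total_shards = NUM_INDICES * SHARDS_PER_INDEX
print(total_shards)                # 175
print(index_for_blog(12_345_678))  # global-1
```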

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

20% Improvements Don’t Solve Scaling Problems

Indexing

Entangling Elasticsearch with Existing Systems

Bulk Indexing 1.0
44 Days to Index all Posts (estimated)

Bulk Indexing Problems
- Overhead: Spent too much time starting indexing jobs. WordPress.com has 500 mil MySQL tables.
- High DB Load: Corner cases. Blogs with 1+ mil followers.
- High DB Load: Indexing sequentially doesn’t spread the load.
- High DB Load: Heavy load on archive DBs.
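One way to read "indexing sequentially doesn't spread the load" is that consecutive blogs tend to live on the same database host. The sketch below interleaves jobs round-robin across hosts. The function `db_host_for` and the 1000-blogs-per-host layout are hypothetical; the talk does not describe how blogs map to databases or how the load was actually spread.

```python
from collections import defaultdict, deque


def interleave_by_host(blog_ids, db_host_for):
    """Reorder blog IDs so consecutive jobs hit different DB hosts where possible."""
    queues = defaultdict(deque)
    for blog_id in blog_ids:
        queues[db_host_for(blog_id)].append(blog_id)

    ordered = []
    while queues:
        # Take one blog from each host in turn; drop hosts that run out.
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered


def db_host_for(blog_id):
    """Hypothetical layout: 1000 consecutive blog IDs per database host."""
    return blog_id // 1000


print(interleave_by_host([1, 2, 3, 1001, 1002, 2001], db_host_for))
# [1, 1001, 2001, 2, 1002, 3]
```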

Bulk Indexing Today
12.0?
4 Days to Index all Posts (running right now)

Real Time Indexing
The Hardest Part!

Real Time Goals
1) Eventually Consistent
2) Minimize Bulk Re-indexing
3) Normally updated < 1 minute

Bulk reindexed 3 times in 5 months: one intentional, two during system upgrades.

Stuff Fails
1) Humans
2) Hardware
3) Elasticsearch (steady improvements)
Combinations of the above.

Hardware Problems
1) Detect and Track Down Servers
2) Prioritize Queries over Indexing
3) Throttle Indexing Jobs
- any issues: block bulk changes to blogs
- >10 min: block doc updates
- >20 min: block all indexing
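The throttling thresholds above can be sketched as a small decision function. The operation names and the `issue_minutes` parameter are made up for illustration; only the three thresholds come from the slide.

```python
def blocked_operations(issue_minutes):
    """Return the indexing operations to block.

    issue_minutes is None when the cluster is healthy, otherwise the number
    of minutes the current issue has lasted.
    """
    if issue_minutes is None:
        return set()
    blocked = {"bulk_blog_changes"}  # any issues: block bulk changes to blogs
    if issue_minutes > 10:
        blocked.add("doc_updates")   # >10 min: block doc updates
    if issue_minutes > 20:
        blocked.add("all_indexing")  # >20 min: block all indexing
    return blocked


for minutes in (None, 1, 15, 25):
    print(minutes, sorted(blocked_operations(minutes)))
```

Escalating in steps keeps queries responsive while dropping the least urgent indexing work first.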

Real Time Failures
1) Auto Retry Failed Indexing Jobs
2) Indexing Queue for Failures
3) Scrolling Queries to Find Bad Docs
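A minimal sketch of the first two points, retrying a failed indexing job and then parking it in a failure queue. The function names, the in-memory `deque`, and the retry count are assumptions for illustration; the talk does not describe the actual job system. Scrolling queries to find bad docs are not covered here.

```python
from collections import deque


def index_with_retry(doc, index_fn, failure_queue, max_attempts=3):
    """Try to index doc; after max_attempts failures, park it for later."""
    for _ in range(max_attempts):
        try:
            index_fn(doc)
            return True
        except Exception:
            continue  # a real job would log the error and back off here
    failure_queue.append(doc)
    return False


failure_queue = deque()
attempts = {"count": 0}


def flaky_index(doc):
    """Stand-in indexer that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated indexing failure")


def broken_index(doc):
    """Stand-in indexer that always fails."""
    raise ConnectionError("simulated indexing failure")


ok = index_with_retry({"id": 1}, flaky_index, failure_queue)
parked = index_with_retry({"id": 2}, broken_index, failure_queue)
print(ok, parked, list(failure_queue))  # True False [{'id': 2}]
```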

Cluster Restarts
Indexing across replicas is non-deterministic
Segments diverge
Slows Restart Time

Simplistic Example
[Diagram: docs are indexed into Shard 1, and the primary and the replica each run their own merges. One state is labeled "Segments w/ identical checksums"; the next is labeled "Only first segment is identical".]

After Bulk Index
Every segment is out of sync!
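A simplified model of why diverged segments slow restarts, assuming a restarting replica can reuse any segment whose checksum matches the primary's and must copy the rest. The segment names and checksums are made up, and real recovery involves more than this comparison.

```python
def segments_to_copy(primary, replica):
    """Return the primary segments the replica cannot reuse.

    Both arguments map segment name -> checksum.
    """
    return sorted(
        name for name, checksum in primary.items()
        if replica.get(name) != checksum
    )


# Made-up state in which only the first segment is identical.
primary = {"_0": "aa11", "_1": "bb22", "_2": "cc33"}
replica = {"_0": "aa11", "_1": "dd44", "_2": "ee55"}

print(segments_to_copy(primary, replica))  # ['_1', '_2']
```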

Our Bulk Indexing Procedure
1) Bulk Index All Docs
2) Optimize the index
3) Rolling Restart (sync segments)
4) Future restarts will be much faster.
- Play with recovery settings
- SSDs? => use the noop Linux I/O scheduler

Indexing
It’s all about handling Failures

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

Querying
Test and Iterate

Related Posts Query
Started with MoreLikeThis API.
Did not scale well enough.

MLT API
1) Get Document
2) Analyze Document
3) Search for Similar Docs

MLT API vs MLT Query

                  MLT API                 MLT Query
Throughput        147 req/sec             1062 req/sec
CPU               40%                     30%
Median latency    306 ms                  49.5 ms
Processing        All processing by ES    Build query in PHP

Related Posts Relevancy
Great With Long Content
{ "more_like_this": {
    "fields": ["mlt_content"],
    "like_text": "Scaling Elasticsearch Part 1: Overview ElasticSearch scaling Search We recently launched Related Posts across WordPress.com, so its time to pop the hood and take a look at what ended up in our engine... ",
    "percent_terms_to_match": 0.08,
    "boost_terms": 5,
    "analyzer": "en_analyzer"
}}
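The talk builds this query client-side in PHP. The sketch below shows the same idea in Python, using only the parameters that appear on the slide. The function name `build_mlt_query` is hypothetical, and the parameter names reflect the Elasticsearch version of the talk; later versions changed some of them.

```python
import json


def build_mlt_query(text, analyzer="en_analyzer"):
    """Build the more_like_this query body from a post's text."""
    return {
        "more_like_this": {
            "fields": ["mlt_content"],
            "like_text": text,
            "percent_terms_to_match": 0.08,
            "boost_terms": 5,
            "analyzer": analyzer,
        }
    }


# Wrap the query in a search request body and show it as JSON.
body = {"query": build_mlt_query("Scaling Elasticsearch Part 1: Overview")}
print(json.dumps(body, indent=2))
```

Building the query in the client means the text is already at hand, so the cluster only has to run the search step instead of fetching and analyzing the source document first.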

MLT Query Relevancy
Use match or multi_match for short content.

[Chart: Average Related Posts CTR]
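A sketch of switching query type by content length. The 50-word cutoff and the function name are assumptions, since the talk gives no threshold; the field names `title`, `content`, and `mlt_content` are taken from the slides.

```python
SHORT_CONTENT_WORDS = 50  # hypothetical cutoff; the talk gives no number


def related_posts_query(title, content):
    """Pick a query type based on how much text the post has."""
    if len(content.split()) < SHORT_CONTENT_WORDS:
        # Short content: match on the few words that are available.
        return {
            "multi_match": {
                "query": f"{title} {content}".strip(),
                "fields": ["title", "content"],
            }
        }
    # Long content: let more_like_this pick the significant terms.
    return {
        "more_like_this": {
            "fields": ["mlt_content"],
            "like_text": content,
            "percent_terms_to_match": 0.08,
            "boost_terms": 5,
        }
    }


short = related_posts_query("Photo of the day", "Sunset over the bay")
long_ = related_posts_query("Scaling", "word " * 200)
print(list(short), list(long_))  # ['multi_match'] ['more_like_this']
```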

Language Analyzers
arabic, armenian, basque, brazilian, bulgarian,
catalan, chinese, czech, danish, dutch, english,
finnish, french, galician, german, greek, hindi,
hungarian, indonesian, italian, japanese, korean,
norwegian, persian, portuguese, romanian,
russian, spanish, swedish, turkish, thai
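A sketch of picking an analyzer from a blog's language code. Only `en_analyzer` appears in the talk; the other analyzer names and the fallback to the built-in `standard` analyzer are assumptions for illustration.

```python
# Hypothetical mapping from language code to a custom analyzer name.
ANALYZERS = {
    "en": "en_analyzer",  # the only name that appears in the slides
    "fr": "fr_analyzer",
    "pt": "pt_analyzer",
}
FALLBACK_ANALYZER = "standard"


def analyzer_for(lang_code):
    """Return the analyzer for a language code such as 'en' or 'pt-BR'."""
    primary = (lang_code or "").lower().replace("_", "-").split("-")[0]
    return ANALYZERS.get(primary, FALLBACK_ANALYZER)


print(analyzer_for("en"))     # en_analyzer
print(analyzer_for("pt-BR"))  # pt_analyzer
print(analyzer_for("xx"))     # standard
```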

Related Posts Relevancy
How Important is using the correct Language Analyzer?
Doubled Click Through Rate

Unfortunately
Increased Slow Queries (>1 second) by 10x.
Still worth it.

Global Query Performance
search.wordpress.com

Parent-Child Filtering
Blog Doc
public: true|false
Post Doc
title: “...”
content: “...”

has_parent Filter
Querying Across All Shards

                  With has_parent    Without has_parent
Throughput        7.6 req/sec        17.5 req/sec
CPU               75%                50%
Median latency    503 ms             207 ms

Requires more Indexing
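A sketch of the two query shapes being compared, written in the `filtered` query style of the Elasticsearch 1.x era. The parent type name `blog` and the copied field `blog_public` are hypothetical; `public` and `content` come from the Blog Doc and Post Doc slide. Copying the flag onto each post is presumably what "Requires more Indexing" refers to, since every post must be updated when a blog's visibility changes.

```python
def with_has_parent(user_query):
    """Filter posts by a field stored on the parent blog document."""
    return {
        "filtered": {
            "query": {"match": {"content": user_query}},
            "filter": {
                "has_parent": {
                    "parent_type": "blog",  # hypothetical type name
                    "query": {"term": {"public": True}},
                }
            },
        }
    }


def without_has_parent(user_query):
    """Filter posts by a flag copied onto each post document."""
    return {
        "filtered": {
            "query": {"match": {"content": user_query}},
            "filter": {"term": {"blog_public": True}},  # hypothetical field
        }
    }


print(with_has_parent("elasticsearch")["filtered"]["filter"])
print(without_has_parent("elasticsearch")["filtered"]["filter"])
```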

Indexing:
Optimize to Handle Failures
Querying:
Test and Iterate

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues

Open Issues
Slow Queries (> 1 second)

Getting Better. Shards are too big.

Open Issues
What does it take to scale?
3x Data
5x Queries

Open Issues
Elasticsearch for Natural Language Processing?
At Scale.
On Live Data.

http://gibrown.wordpress.com
@gregibrown

Feeling Inspired?
http://automattic.com/work-with-us/data-wrangler/

Elasticsearch at Automattic