Elasticsearch at Automattic

Greg Ichneumon Brown
Data Wrangler at Automattic
http://gibrown.wordpress.com
@gregibrown
greg@automattic.com

Tuesday, February 25, 14

1 Billion Monthly Uniques


Elasticsearch Deployments
Internal Search
- 216 Internal Blogs - 750k docs [3 GB]
Support Documents
- KNN Link Prediction - 1.7m docs [14 GB]
Polldaddy
- Word Clouds/Freq Response - 39m docs [9 GB]
WordPress.com VIP Search
- KFF.org - 18m docs [99 MB]
- NY Post - 600k docs [2.3 GB]
WordPress.com - ~800m docs [4 TB]
- Related Posts - 48 mil reqs/day
- search.wordpress.com - 3 mil reqs/day

Overview of Related Posts
Our “10X Improvements”
- Indexing
- Querying
Our Open Issues


Related Posts

Search within just the one blog

WordPress.com
Total Elasticsearch Operations

Operation          Ops/Day
Routed Queries     23 mil
Global Queries     2 mil
Docs Indexed       13 mil
Docs Updated       10 mil
Docs Deleted       2.5 mil
Delete By Query    250k

Global Cluster
- DC1: 1 Master, 14 Data
- DC2: 1 Master, 14 Data
- DC3: 1 Master, 14 Data

Our Secret To Scaling
Routed Queries
All Posts for each Blog
are on the same Shard
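
A minimal sketch of what a routed query could look like, assuming the blog ID is used as the routing value. The index name `global-0` and the field name `blog_id` are hypothetical, and the `bool`/`filter` form comes from later Elasticsearch versions than the one in use when this talk was given.

```python
import json

def routed_search(index, blog_id, query):
    """Return (path, body) for a search routed to a single blog's shard."""
    # The `routing` URL parameter tells Elasticsearch which shard to query,
    # so the request touches one shard instead of every shard in the index.
    path = f"/{index}/_search?routing={blog_id}"
    body = {
        "query": {
            "bool": {
                "must": query,
                # Many blogs share a shard, so still filter by blog.
                # `blog_id` is an assumed field name.
                "filter": {"term": {"blog_id": blog_id}},
            }
        }
    }
    return path, json.dumps(body)

path, body = routed_search("global-0", 1234, {"match": {"title": "scaling"}})
print(path)  # /global-0/_search?routing=1234
```

The same routing value has to be supplied at index time, otherwise the post may land on a different shard than the one the query visits.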


Global Index

7 Indices
10 mil Blogs per Index
25 Shards per Index
175 Shards Total
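
The layout arithmetic can be sketched as follows. Two things here are assumptions, not taken from the slides: blogs are assigned to indices by contiguous ID range, and `zlib.crc32` stands in for whatever hash Elasticsearch applies to the routing value.

```python
import zlib

NUM_INDICES = 7
BLOGS_PER_INDEX = 10_000_000
SHARDS_PER_INDEX = 25

def locate(blog_id):
    """Map a blog to (index number, shard number) under the assumptions above."""
    index_no = blog_id // BLOGS_PER_INDEX  # assumed: contiguous ID ranges
    if index_no >= NUM_INDICES:
        raise ValueError("blog_id beyond the capacity of 7 indices")
    # Stand-in hash; Elasticsearch uses its own hash of the routing value.
    shard_no = zlib.crc32(str(blog_id).encode()) % SHARDS_PER_INDEX
    return index_no, shard_no

total_shards = NUM_INDICES * SHARDS_PER_INDEX
print(total_shards)  # 175
```

Because the shard depends only on the blog ID, every post of a blog resolves to the same (index, shard) pair.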

20% Improvements
Don’t solve scaling problems


Indexing

Entangling Elasticsearch
with Existing Systems


Bulk Indexing 1.0
44 Days to Index all Posts
(estimated)


Bulk Indexing Problems
- Overhead: Spent too much time starting indexing jobs. WordPress.com has 500 mil MySQL tables.
- High DB Load: Corner cases. Blogs with 1+ mil followers.
- High DB Load: Indexing sequentially doesn’t spread the load.
- High DB Load: Heavy load on archive DBs.
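
One way to avoid hammering a single database is to interleave jobs across hosts. This is a sketch of the general idea, not necessarily what was deployed; the `(db_host, blog_id)` job shape is hypothetical.

```python
from itertools import zip_longest

def interleave_by_host(jobs):
    """Reorder (db_host, blog_id) jobs so consecutive jobs hit different hosts."""
    by_host = {}
    for host, blog_id in jobs:
        by_host.setdefault(host, []).append((host, blog_id))
    ordered = []
    # Take one job from each host per round (round-robin).
    for round_ in zip_longest(*by_host.values()):
        ordered.extend(job for job in round_ if job is not None)
    return ordered

jobs = [("db1", 1), ("db1", 2), ("db1", 3), ("db2", 4), ("db2", 5)]
print(interleave_by_host(jobs))
# [('db1', 1), ('db2', 4), ('db1', 2), ('db2', 5), ('db1', 3)]
```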


Bulk Indexing Today
12.0?
4 Days to Index all Posts
(running right now)


Real Time Indexing
The Hardest Part!


Real Time Goals
1) Eventually Consistent
2) Minimize Bulk Re-indexing
3) Normally updated < 1 minute
Bulk reindexed 3 times in 5 months.
One intentional,
Two during system upgrades.

Stuff Fails
1) Humans
2) Hardware
3) Elasticsearch (steady improvements)
Combinations of the above.


Hardware Problems
1) Detect and Track Down Servers
2) Prioritize Queries over Indexing
3) Throttle Indexing Jobs
- any issues: block bulk changes to blogs
- >10 min: block doc updates
- >20 min: block all indexing
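
The escalation rules above can be sketched as a small function. It assumes the thresholds refer to how long a problem has been going on; the level names are made up for illustration.

```python
def throttle_level(minutes_degraded):
    """Return which class of indexing work to block.

    `minutes_degraded` is None when the cluster is healthy, otherwise the
    number of minutes the current problem has lasted (an assumed metric).
    """
    if minutes_degraded is None:
        return "none"
    if minutes_degraded > 20:
        return "all_indexing"
    if minutes_degraded > 10:
        return "doc_updates"
    # Any issue at all: stop bulk changes to blogs first.
    return "bulk_blog_changes"

for minutes in (None, 2, 15, 30):
    print(minutes, throttle_level(minutes))
```

Each level is meant to include the ones before it, so queries keep priority as the problem drags on.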

Real Time Failures
1) Auto Retry Failed Indexing Jobs
2) Indexing Queue for Failures
3) Scrolling Queries to Find Bad Docs
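
A minimal sketch of the first two points, retry and then a failure queue. The in-memory `deque` stands in for a persistent queue, and `index_fn` is a placeholder for whatever actually sends the document to Elasticsearch.

```python
from collections import deque

failure_queue = deque()  # stand-in for a persistent indexing queue

def index_with_retry(doc, index_fn, max_attempts=3):
    """Try to index `doc`; after repeated failures, park it for later."""
    for _attempt in range(max_attempts):
        try:
            index_fn(doc)
            return True
        except Exception:
            continue  # a real job would log and back off here
    failure_queue.append(doc)
    return False

# Usage with a fake indexer that fails twice, then succeeds.
calls = {"n": 0}

def flaky_index(doc):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("cluster busy")

print(index_with_retry({"id": 1}, flaky_index))  # True
```

Documents that end up in the queue can be replayed once the cluster recovers, which is what keeps the index eventually consistent without a full bulk reindex.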


Cluster Restarts
Indexing across replicas is
non-deterministic
Segments diverge
Slows Restart Time

Simplistic Example
[Diagram: Docs, Shard 1 (Primary and Replica), merges]
- Segments w/ identical checksums
- Only first segment is identical

After Bulk Index
Every segment is
out of sync!


Our Bulk Indexing Procedure
1) Bulk Index All Docs
2) Optimize the index
3) Rolling Restart (sync segments)
4) Future restarts will be much faster.
- Play with recovery settings
- SSDs? => use noop Linux scheduling
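
The procedure could be written down as a list of steps. This sketch only builds the plan and sends nothing. Pausing shard allocation around the restart is a common practice assumed here, not something the slides state, and the `_bulk` body is a placeholder.

```python
def bulk_reindex_plan(index):
    """Return the procedure as (description, method, path, body) steps."""
    return [
        ("bulk index all docs", "POST", f"/{index}/_bulk", "<ndjson batches>"),
        # `_optimize` is the Elasticsearch 1.x name; later versions call it
        # `_forcemerge`.
        ("optimize", "POST", f"/{index}/_optimize", None),
        # Rolling restart: pause allocation, restart one node at a time,
        # then resume allocation (assumed practice).
        ("pause allocation", "PUT", "/_cluster/settings",
         {"transient": {"cluster.routing.allocation.enable": "none"}}),
        ("restart each node in turn", None, None, None),
        ("resume allocation", "PUT", "/_cluster/settings",
         {"transient": {"cluster.routing.allocation.enable": "all"}}),
    ]

for description, method, path, _body in bulk_reindex_plan("posts"):
    print(description, method, path)
```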

Indexing
It’s all about handling Failures


Querying
Test and Iterate


Related Posts Query
Started with MoreLikeThis API.
Did not scale well enough.


MLT API
1) Get Document
2) Analyze Document
3) Search for Similar Docs


MLT API vs MLT Query

                  MLT API                 MLT Query
Throughput        147 req/sec             1062 req/sec
CPU               40%                     30%
Median latency    306 ms                  49.5 ms
Processing        All processing by ES    Build query in PHP
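
Taking the figures from the comparison above, the relative gains work out as follows. This is plain arithmetic on the reported numbers.

```python
mlt_api = {"req_per_sec": 147, "median_ms": 306}
mlt_query = {"req_per_sec": 1062, "median_ms": 49.5}

throughput_gain = mlt_query["req_per_sec"] / mlt_api["req_per_sec"]
latency_gain = mlt_api["median_ms"] / mlt_query["median_ms"]

print(f"{throughput_gain:.1f}x throughput, {latency_gain:.1f}x lower latency")
# 7.2x throughput, 6.2x lower latency
```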

Related Posts Relevancy
Great With Long Content
{ "more_like_this": {
    "fields": ["mlt_content"],
    "like_text": "Scaling Elasticsearch Part 1: Overview ElasticSearch scaling Search We recently launched Related Posts across WordPress.com, so its time to pop the hood and take a look at what ended up in our engine... ",
    "percent_terms_to_match": 0.08,
    "boost_terms": 5,
    "analyzer": "en_analyzer"
}}
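
A sketch of building that query client-side. The talk did this in PHP; Python is used here for illustration. Concatenating title, tags, and content into `like_text` is a guess based on the sample text above, and the function name and arguments are hypothetical. The parameter names are the ones on the slide; later Elasticsearch versions renamed some of them.

```python
def build_mlt_query(title, tags, content, analyzer="en_analyzer"):
    """Build a more_like_this query body from a post's own text."""
    parts = (title, " ".join(tags), content)
    like_text = " ".join(part for part in parts if part)
    return {
        "more_like_this": {
            "fields": ["mlt_content"],
            "like_text": like_text,
            "percent_terms_to_match": 0.08,
            "boost_terms": 5,
            "analyzer": analyzer,
        }
    }

query = build_mlt_query("Scaling Elasticsearch", ["scaling", "search"], "We recently launched")
print(query["more_like_this"]["like_text"])
# Scaling Elasticsearch scaling search We recently launched
```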

MLT Query Relevancy
Use match or multi_match for
short content.
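
A sketch of switching query type by content length. The 50-word cutoff is an arbitrary assumption, not a number from the talk.

```python
def related_query(text, fields, min_words_for_mlt=50):
    """Pick more_like_this for long text, multi_match for short text."""
    if len(text.split()) >= min_words_for_mlt:
        return {"more_like_this": {"fields": fields, "like_text": text}}
    return {"multi_match": {"query": text, "fields": fields}}

short = related_query("Hello world", ["title", "content"])
print(list(short))  # ['multi_match']
```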

Average Related Posts CTR

Language Analyzers
arabic, armenian, basque, brazilian, bulgarian,
catalan, chinese, czech, danish, dutch, english,
finnish, french, galician, german, greek, hindi,
hungarian, indonesian, italian, japanese, korean,
norwegian, persian, portuguese, romanian,
russian, spanish, swedish, turkish, thai
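
Choosing an analyzer per blog language could look like this. The mapping is deliberately partial, and both the locale-code format and the fallback to `standard` are assumptions.

```python
# Partial, illustrative mapping from blog locale to analyzer name.
ANALYZER_BY_LANG = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "pt-br": "brazilian",
    "pt": "portuguese",
}

def analyzer_for(lang_code, default="standard"):
    """Return the analyzer for a locale, trying the full code, then its base."""
    code = lang_code.lower()
    if code in ANALYZER_BY_LANG:
        return ANALYZER_BY_LANG[code]
    base = code.split("-")[0]
    return ANALYZER_BY_LANG.get(base, default)

print(analyzer_for("pt-BR"), analyzer_for("en-GB"), analyzer_for("xx"))
# brazilian english standard
```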


How Important is using the
correct Language Analyzer?
Doubled Click Through Rate

Unfortunately, it increased slow queries (> 1 second) by 10x.
Still worth it.

Global Query Performance
search.wordpress.com


Parent-Child Filtering
Blog Doc
public: true|false
Post Doc
title: “...”
content: “...”


has_parent Filter
Querying Across All Shards

                  With has_parent    Without has_parent
Throughput        7.6 req/sec        17.5 req/sec
CPU               75%                50%
Median latency    503 ms             207 ms
Requires more Indexing
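
The two query shapes being compared might look like the sketch below. It assumes the alternative to `has_parent` is copying the blog's visibility flag onto every post document, which the slides do not spell out. The parent type name `blog` and the field `blog_public` are hypothetical, and the `bool`/`filter` form is from later Elasticsearch versions.

```python
def public_posts_query(text, use_has_parent):
    """Search post content, restricted to posts on public blogs."""
    match = {"match": {"content": text}}
    if use_has_parent:
        # Visibility lives on the parent blog doc and is checked per query.
        visibility = {
            "has_parent": {
                "parent_type": "blog",
                "query": {"term": {"public": True}},
            }
        }
    else:
        # Assumed alternative: flag denormalized onto each post, which means
        # reindexing a blog's posts whenever its visibility changes.
        visibility = {"term": {"blog_public": True}}
    return {"query": {"bool": {"must": match, "filter": visibility}}}

print(public_posts_query("elasticsearch", use_has_parent=False))
```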


Indexing:
Optimize to Handle Failures
Querying:
Test and Iterate

Open Issues
Slow Queries (> 1 second)

Getting Better. Shards are too big.

Open Issues
What does it take to scale?
3x Data
5x Queries


Open Issues
Elasticsearch for Natural
Language Processing?
At Scale.
On Live Data.


http://gibrown.wordpress.com
@gregibrown

Feeling Inspired?
http://automattic.com/work-with-us/data-wrangler/

