Presentation from the Elasticsearch Denver Meetup.
Discusses scaling Elasticsearch for Related Posts across WordPress.com and some of the big changes needed to handle 23 million queries a day across 800 million documents.
15. Bulk Indexing 1.0
44 Days to Index all Posts
(estimated)
Tuesday, February 25, 14
16. Bulk Indexing Problems
- Overhead: Spent too much time starting indexing jobs. WordPress.com has 500 mil MySQL tables.
- High DB Load: Corner cases. Blogs with 1+ mil followers.
- High DB Load: Indexing sequentially doesn’t spread the load.
- High DB Load: Heavy load on archive DBs.
20. Real Time Goals
1) Eventually Consistent
2) Minimize Bulk Re-indexing
3) Normally updated < 1 minute
Bulk reindexed 3 times in 5 months.
One intentional,
Two during system upgrades.
21. Stuff Fails
1) Humans
2) Hardware
3) Elasticsearch (steady improvements)
Combinations of the above.
22. Hardware Problems
1) Detect and Track Down Servers
2) Prioritize Queries over Indexing
3) Throttle Indexing Jobs
- any issues: block bulk changes to blogs
- >10 min: block doc updates
- >20 min: block all indexing
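The throttling tiers above can be read as a small policy function. This is a minimal sketch, not the WordPress.com implementation: the `Job` categories and the `blocked_jobs` name are hypothetical, and only the three thresholds come from the slide.

```python
from enum import Enum


class Job(Enum):
    # Hypothetical job categories, chosen to match the three tiers on the slide.
    BULK_BLOG_CHANGE = "bulk_blog_change"  # e.g. reindexing a whole blog
    DOC_UPDATE = "doc_update"              # single-document update
    OTHER_INDEXING = "other_indexing"      # any remaining indexing work


def blocked_jobs(issue_minutes):
    """Return the set of job types to block.

    issue_minutes is None when the cluster is healthy, otherwise the number
    of minutes the current problem has lasted.
    """
    if issue_minutes is None:
        return set()
    blocked = {Job.BULK_BLOG_CHANGE}   # any issues: block bulk changes to blogs
    if issue_minutes > 10:
        blocked.add(Job.DOC_UPDATE)    # >10 min: block doc updates
    if issue_minutes > 20:
        blocked.update(Job)            # >20 min: block all indexing
    return blocked
```

An indexing worker would check `job_type in blocked_jobs(...)` before starting and requeue the job if it is blocked, which keeps queries prioritized over indexing while the cluster recovers.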
23. Real Time Failures
1) Auto Retry Failed Indexing Jobs
2) Indexing Queue for Failures
3) Scrolling Queries to Find Bad Docs
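The first two items can be sketched as a retry loop backed by a failure queue. This is an illustration under assumptions: `index_fn` stands in for whatever call actually sends a document to Elasticsearch, and the function names and the in-memory `deque` are hypothetical (a production queue would be persistent).

```python
from collections import deque


def index_with_retry(index_fn, doc, failure_queue, max_attempts=3):
    """Try to index doc; after max_attempts failures, park it in failure_queue.

    Returns True if indexing succeeded, False if the doc was queued.
    """
    for _ in range(max_attempts):
        try:
            index_fn(doc)
            return True
        except Exception:
            continue  # a real job would log and back off before retrying
    failure_queue.append(doc)
    return False


def drain_failures(index_fn, failure_queue, max_attempts=3):
    """Re-run every queued doc once; docs that fail again are re-queued."""
    for _ in range(len(failure_queue)):
        doc = failure_queue.popleft()
        index_with_retry(index_fn, doc, failure_queue, max_attempts)
```

Calling `drain_failures` periodically gives the eventually consistent behavior named in the goals: a document that could not be indexed during an outage is picked up again once the cluster is healthy.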
27. Our Bulk Indexing Procedure
1) Bulk Index All Docs
2) Optimize the index
3) Rolling Restart (sync segments)
4) Future restarts will be much faster.
- Play with recovery settings
- SSDs? => use noop Linux scheduling
32. MLT API
1) Get Document
2) Analyze Document
3) Search for Similar Docs
33. MLT API vs MLT Query
                  MLT API                MLT Query
Throughput        147 req/sec            1062 req/sec
CPU               40%                    30%
Median latency    306 ms                 49.5 ms
Processing        All processing by ES   Build query in PHP
34. Related Posts Relevancy
Great With Long Content
{
  "more_like_this": {
    "fields": ["mlt_content"],
    "like_text": "Scaling Elasticsearch Part 1: Overview ElasticSearch scaling Search We recently launched Related Posts across WordPress.com, so its time to pop the hood and take a look at what ended up in our engine... ",
    "percent_terms_to_match": 0.08,
    "boost_terms": 5,
    "analyzer": "en_analyzer"
  }
}
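Building this query on the client side is what the MLT Query approach amounts to. The slides say the query is built in PHP; the sketch below uses Python purely for illustration, and the `build_mlt_query` name and its defaults are assumptions. The parameter names and values are the ones shown on the slide, which reflect the Elasticsearch version in use at the time; later releases renamed some of them.

```python
import json


def build_mlt_query(text, fields=("mlt_content",), analyzer="en_analyzer"):
    """Build a more_like_this query body from a post's text."""
    return {
        "more_like_this": {
            "fields": list(fields),
            "like_text": text,
            "percent_terms_to_match": 0.08,  # value taken from the slide
            "boost_terms": 5,                # value taken from the slide
            "analyzer": analyzer,
        }
    }


# Serialize to JSON, ready to be sent as part of a search request body.
body = json.dumps(build_mlt_query("Scaling Elasticsearch Part 1: Overview ..."))
```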
35. MLT Query Relevancy
Use match or multi_match for short content.
[Chart: Average Related Posts CTR]
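One way to act on this advice is to pick the query type from the length of the content. The sketch below is an assumption-heavy illustration: the 50-word cutoff, the `title` and `content` field names, and the `related_posts_query` name are all hypothetical, since the slides do not say where the line between short and long content falls.

```python
def related_posts_query(text, min_words_for_mlt=50):
    """Use more_like_this for long content and multi_match for short content.

    min_words_for_mlt is a hypothetical cutoff, not a value from the talk.
    """
    if len(text.split()) >= min_words_for_mlt:
        return {
            "more_like_this": {
                "fields": ["mlt_content"],
                "like_text": text,
            }
        }
    return {
        "multi_match": {
            "query": text,
            "fields": ["title", "content"],  # hypothetical field names
        }
    }
```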
42. has_parent Filter
Querying Across All Shards
                  With has_parent   Without has_parent
Throughput        7.6 req/sec       17.5 req/sec
CPU               75%               50%
Median latency    503 ms            207 ms
Requires more Indexing
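The two variants being compared might look roughly like the following. This is a sketch under stated assumptions: the slides do not show the actual queries, so the `blog` parent type and the `lang` and `blog_lang` fields are hypothetical, and the second function assumes the alternative was to copy the needed blog-level field onto every post at index time, which would be consistent with the note about requiring more indexing. The `filtered` query and `has_parent` filter shapes follow the 1.x-era query DSL.

```python
def with_has_parent(post_query, blog_lang):
    """Filter posts by a field stored on the parent blog document."""
    return {
        "filtered": {
            "query": post_query,
            "filter": {
                "has_parent": {
                    "parent_type": "blog",  # hypothetical parent type
                    "query": {"term": {"lang": blog_lang}},
                }
            },
        }
    }


def without_has_parent(post_query, blog_lang):
    """Filter on a copy of the blog field stored on each post (assumed design)."""
    return {
        "filtered": {
            "query": post_query,
            "filter": {"term": {"blog_lang": blog_lang}},  # hypothetical field
        }
    }
```

The second form trades extra indexing work (every post must be updated when the blog-level value changes) for a cheaper query.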