2. BACKGROUND
• FRED DE VILLAMIL, 38 YEARS OLD, DIRECTOR OF INFRASTRUCTURE
@SYNTHESIO
• LINUX / (FREE)BSD SINCE 1996
• OPEN SOURCE CONTRIBUTOR SINCE 1998
• HAS RUN ELASTICSEARCH IN PRODUCTION SINCE 0.17.6
3. ABOUT SYNTHESIO
• Synthesio is the leading social intelligence tool for
social media monitoring & social analytics.
• Synthesio crawls the Web for relevant data,
enriches it with sentiment analysis and
demographics to build social analytics dashboards.
4. ELASTICSEARCH @SYNTHESIO, SEPTEMBER 2016
• 5 clusters, 163 physical servers, 400TB storage,
10.2TB RAM
• 75B indexed documents, 200TB data
• 1.8B indexed documents each month: mix of Web
pages, forums and social media posts
6. DEC. 2014: THE MYSQL NIGHTMARE
• Cross-cluster queries on 3 massive Galera clusters
• Up to 50M rows fetched from a massive 4B-row
reference table
• Then a cross-cluster join on a 20TB, 35B-record
monolithic MySQL database
• Poor performance, frequent timeouts
7. JAN. 2015: CLIPPING REVOLUTION
• 1 global index, 512 shards, 5B documents
• 1,000 new documents / second
• 47 servers running Elasticsearch 1.3.2, then 1.3.9
• Capacity: 37TB storage, 24TB data, 2.62TB RAM
9. CLIPPING REVOLUTION DATA MODEL
• ROUTING ON A MONTHLY
BASIS
• EACH CRAWLED DOCUMENT
IS INDEXED WITH NESTED
DASHBOARD IDS.
• QUERIES ON TIME PERIOD
+ DASHBOARD ID
{ "document": {
"dashboards": {
"dashboard_id": 1,
"dashboard_id": 2
}
}
}
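A slide-9-style query could look like the following sketch (Python with requests; host, index and field names are assumptions, and the routing key mirrors the monthly routing described above):

import requests

# Hedged sketch: time-period filter plus a nested dashboard_id filter,
# routed with the monthly key (ES 1.x "filtered" query syntax).
query = {
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"range": {"date": {"gte": "2015-01-01", "lt": "2015-02-01"}}},
                        {"nested": {
                            "path": "dashboards",
                            "filter": {"term": {"dashboards.dashboard_id": 1}},
                        }},
                    ]
                }
            }
        }
    }
}
requests.post(
    "http://localhost:9200/documents/_search",
    params={"routing": "2015-01"},  # monthly routing key (format assumed)
    json=query,
)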
10. PROBLEMS
• TOO MANY SHARDS (WAS MEANT TO BE A WEEKLY ROUTING)
• 500GB TO 900GB SHARDS (!!!), STILL GROWING AFTER THE
MONTH IS OVER; 3 HOURS FOR A REALLOCATION
• A ROLLING RESTART TAKES 3 FULL DAYS (IF WE’RE LUCKY)
• GARBAGE COLLECTOR NIGHTMARE: A CONSTANTLY FLAPPING
CLUSTER
11. MMAPFS VS NIOFS
• MMAPFS: MAPS LUCENE FILES INTO VIRTUAL MEMORY
USING MMAP. NEEDS AS MUCH MEMORY AS THE FILES BEING
MAPPED
• NIOFS: APPLIES A SHARED LOCK ON LUCENE FILES AND
RELIES ON THE FILE SYSTEM CACHE
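The store type is a per-index setting; a minimal sketch of forcing niofs at index creation (host and index name assumed):

import requests

# Create an index that uses niofs instead of the default store type.
requests.put(
    "http://localhost:9200/documents",
    json={"settings": {"index.store.type": "niofs"}},
)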
12. CMS VS G1GC
• CMS: SHARES CPU TIME WITH THE APPLICATION. “STOPS
THE WORLD” WHEN THERE IS TOO MUCH MEMORY TO CLEAN,
UNTIL IT THROWS AN OUTOFMEMORYERROR
• G1GC: SHORTER BUT MORE FREQUENT PAUSES. WON’T STOP A
NODE FOR SO LONG THAT IT LEAVES THE CLUSTER
13. G1GC OPTIONS
-XX:MaxGCPauseMillis=200: ENSURES LONGER GARBAGE
COLLECTIONS...
-XX:GCPauseIntervalMillis=1000: ...BUT LESS FREQUENT ONES
-XX:InitiatingHeapOccupancyPercent=35: STARTS
COLLECTING WHEN THE HEAP IS 35% USED
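Wired together, the switch from CMS to G1GC comes down to swapping the collector flags in the JVM options, roughly along these lines (the exact plumbing depends on your init scripts):

ES_JAVA_OPTS="-XX:-UseConcMarkSweepGC -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:GCPauseIntervalMillis=1000 \
  -XX:InitiatingHeapOccupancyPercent=35"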
14. FIELD DATA CACHE EXPIRE
• FORCES ELASTICSEARCH TO PERIODICALLY EMPTY ITS INTERNAL
FIELDDATA CACHE
• OVERLAPS WITH THE GARBAGE COLLECTOR’S JOB
• PERFORMANCE ISSUES WITH FREQUENTLY ACCESSED DATA
• USE FIELDDATA CIRCUIT BREAKERS TO STOP GREEDY QUERIES
(SETTINGS BELOW)
• ELASTIC SAYS NEVER DO THIS!!! BUT IT FIXED OUR BIGGEST
PROBLEM
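The settings in question look like this (values are illustrative, not Synthesio's; the expire setting is the one Elastic advises against, the breaker name is the ES 1.3 one):

indices.fielddata.cache.expire: 10m (forces periodic eviction; value assumed)
indices.fielddata.breaker.limit: 60% (circuit breaker against greedy queries)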
15. MORE PROBLEMS
• IMMUTABLE, MONOLITHIC MAPPING: NEW FEATURES BLOCKED
UNTIL WE FIX IT
• IMPOSSIBLE TO DELETE A DASHBOARD WITHOUT
REINDEXING A WHOLE MONTH
• 20% DELETED DOCUMENTS WASTING 3TB
16. IMMUTABLE MAPPING AND DELETED DATA
• SEGMENTS: IMMUTABLE FILES USED BY LUCENE TO WRITE
ITS DATA. UP TO 2,500 / SHARD (!!!)
• NO REAL DELETES: UPDATED AND DELETED DOCUMENTS JUST GET
A DELETED FLAG
• ELASTICSEARCH _OPTIMIZE: MERGES A SHARD’S SEGMENTS INTO
1 AND PURGES DELETED DOCS (EXAMPLE BELOW)
• BUT: REQUIRES 150% OF THE SHARD SIZE ON DISK
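A minimal sketch of the _optimize call (ES 1.x API; host and index name assumed):

import requests

# Merge the index's segments down to 1 per shard, purging deleted docs.
requests.post(
    "http://localhost:9200/documents-2015.01/_optimize",
    params={"max_num_segments": 1},
)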
20. NEW PRODUCT DESIGN
• 1 INDEX / DASHBOARD, 1 SHARD / 5 MILLION DOCS
• VERSIONED MAPPINGS: MAPPING_ID__DASHBOARD_ID
• MULTIPLE MAPPING VERSIONS OF A DASHBOARD IN PARALLEL
• MAPPING UPGRADES AND REINDEXING WITHOUT INTERRUPTION
• BALDUR FOR DASHBOARD ROUTING
22. BALDUR IN A NUTSHELL
1. THE API SERVER SENDS AN ELASTICSEARCH QUERY
2. BALDUR INTERCEPTS THE QUERY AND GETS THE
DASHBOARD CLUSTER ID AND ACTIVE MAPPING VERSION
3. BALDUR ROUTES THE QUERY TO THE CLUSTER HOSTING
THE DASHBOARD DATA
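Baldur itself is in-house, but the lookup in step 2 could be sketched like this (table and column names are assumptions):

import sqlite3  # stand-in for Baldur's real database

def route(conn, dashboard_id):
    # Resolve the dashboard to its cluster and active mapping version,
    # then derive the mapping_id__dashboard_id index name.
    cluster, mapping_id = conn.execute(
        "SELECT cluster, mapping_id FROM dashboards WHERE dashboard_id = ?",
        (dashboard_id,),
    ).fetchone()
    return cluster, "%s__%s" % (mapping_id, dashboard_id)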
23. ADDING A MAPPING VERSION
• THE INDEXER CREATES A NEW NEW_MAPPING_ID__DASHBOARD_ID
INDEX
• THE INDEXER ADDS A LINE TO BALDUR’S DATABASE WITH THE
DASHBOARD AND MAPPING IDS
• THE INDEXER INDEXES INTO BOTH MAPPING_ID__DASHBOARD_ID AND
NEW_MAPPING_ID__DASHBOARD_ID (SKETCH BELOW)
• WHEN NEW_MAPPING_ID__DASHBOARD_ID HAS CAUGHT UP,
BALDUR SWITCHES THE ACTIVE MAPPING
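During the catch-up phase, every new document is written to both indexes; a sketch (ids, host and document are placeholders):

import requests

doc = {"content": "..."}
# Dual-write to the current and the new mapping's index until the
# new one has caught up, then Baldur flips the active mapping.
for index in ("12__4242", "13__4242"):  # mapping_id__dashboard_id
    requests.post("http://localhost:9200/%s/document" % index, json=doc)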
24. TOO MANY LUCENE SEGMENTS
• EACH DATA NODE HOSTS THOUSANDS OF LUCENE SEGMENTS
• 75% OF THE HEAP IS USED FOR SEGMENT MANAGEMENT
• WE CREATE MORE SEGMENTS THAN WE’RE ABLE TO OPTIMIZE
• CONTINUOUS OPTIMIZATION SCRIPTS, TAKING THE INDEXES WITH
THE MOST DELETED DOCS FIRST (SKETCH BELOW)
• CONTINUOUS CLEANUP OF OLD INDEXES
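The continuous-optimization loop could be sketched like this (host assumed; _stats/docs exposes the deleted-docs counts):

import requests

# Rank indexes by deleted documents and optimize the worst one first.
stats = requests.get("http://localhost:9200/_stats/docs").json()
worst = max(
    stats["indices"],
    key=lambda i: stats["indices"][i]["primaries"]["docs"]["deleted"],
)
requests.post(
    "http://localhost:9200/%s/_optimize" % worst,
    params={"max_num_segments": 1},
)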
25. MYSQL CAN’T KEEP UP
• BULK INDEXING WITH RANDOM READS OF 5,000 DOCS BRINGS
MYSQL TO ITS KNEES
• FETCH THE DOCUMENTS FROM BLACKHOLE IN BATCHES OF 5,000
• IF SOME DOCUMENTS ARE MISSING, FETCH THEM FROM MYSQL
(SKETCH BELOW)
• RESULT: 99.9% OF DOCUMENTS FETCHED FROM BLACKHOLE,
5× THROUGHPUT
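A sketch of that read path (Blackhole host, index and the MySQL fallback are placeholders):

import requests

def fetch_from_mysql(ids):
    return {}  # hypothetical fallback to the Galera clusters

def fetch(ids):
    # Try Blackhole first, in batches of up to 5,000 ids.
    r = requests.post(
        "http://blackhole:9200/documents/document/_mget",
        json={"ids": ids},
    ).json()
    found = {d["_id"]: d["_source"] for d in r["docs"] if d.get("found")}
    missing = [i for i in ids if i not in found]
    if missing:
        found.update(fetch_from_mysql(missing))
    return found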
27. RACK AWARENESS IN A NUTSHELL
1. DEFINE 2 VIRTUAL RACK IDS
2. ASSIGN EACH DATA NODE A RACK
3. ENABLE RACK AWARENESS (SETTINGS BELOW)
4. PRIMARY SHARDS PICK A SIDE, REPLICAS PICK THE
OTHER ONE
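In elasticsearch.yml terms, the steps look roughly like this (rack names assumed):

node.rack_id: rack_one (on half the data nodes)
node.rack_id: rack_two (on the other half)
cluster.routing.allocation.awareness.attributes: rack_id (enables the awareness)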
28. FULL CLUSTER RESTART IN 20 MINUTES
• CONFIGURATION TUNING REQUIRES LOTS OF RESTARTS
• RELY ON RACK AWARENESS TO RESTART HALF THE CLUSTER AT
ONCE (SKETCH BELOW)
• BLOCK SHARD ALLOCATION DURING THE SERVICE RESTART
• GET GREEN
• REPEAT
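A sketch of one half-cluster round (host assumed; the actual service restarts happen out of band):

import requests

ES = "http://localhost:9200"

def set_allocation(mode):
    # Toggle shard allocation around the restart ("none" / "all").
    requests.put(
        ES + "/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.enable": mode}},
    )

set_allocation("none")  # block shard allocation
# ... restart every Elasticsearch node in one rack here ...
set_allocation("all")
# wait until the cluster is green, then repeat with the other rack
requests.get(ES + "/_cluster/health",
             params={"wait_for_status": "green", "timeout": "60m"})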
30. WHERE DO YOU GO, MY LOVELY?
• PROBLEM: HOW DO WE KNOW WHICH DASHBOARDS A
NEW DOCUMENT FITS IN?
• STOP RELYING ON MYSQL AND SPHINX
• 50M NEW DOCUMENTS TO PROCESS PER DAY
31. SOLUTION: PERCOLATION
• A REVERSED SEARCH SYSTEM
• WE STORE QUERIES, NOT DOCUMENTS
• EACH NEW DOCUMENT IS MATCHED AGAINST OUR STORED
QUERIES (SKETCH BELOW)
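In ES 1.x, percolation means indexing queries into the special .percolator type, then matching documents against them; a minimal sketch (index, field and ids assumed):

import requests

ES = "http://localhost:9200"

# Store one dashboard's query...
requests.put(
    ES + "/percolator/.percolator/dashboard-4242",
    json={"query": {"match": {"content": "my brand"}}},
)
# ...then match a freshly crawled document against all stored queries.
r = requests.get(
    ES + "/percolator/document/_percolate",
    json={"doc": {"content": "someone mentioned my brand"}},
)
print([m["_id"] for m in r.json()["matches"]])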
32. PERCOLATION ISSUES
• IT TRIES TO MATCH EVERY STORED QUERY
• SO FAR, WE HAVE 35,000 STORED QUERIES
• RAW USAGE: 1,750,000,000,000 MATCHES A DAY
(50M DOCUMENTS × 35,000 QUERIES)
• CPU GREEDY
33. SOLUTIONS
• ROUTING ON THE DASHBOARD AND DOCUMENT
LANGUAGES FIRST
• THEN FILTERING AGAINST THE STORED QUERIES (SKETCH BELOW)
• RESULT: UP TO 100,000 QUERIES / SECOND
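Both tricks map onto the percolate API: routing narrows the shards, a filter on the stored queries' metadata narrows the candidates (field names assumed):

import requests

# Percolate with language routing plus a metadata filter, so a document
# is only matched against queries that could actually apply to it.
r = requests.get(
    "http://localhost:9200/percolator/document/_percolate",
    params={"routing": "en"},  # dashboard / document language
    json={
        "doc": {"content": "someone mentioned my brand"},
        "filter": {"term": {"language": "en"}},  # metadata on stored queries
    },
)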
37. BEFORE ELASTICSEARCH
• RUN QUERIES AGAINST A SPHINX CLUSTER TO GET THE
RIGHT DOCUMENT IDS
• FETCH THE DOCUMENTS FROM ONE GALERA CLUSTER AND THE
METADATA FROM ANOTHER GALERA CLUSTER
• MERGE AND DISPLAY THE DOCUMENTS
38. SPHINX NIGHTMARE
• CAN’T SCALE HORIZONTALLY: LIMITED TO 14 MONTHS OF DATA
• ONE COMPLEX QUERY AND THE WHOLE CLUSTER REACHES A
LOAD OF 400
• HEAVY LOAD ON MYSQL TOO
40. INDEXING
• A GO PROGRAM MERGES 3 GALERA CLUSTERS AND 1 ES
CLUSTER INTO A KAFKA QUEUE: 30,000 DOCUMENTS /
SECOND
• 8 GO INDEXERS MAP THE INDEX / DATA NODE
DISTRIBUTION AND PUSH THE DATA DIRECTLY TO THE RIGHT
DATA NODE: 60,000 DOCUMENTS / SECOND FOR 3
WEEKS, WITH 200,000 / SECOND PEAKS (SKETCH BELOW)
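The real pipeline is written in Go; this Python sketch only shows the shape of the second stage (topic, hosts and message format assumed):

import json
import requests
from kafka import KafkaConsumer  # pip install kafka-python

# Consume documents from the Kafka queue and push _bulk batches
# straight to the data node hosting the target index.
consumer = KafkaConsumer("documents", bootstrap_servers=["kafka:9092"])
batch = []
for message in consumer:
    doc = json.loads(message.value)
    batch.append(json.dumps({"index": {"_index": doc["index"], "_type": "document"}}))
    batch.append(json.dumps(doc["source"]))
    if len(batch) >= 10000:  # 5,000 documents per bulk request
        requests.post("http://datanode1:9200/_bulk",
                      data="\n".join(batch) + "\n")
        batch = []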
42. KAFKA IS TOO SLOW
• THE 72TB KAFKA QUEUE IS TOO SLOW: ONLY 10,000
DOCUMENTS / SECOND / PARTITION, BECAUSE OF
SPINNING DISKS
43. MASSIVE QUERIES CRASH HALF THE CLUSTER
• ELASTICSEARCH CACHES THE RESULTS OF FILTERED QUERIES:
SET _CACHE TO FALSE (EXAMPLE BELOW)
• UPGRADE TO 1.7.5: THE FILTERED QUERY CACHE HAS A
MEMORY LEAK IN 1.7.4
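In ES 1.x the cache switch sits inside the filter itself; a sketch (index and field assumed):

import requests

# A filtered query with result caching disabled on the filter.
query = {
    "query": {
        "filtered": {
            "filter": {"term": {"dashboard_id": 4242, "_cache": False}}
        }
    }
}
requests.post("http://localhost:9200/documents/_search", json=query)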
44. BIGGEST QUERIES ARE SLOW AS HELL
• DIVIDE THE GLOBAL QUERIES PER INDEX AND RUN THEM IN
PARALLEL (SKETCH BELOW)
• PROCESS THE RESULTS POST-QUERY AT THE API LEVEL
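A sketch of the fan-out (daily index names and the query are placeholders; here the API-level processing is just a sum of hit counts):

import requests
from concurrent.futures import ThreadPoolExecutor

indexes = ["documents-2016.09.%02d" % d for d in range(1, 31)]
query = {"query": {"match": {"content": "my brand"}}}

def search(index):
    # One bounded query per daily index instead of one global query.
    url = "http://localhost:9200/%s/_search" % index
    return requests.post(url, json=query).json()

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(search, indexes))
# post-process at the API level, e.g. merge the hit counts
total = sum(r["hits"]["total"] for r in results)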
46. UPGRADING A MAPPING, ON A LIVE CLUSTER
• CAN’T CHANGE A FIELD TYPE
• CAN’T UPDATE ANALYZERS LIVE
• CAN’T REORGANIZE THE MAPPING WITH EXISTING DATA
47. CLUSTER DESIGN
• 36 MONTHS OF DATA, 40 BILLION DOCUMENTS
• 70 DATA NODES, 3 MASTERS
• 1 INDEX PER DAY, 12 SHARDS, 1 REPLICA (TEMPLATE BELOW)
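An index template makes every daily index come up with that layout; a sketch (name pattern assumed):

import requests

# Every index matching the pattern gets 12 shards and 1 replica.
requests.put(
    "http://localhost:9200/_template/daily",
    json={
        "template": "documents-*",
        "settings": {"number_of_shards": 12, "number_of_replicas": 1},
    },
)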
48. REINDEXING
• USE LOGSTASH TO READ, TRANSFORM AND WRITE EXISTING
DATA ON EACH DATA NODE (CONF SKETCHED BELOW)
• EACH DATA NODE WRITES TO A DESIGNATED “FRIEND” NODE IN THE
OTHER HALF OF THE CLUSTER
• 5,000-DOCUMENT SCROLLS, 10 INDEXING WORKERS
• EACH SCROLL RUNS AGAINST A FULL DAY
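The per-node logstash.conf could look roughly like this (hosts and index names are assumptions; the transform filters are omitted):

input {
  elasticsearch {
    hosts  => ["localhost:9200"]
    index  => "documents-2016.09.01"
    size   => 5000
    scroll => "5m"
  }
}
output {
  elasticsearch {
    hosts   => ["friend-node:9200"]
    index   => "documents-v2-2016.09.01"
    workers => 10
  }
}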
50. ES CONFIGURATION CHANGES
indices.memory.index_buffer_size: 50% (instead of 10%)
index.store.throttle.type: none (as fast as your SSDs can go)
index.translog.disable_flush: true
index.refresh_interval: -1 (instead of 1s)
indices.store.throttle.max_bytes_per_sec: 2gb
51. PROBLEMS
• MISSING DOCUMENTS WHEN A SCROLL LOSES ITS SEARCH CONTEXT
• SOMETIMES THE INDEXING NODES CRASH
• LOGSTASH DOES NOT LIKE NETWORK ISSUES
• WE NEED TO REPLAY A FULL DAY TO CATCH UP WITH THE DATA
52. SOLUTIONS
• PLAY HOURLY QUERIES
• WRITE A SMALL ORCHESTRATOR
• INTRODUCING YOKO AND MOULINETTE
53. YOKO, THE REINDEXING ORCHESTRATOR
• A SMALL PYTHON DAEMON DRIVEN BY A MYSQL DATABASE THAT STORES, FOR EACH JOB:
• INDEX FROM
• INDEX TO
• LOGSTASH QUERY
• STATUS: TODO, PROCESSING, DONE, COMPLETE, FAILED
54. YOKO, THE REINDEXING ORCHESTRATOR
• CREATES THE DAILY INDEXES.
• COMPARES THE DOCUMENT COUNT OF THE INITIAL INDEX,
RUNNING THE LOGSTASH QUERY, AGAINST EACH “DONE” INDEX.
• MOVES EACH “DONE” LINE TO “COMPLETE” IF THE COUNTS
MATCH, OR TO “FAILED” OTHERWISE (SKETCH BELOW).
• DELETES EACH MONTHLY INDEX WHEN EVERY DAY OF THE MONTH
IS “COMPLETE”.
55. MOULINETTE, THE REINDEXING SCRIPT
• A SMALL BASH SCRIPT THAT QUERIES YOKO
• GENERATES THE LOGSTASH.CONF FILE FROM YOKO’S DATA
• RUNS LOGSTASH
• SWITCHES THE YOKO LINE TO “DONE” WHEN IT FINISHES
57. PROBLEMS
• LOGSTASH FIELD TRANSFORMS ARE SLOW
• WE SHOULD RUN INDEXING ON FEWER NODES
• SOMETIMES LOGSTASH HANGS AND NEEDS TO BE
FORCE-KILLED
• YOKO SHOULD DETECT THIS AND RAISE AN ERROR
59. BEFORE UPGRADING
• CHECK THAT YOUR PLUGINS CAN RUN ON 2.X
• CHECK THAT YOUR MAPPINGS ARE 2.X COMPLIANT
• CHECK FOR CONFIGURATION DEPRECATIONS
60. SEEMS EASY?
1. SHUT DOWN THE CLUSTER
2. UPGRADE ES TO 2.3
3. UPGRADE THE PLUGINS
4. START THE WHOLE CLUSTER, MASTERS FIRST
61. UNSUPPORTED PLUGINS AND ANALYZERS
• CAN’T UPDATE AN ANALYZER ON AN OPEN INDEX
• CLOSE ALL INDEXES
• APPLY A TEMPORARY DUMMY ANALYZER
• REOPEN THE INDEXES (SKETCH BELOW)
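A sketch of that close / swap / reopen dance on one index (the dummy analyzer here is a placeholder standing in for the unsupported one):

import requests

ES = "http://localhost:9200"
INDEX = "/documents-2016.09.01"

requests.post(ES + INDEX + "/_close")
requests.put(
    ES + INDEX + "/_settings",
    json={"analysis": {"analyzer": {
        "legacy_analyzer": {"type": "custom", "tokenizer": "standard"}
    }}},
)
requests.post(ES + INDEX + "/_open")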