MIGRATING A 130TB CLUSTER FROM
ELASTICSEARCH 2 TO 5 IN 20 HOURS WITHOUT
DOWNTIME
FRED DE VILLAMIL
@FDEVILLAMIL
OCTOBER 2017
ABOUT ME
FRED DE VILLAMIL, FORMER DIRECTOR OF INFRASTRUCTURE
@SYNTHESIO
FIRST ELASTICSEARCH IN PRODUCTION WAS 0.17.6
LINUX / (FREE)BSD USER SINCE 1996,
OPEN SOURCE CONTRIBUTOR SINCE 1998,
LOVES COOL TECHS, TENNIS, PHOTOGRAPHY, CUTE OTTERS,
INAPPROPRIATE HUMOR AND ELASTICSEARCH CLUSTERS OF UNUSUAL
SIZE.
WRITES ABOUT ES & MORE AT HTTPS://THOUGHTS.T37.NET
ABOUT SYNTHESIO
SYNTHESIO IS THE LEADING SOCIAL INTELLIGENCE TOOL FOR
SOCIAL MEDIA MONITORING & SOCIAL ANALYTICS
SYNTHESIO CRAWLS THE WEB FOR RELEVANT DATA, ENRICHES
IT WITH SENTIMENT ANALYSIS AND DEMOGRAPHICS TO BUILD
SOCIAL ANALYTICS DASHBOARDS.
ELASTICSEARCH @SYNTHESIO
8 production clusters:
• +600 hosts, all bare metal
• 3 data centers
• 1.7PB storage SSD / NVME
• 37.5TB RAM
Hardware:
• 6-core Xeon E5 v3 or dual 12-core
Xeon E5-2687W v4 (160 watts!!!)
• 64GB to 256GB RAM
• 4 x 800GB SSD / 2 x 1.2TB NVME
• RAID0 everywhere
We aggregate data from various cold
storage and make it searchable in a
jiffy.
Average cluster stats
• writes: 85k documents / second, 1.5M
at peak
• 800 searches / second, with some clusters
sustaining a continuous 25k searches /
second
• Doc size from 150KB to 200MB
THE BLACKHOLE CLUSTER
Topology
• 68 data nodes
• 3 master nodes
• 6 ingest nodes
• 200TB storage SSD
• 2.4TB heap
• 924 cores
Cluster stats:
• 1137 indices (daily)
• 27266 shards
• 130TB data
• 201 billion documents
• 7000 new documents / second
• 800 searches / second on the whole dataset
FEEDING BLACKHOLE FOR FUN AND PROFIT
BLACKHOLE ALLOCATION SETTINGS
"cluster.routing.allocation.node_initial_primaries_recoveries": 50
"cluster.routing.allocation.node_concurrent_recoveries": 20
"indices.recovery.max_bytes_per_sec": "2048mb"
"indices.recovery.concurrent_streams": "30"
"cluster.routing.allocation.disk.threshold_enabled": true
"cluster.routing.allocation.disk.watermark.low": "78%"
"cluster.routing.allocation.disk.watermark.high": "79%"
"cluster.routing.rebalance.enable": "all"
"cluster.routing.allocation.cluster_concurrent_rebalance": 50
"cluster.routing.allocation.allow_rebalance": "always"
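
As a minimal sketch, the settings above could be grouped into the body of a cluster settings update request (`PUT /_cluster/settings`). The `transient` scope and the helper name `build_allocation_settings` are assumptions for illustration; the slide does not say how the settings were applied, and whether each one can be changed at runtime depends on the Elasticsearch version.

```python
import json

# Values copied from the slide; Elasticsearch setting names are lowercase.
BLACKHOLE_ALLOCATION_SETTINGS = {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 50,
    "cluster.routing.allocation.node_concurrent_recoveries": 20,
    "indices.recovery.max_bytes_per_sec": "2048mb",
    "indices.recovery.concurrent_streams": "30",
    "cluster.routing.allocation.disk.threshold_enabled": True,
    "cluster.routing.allocation.disk.watermark.low": "78%",
    "cluster.routing.allocation.disk.watermark.high": "79%",
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 50,
    "cluster.routing.allocation.allow_rebalance": "always",
}


def build_allocation_settings(settings, scope="transient"):
    """Return a JSON body for PUT /_cluster/settings.

    The scope ("transient" or "persistent") is an assumption; the slide
    does not state which one was used.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return json.dumps({scope: settings}, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(build_allocation_settings(BLACKHOLE_ALLOCATION_SETTINGS))
```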
USING THE REINDEX API?
REINDEX API:
• NO SLICED SCROLL UNTIL ES
6.0
• SLOW
• MIGHT LOSE SOME DOCUMENTS,
NEEDS LOTS OF ERROR CONTROL
LOGSTASH:
• NO SLICED SCROLLS UNTIL ES
6.0
• FASTER THAN THE REINDEX API
• REALLY DOESN’T LIKE ERRORS
BEFORE UPGRADING
• USE THE UPGRADE CHECK PLUGIN TO VALIDATE CURRENT INDEXES
COMPATIBILITY
• UPGRADE YOUR MAPPING TEMPLATES TO BE ES 5 COMPLIANT
• CREATE THE NEXT 10 DAYS INDEXES (JUST IN CASE)
• TELL YOUR HOSTING PROVIDER YOU’RE GOING TO TRANSFER 130TB
IN 17 HOURS
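
To see why the hosting provider deserves a warning, here is a back-of-the-envelope sketch of the sustained throughput that moving 130TB in 17 hours implies. It assumes decimal terabytes and a perfectly even transfer rate, neither of which the slide states.

```python
def required_throughput(total_bytes, hours):
    """Return the sustained rate, in bytes per second, needed to move total_bytes."""
    return total_bytes / (hours * 3600)


TB = 10**12  # decimal terabyte (assumption: the slide does not say TB vs TiB)

rate = required_throughput(130 * TB, 17)
print(f"{rate / 10**9:.2f} GB/s, {rate * 8 / 10**9:.1f} Gbit/s")
# prints: 2.12 GB/s, 17.0 Gbit/s
```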
EXPANDING BLACKHOLE
OPS:
• +90 NEW SERVERS IN 2 NEW RACKS
• RAISED THE REPLICATION FACTOR TO 3
RESULT:
• 167 NODES
• 53626 SHARDS
• 279TB DATA
• 391TB STORAGE
• 5.42TB HEAP
• 2004 CORES
SETTINGS UPDATE DURING THE REPLICA INIT
"indices.recovery.max_bytes_per_sec": "4096mb"
"indices.recovery.concurrent_streams": "50"
"cluster.routing.allocation.disk.watermark.low": "98%"
"cluster.routing.allocation.disk.watermark.high": "99%"
"cluster.routing.rebalance.enable": "none"
PROBLEMS
• THE TRANSFER BROUGHT THE
WHOLE CLUSTER TO
ITS KNEES.
• THIS SLOWS DOWN THE
WRITES.
• THE BULK THREAD POOL
STARTS TO FILL UP.
SOLUTION: ZONING FOR FUN & PROFIT
• ALLOCATE THE FRESHEST AND
ONGOING DATA TO ONE ZONE
• SEGREGATE EVERYTHING ELSE IN A
DIFFERENT ZONE
• WAIT FOR THE CLUSTER TO CALM
DOWN
• TOTAL TIME SPENT ON THE
TRANSFER: 17 HOURS
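
One way to express this kind of zoning is shard allocation filtering, where an index is pinned to nodes carrying a given custom attribute. The sketch below only builds the index settings bodies. The attribute name `zone`, its values `fresh` and `archive`, and the helper name are hypothetical, and the slides do not say which mechanism was actually used.

```python
import json


def zone_settings(zone, attribute="zone"):
    """Build an index settings body requiring shards to be placed on nodes
    whose custom attribute (hypothetical name: "zone") equals `zone`."""
    return {"index.routing.allocation.require." + attribute: zone}


fresh_body = zone_settings("fresh")      # newest indices, still being written
archive_body = zone_settings("archive")  # everything else

print(json.dumps(fresh_body))
print(json.dumps(archive_body))
```

Each body would be sent to the update index settings endpoint of the relevant indices, and the nodes themselves would need the matching attribute declared in their configuration.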
SPLITTING THE CLUSTER IN 2
• SET
"cluster.routing.allocation.enable"
TO "all"
• SHUT DOWN 2 OF THE RACKS
• SHUT DOWN ONE OF THE
MASTERS
• SWITCH THE NUMBER OF
REPLICAS TO 1
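
The last step maps onto the update index settings API. A minimal sketch of the request body, assuming the change is applied to all indices at once through `PUT /_all/_settings`; the helper name is hypothetical.

```python
import json


def replicas_body(count):
    """Return the JSON body that sets the replica count on existing indices."""
    if count < 0:
        raise ValueError("replica count cannot be negative")
    return json.dumps({"index": {"number_of_replicas": count}})


print(replicas_body(1))  # {"index": {"number_of_replicas": 1}}
```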
BUILDING BLACKHOLE02
• RECONFIGURE THE 2 SHUT-DOWN RACKS AND MASTER SO
THEY TALK TO EACH OTHER
• START THE MASTER, ALONE, CLOSE THE INDEXES
• UPGRADE THE MASTER TO ES 5.1.1
• UPGRADE ALL THE PLUGINS
• START THE MASTER: THE WHOLE UPGRADE TOOK 32 SECONDS
BRINGING BACK THE DATA
• UPGRADE ES AND THE PLUGINS ON THE DATA NODES
• START ELASTICSEARCH
• WAIT 30 MINUTES FOR THE CLUSTER TO GO BACK GREEN
• PLUG A WORK UNIT TO CATCH UP WITH THE PAST 18 HOURS
OF DATA
• UPDATE THE LOAD BALANCER CONFIGURATION TO USE THE
NEWLY UPGRADED CLUSTER
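
The wait for a green cluster can be automated with a small polling loop. This is a sketch under stated assumptions: `fetch_status` is a caller-supplied, hypothetical callable that returns the `status` field of `GET /_cluster/health`, and the 30-minute default timeout simply mirrors the figure on the slide.

```python
import time


def wait_for_green(fetch_status, timeout_s=1800, poll_s=10, sleep=time.sleep):
    """Poll fetch_status() until it returns "green" or timeout_s has elapsed.

    Returns True if the cluster turned green in time, False otherwise.
    """
    waited = 0
    while True:
        if fetch_status() == "green":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


# Example with a fake cluster that turns green on the third check.
statuses = iter(["red", "yellow", "green"])
print(wait_for_green(lambda: next(statuses), sleep=lambda s: None))  # True
```

Only once this returns True would the load balancer be switched over to the upgraded cluster.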
TIMELINE
QUESTIONS?
@FDEVILLAMIL

2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime

  • 1. MIGRATING A 130TB CLUSTER FROM ELASTICSEARCH 2 TO 5 IN 20 HOURS WITHOUT DOWNTIME FRED DE VILLAMIL @FDEVILLAMIL OCTOBER 2017
  • 2. ABOUT ME FRED DE VILLAMIL, FORMER DIRECTOR OF INFRASTRUCTURE @SYNTHESIO FIRST ELASTICSEARCH IN PRODUCTION WAS 0.17.6 LINUX / (FREE)BSD USER SINCE 1996, OPEN SOURCE CONTRIBUTOR SINCE 1998, LOVES COOL TECHS, TENNIS, PHOTOGRAPHY, CUTE OTTERS, INAPPROPRIATE HUMOR AND ELASTICSEARCH CLUSTERS OF UNUSUAL SIZE. WRITES ABOUT ES & MORE AT HTTPS://THOUGHTS.T37.NET
  • 3. ABOUT SYNTHESIO SYNTHESIO IS THE LEADING SOCIAL INTELLIGENCE TOOL FOR SOCIAL MEDIA MONITORING & SOCIAL ANALYTICS SYNTHESIO CRAWLS THE WEB FOR RELEVANT DATA, ENRICHES IT WITH SENTIMENT ANALYSIS AND DEMOGRAPHICS TO BUILD SOCIAL ANALYTICS DASHBOARDS.
  • 4. ELASTICSEARCH @SYNTHESIO 8 production clusters: • +600 hosts, all bare metal • 3 data centers • 1.7PB storage SSD / NVMe • 37.5TB RAM Hardware: • 6 core Xeon E5v3 or dual Xeon E5-2687Wv4 12 core (160 watts!!!) • 64GB to 256GB RAM • 4 x 800GB SSD / 2 x 1.2TB NVMe • RAID0 everywhere We aggregate data from various cold storage and make it searchable in a jiffy. Average cluster stats • writes: 85k documents / second, 1.5M in peak • 800 searches / second, with some clusters sustaining a continuous 25k searches / second • Doc size from 150KB to 200MB
  • 5. THE BLACKHOLE CLUSTER Topology • 68 data nodes • 3 master nodes • 6 ingest nodes • 200TB storage SSD • 2.4TB heap • 924 cores Cluster stats: • 1137 indices (daily) • 27266 shards • 130TB data • 201 billion documents • 7000 new documents / second • 800 searches / second on the whole dataset
  • 6. FEEDING BLACKHOLE FOR FUN AND PROFIT
  • 7. BLACKHOLE ALLOCATION SETTINGS "CLUSTER.ROUTING.ALLOCATION.NODE_INITIAL_PRIMARIES_RECOVERIES": 50 "CLUSTER.ROUTING.ALLOCATION.NODE_CONCURRENT_RECOVERIES": 20 "INDICES.RECOVERY.MAX_BYTES_PER_SEC": "2048MB" "INDICES.RECOVERY.CONCURRENT_STREAMS": "30" "CLUSTER.ROUTING.ALLOCATION.DISK.THRESHOLD_ENABLED": TRUE "CLUSTER.ROUTING.ALLOCATION.DISK.WATERMARK.LOW": "78%" "CLUSTER.ROUTING.ALLOCATION.DISK.WATERMARK.HIGH": "79%" "CLUSTER.ROUTING.REBALANCE.ENABLE": "ALL" "CLUSTER.ROUTING.ALLOCATION.CLUSTER_CONCURRENT_REBALANCE": 50 "CLUSTER.ROUTING.ALLOCATION.ALLOW_REBALANCE": "ALWAYS"
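
As a rough illustration of the settings on slide 7, the sketch below builds a JSON body for Elasticsearch's `PUT /_cluster/settings` endpoint. The values are copied from the slide; the lowercase keys follow Elasticsearch's actual setting names (the slide font renders everything in capitals). Whether the settings were applied as `transient` or `persistent` is not stated in the slides, so the `scope` parameter and the helper name `cluster_settings_body` are assumptions made for this example.

```python
import json

# Values copied from slide 7. Keys are lowercased to match the real
# Elasticsearch 2.x setting names.
BLACKHOLE_ALLOCATION_SETTINGS = {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 50,
    "cluster.routing.allocation.node_concurrent_recoveries": 20,
    "indices.recovery.max_bytes_per_sec": "2048mb",
    "indices.recovery.concurrent_streams": "30",
    "cluster.routing.allocation.disk.threshold_enabled": True,
    "cluster.routing.allocation.disk.watermark.low": "78%",
    "cluster.routing.allocation.disk.watermark.high": "79%",
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 50,
    "cluster.routing.allocation.allow_rebalance": "always",
}


def cluster_settings_body(settings, scope="transient"):
    """Return the JSON body for PUT /_cluster/settings.

    `scope` is an assumption: the slides do not say whether the settings
    were applied as transient or persistent.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return json.dumps({scope: settings}, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(cluster_settings_body(BLACKHOLE_ALLOCATION_SETTINGS))
```

The printed body can be sent with any HTTP client; nothing here contacts a cluster.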
  • 8. USING THE REINDEX API? REINDEX API: • NO SLICED SCROLL UNTIL ES 6.0 • SLOW • MIGHT LOSE SOME DOCUMENTS, NEEDS LOTS OF ERROR CONTROL LOGSTASH: • NO SLICED SCROLLS UNTIL ES 6.0 • FASTER THAN THE REINDEX API • REALLY DOESN’T LIKE ERRORS
  • 9. BEFORE UPGRADING • USE THE UPGRADE CHECK PLUGIN TO VALIDATE CURRENT INDEXES COMPATIBILITY • UPGRADE YOUR MAPPING TEMPLATES TO BE ES 5 COMPLIANT • CREATE THE NEXT 10 DAYS INDEXES (JUST IN CASE) • TELL YOUR HOSTING PROVIDER YOU’RE GOING TO TRANSFER 130TB IN 17 HOURS
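
To see why slide 9 suggests warning the hosting provider, here is a back-of-the-envelope calculation of the average throughput needed to move 130TB in 17 hours. It assumes decimal terabytes (10^12 bytes) and a perfectly even transfer rate, neither of which the slides specify; `required_throughput` is a name made up for this sketch.

```python
TB = 10**12  # assumption: decimal terabytes


def required_throughput(total_bytes, hours):
    """Return (bytes per second, bits per second) averaged over the window."""
    seconds = hours * 3600
    bytes_per_sec = total_bytes / seconds
    return bytes_per_sec, bytes_per_sec * 8


if __name__ == "__main__":
    byte_rate, bit_rate = required_throughput(130 * TB, 17)
    # prints "2.12 GB/s, 17.0 Gbit/s"
    print(f"{byte_rate / 1e9:.2f} GB/s, {bit_rate / 1e9:.1f} Gbit/s")
```

That is an aggregate figure across the whole cluster, sustained for the full 17 hours.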
  • 10. EXPANDING BLACKHOLE OPS: • +90 NEW SERVERS IN 2 NEW RACKS • RAISED THE REPLICATION FACTOR TO 3 RESULT: • 167 NODES • 53626 SHARDS • 279TB DATA • 391TB STORAGE • 5.42TB HEAP • 2004 CORES
  • 11. SETTINGS UPDATE DURING THE REPLICA INIT "INDICES.RECOVERY.MAX_BYTES_PER_SEC": "4096MB" "INDICES.RECOVERY.CONCURRENT_STREAMS": "50" "CLUSTER.ROUTING.ALLOCATION.DISK.WATERMARK.LOW": "98%" "CLUSTER.ROUTING.ALLOCATION.DISK.WATERMARK.HIGH": "99%" "CLUSTER.ROUTING.REBALANCE.ENABLE": "NONE"
  • 12. PROBLEMS • THE TRANSFER BRINGS THE WHOLE CLUSTER TO ITS KNEES. • THIS SLOWS DOWN THE WRITES. • THE BULK THREAD POOL STARTS TO FILL UP.
  • 13. SOLUTION: ZONING FOR FUN & PROFIT • ALLOCATE THE FRESHEST DATA AND ONGOING INDEXING TO ONE ZONE • SEGREGATE EVERYTHING ELSE IN A DIFFERENT ZONE • WAIT FOR THE CLUSTER TO CALM DOWN • TOTAL TIME SPENT ON THE TRANSFER: 17 HOURS
  • 14. SPLITTING THE CLUSTER IN 2 • SET "CLUSTER.ROUTING.ALLOCATION.ENABLE" TO "ALL" • SHUT DOWN 2 OF THE RACKS • SHUT DOWN ONE OF THE MASTERS • SWITCH THE NUMBER OF REPLICAS TO 1
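
The sketch below compares the slide 11 values with the baseline from slide 7 to make the intent of the change easier to read: recovery limits are raised, the disk watermarks are pushed close to 100%, and rebalancing is switched off. Both dictionaries are copied from the slides with lowercased keys; the helper name `changed_settings` is an assumption made for this example.

```python
# Baseline values for the five affected keys (slide 7).
BASELINE = {
    "indices.recovery.max_bytes_per_sec": "2048mb",
    "indices.recovery.concurrent_streams": "30",
    "cluster.routing.allocation.disk.watermark.low": "78%",
    "cluster.routing.allocation.disk.watermark.high": "79%",
    "cluster.routing.rebalance.enable": "all",
}

# Values used while the new replicas were initializing (slide 11).
REPLICA_INIT = {
    "indices.recovery.max_bytes_per_sec": "4096mb",
    "indices.recovery.concurrent_streams": "50",
    "cluster.routing.allocation.disk.watermark.low": "98%",
    "cluster.routing.allocation.disk.watermark.high": "99%",
    "cluster.routing.rebalance.enable": "none",
}


def changed_settings(baseline, override):
    """Map each overridden key to its (old, new) pair when the value differs."""
    return {
        key: (baseline.get(key), new)
        for key, new in override.items()
        if baseline.get(key) != new
    }


if __name__ == "__main__":
    for key, (old, new) in changed_settings(BASELINE, REPLICA_INIT).items():
        print(f"{key}: {old} -> {new}")
```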
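
One way to express the zoning from slide 13 is Elasticsearch's shard allocation filtering, where an index setting such as `index.routing.allocation.require.<attribute>` pins an index to nodes carrying a matching custom attribute. The slides do not show the actual configuration, so everything specific below is an assumption: the attribute name `zone`, the zone values `fresh` and `archive`, the `blackhole-YYYY.MM.DD` index naming, and the two-day freshness window.

```python
from datetime import date, datetime, timedelta

INDEX_PREFIX = "blackhole-"  # hypothetical daily index naming scheme


def zone_for_index(index_name, today, fresh_days=2):
    """Pick a zone for a daily index based on the date in its name."""
    if not index_name.startswith(INDEX_PREFIX):
        raise ValueError(f"unexpected index name: {index_name}")
    day = datetime.strptime(index_name[len(INDEX_PREFIX):], "%Y.%m.%d").date()
    # Indices dated in the future (pre-created) also count as fresh.
    is_fresh = (today - day) < timedelta(days=fresh_days)
    return "fresh" if is_fresh else "archive"


def allocation_settings(index_name, today, fresh_days=2):
    """Body for PUT /<index>/_settings pinning the index to one zone."""
    zone = zone_for_index(index_name, today, fresh_days)
    return {"index.routing.allocation.require.zone": zone}


if __name__ == "__main__":
    today = date(2017, 1, 10)
    for name in ("blackhole-2017.01.10", "blackhole-2017.01.08"):
        print(name, allocation_settings(name, today))
```

Keeping the write-heavy indices on their own nodes means the bulk recovery traffic for older indices does not compete with indexing for the same disks and thread pools.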
  • 15. BUILDING BLACKHOLE02 • RECONFIGURE THE 2 SHUT-DOWN RACKS AND THE MASTER SO THEY TALK TO EACH OTHER • START THE MASTER ALONE AND CLOSE THE INDEXES • UPGRADE THE MASTER TO ES 5.1.1 • UPGRADE ALL THE PLUGINS • START THE MASTER: THE WHOLE UPGRADE TOOK 32 SECONDS
  • 16. BRINGING BACK THE DATA • UPGRADE ES AND THE PLUGINS ON THE DATA NODES • START ELASTICSEARCH • WAIT 30 MINUTES FOR THE CLUSTER TO GO BACK TO GREEN • PLUG IN A WORK UNIT TO CATCH UP ON THE PAST 18 HOURS OF DATA • UPDATE THE LOAD BALANCER CONFIGURATION TO USE THE NEWLY UPGRADED CLUSTER
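
The "wait for the cluster to go back to green" step on slide 16 can be automated by polling `GET /_cluster/health` and checking the `status` field of the response. This is a minimal sketch rather than the tooling used for the migration: the base URL, the polling interval, and the function names are assumptions, and the health fetcher is injected so the loop can be exercised without a live cluster.

```python
import json
import time
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:9200"):
    """Fetch the cluster health document (base_url is a placeholder)."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def wait_for_green(get_health, max_attempts=60, delay_s=30.0, sleep=time.sleep):
    """Poll until the cluster status is green; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        if get_health()["status"] == "green":
            return attempt
        if attempt < max_attempts:
            sleep(delay_s)
    raise TimeoutError(f"cluster not green after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated health responses, so the example runs without a cluster.
    fake = iter([{"status": "red"}, {"status": "yellow"}, {"status": "green"}])
    attempts = wait_for_green(lambda: next(fake), sleep=lambda _s: None)
    print(f"green after {attempts} checks")
```

Against a real cluster, `wait_for_green(fetch_health)` would poll every 30 seconds for up to 30 minutes with the defaults above.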