OpenTSDB for monitoring @ Criteo
Nathaniel Braun
Overview and case studies of infrastructure & service monitoring using OpenTSDB @ Criteo
1.
OpenTSDB for monitoring @ Criteo
Nathaniel Braun
Thursday, April 28th, 2016
2.
Hitchhiker’s guide to this presentation
• Overview of Hadoop @ Criteo
• Our experimental cluster
• Rationale for OpenTSDB
• Stabilizing & scaling OpenTSDB
• OpenTSDB to the rescue in practice
Copyright © 2016 Criteo
3.
Overview of Hadoop @ Criteo
4.
Overview of Hadoop @ Criteo
Criteo’s 8 Hadoop clusters – running CDH Community Edition
• Amsterdam AM5 – PROD
• Paris PA4 – PROD / PREPROD
• Paris PA3 – PREPROD / EXP
• Sunnyvale SV6 – PROD NA
• Tokyo TY5 – PROD AS
• Hong Kong HK5 – PROD CN
5.
Overview of Hadoop @ Criteo – Production AM5
AM5: main production cluster
• In use since 2011
• Running CDH3 initially, CDH4 currently
• 1118 DataNodes
• 13 400+ compute cores
• 39 PB of raw disk storage
• 105 TB of RAM capacity
• 40 TB of data imported every day, mostly through HTTPFS
• 100 000+ jobs run daily
6.
Overview of Hadoop @ Criteo – Production PA4
PA4: comparable to AM5, with fewer machines
• Migration done in Q4 2015 – H1 2016
• Running CDH5
• 650+ DataNodes
• 15 600+ compute cores
• 54 PB of raw disk storage
• 143 TB of RAM capacity
• Huawei servers (AM5 is HP-based)
7.
Overview of Hadoop @ Criteo – Production local clusters
Criteo has 3 local production Hadoop clusters
• Sunnyvale (SV6): 20 nodes
• Tokyo (TY5): 35 nodes
• Hong Kong (HK5): 20 nodes
8.
Overview of Hadoop @ Criteo – Preproduction clusters
Criteo has 3 preproduction Hadoop clusters
• Preprod PA3: 54 nodes, running CDH4
• Preprod PA4: 42 nodes, running CDH5
• Experimental: 53 nodes, running CDH5
9.
Overview of Hadoop @ Criteo – Usage
Types of jobs running on our clusters
• Cascading jobs, mostly for joins between different types of logs (e.g. displays & clicks)
• Pure Map/Reduce jobs for recommendation, Hadoop streaming jobs for learning
• Scalding jobs for analytics
• Hive queries for Business Intelligence
• Spark jobs on CDH5
10.
Overview of Hadoop @ Criteo – Special considerations
• Kerberos for security
• High availability on NameNodes and ResourceManager (CDH5 only)
• Infrastructure installed & maintained with Chef
11.
Overview of Hadoop @ Criteo
How can we monitor this complex infrastructure and the services running on top of it?
12.
Our experimental cluster
13.
Our experimental cluster – Purpose
• Useful for testing infrastructure changes without impacting users (no SLA)
• Test environment for new technologies
• HBase
  o Natural joins
  o OpenTSDB for metrology & monitoring
  o hRaven for detailed job data (not used anymore)
• Spark, now in production @ PA4
14.
Our experimental cluster – HBase features
• Based on Google’s Bigtable paper
• Integrated with the Hadoop stack
• Stores data in rows sorted by row key
• Uses regions as ordered sets of rows
• Regions are sharded by row key bounds
• Regions are managed by region servers, collocated with DataNodes (data is stored on HDFS)
• Oversized regions are split into two regions
• Values are stored in columns, with no fixed schema, unlike in an RDBMS
• Columns are grouped in column families
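The way sorted row keys map to regions and region servers can be illustrated with a small Ruby sketch. The region and server names echo the labels used later in the deck (R0, R1, R5, RS1, RS2), but the start keys and the region-to-server assignment are invented for illustration; this is not how HBase itself stores its region metadata.

```ruby
# Hypothetical region table: names echo the slides, boundaries are invented.
Region = Struct.new(:name, :start_key, :server)

REGIONS = [
  Region.new('R0', '',    'RS1'), # empty start key: covers from the first row
  Region.new('R1', 'CCC', 'RS1'),
  Region.new('R5', 'XXX', 'RS2')
].freeze

# Regions are sorted by start key, so the region owning a row key is the
# last one whose start key is <= that row key.
def region_for(row_key, regions = REGIONS)
  regions[regions.rindex { |region| region.start_key <= row_key }]
end

%w[AAA CCC DDD ZZZ].each do |key|
  region = region_for(key)
  puts "#{key} -> #{region.name} on #{region.server}"
end
```

Because rows are sorted, a range scan over adjacent keys touches few regions, which is the property the OpenTSDB schema relies on.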
17.
Our experimental cluster – HBase architecture
Columns: Row key (user UID); CF0: user (C0: IP, C2: browser, C3: e-mail); CF1: event (C0: time, C1: type, C2: web site)
AAA  value  Firefox  NULL          Click    Client #0
BBB  value  Chrome   NULL          Click    Client #0
CCC  value  Chrome   ccc@mail.com  Display  Client #1
DDD  value  IE       NULL          Sales    Client #2
EEE  value  IE       NULL          Display  Client #0
FFF  value  IE       NULL          Display  Client #3
∙∙∙
XXX  value  Firefox  NULL          Sales    Client #4
YYY  value  Chrome   NULL          Bid      Client #5
ZZZ  value  Opera    zzz@mail.com  Click    Client #5
Annotations: regions R0, R1, R5; region servers RS1, RS2
18.
Our experimental cluster – HBase @ Criteo
HBase on the experimental cluster
• 50 region servers
• 44 000+ regions
• ~90 000 requests / second from OpenTSDB
19.
Rationale for OpenTSDB on
20.
Rationale for using OpenTSDB – Infrastructure monitoring
Metrics to monitor:
• CPU load
• Processes & threads
• RAM available/reserved
• Free/used disk space
• Network statistics
• Sockets open/closed
• Open connections with their statuses
• Network traffic
22.
Rationale for using OpenTSDB – Service monitoring
• YARN: NodeManagers, ResourceManagers
• HDFS: DataNodes, NameNodes, JournalNodes
• ZooKeeper
• Kerberos
• HBase
• Kafka
• Storm
Huge diversity of services!
23.
Rationale for using OpenTSDB – Scale
• Diversity
  • Many types of nodes & services
  • Must be easy to extend with new metrics
• Scale
  • > 2 500 servers
  • ~ 90 000 requests / second
• Storage
  • Keep fine-grained resolution (down to the minute, at least)
  • Long-term storage for analysis & investigation
24.
Rationale for using OpenTSDB – Solution
• Suits the problem well: “Hadoop for monitoring Hadoop”
• Designed for time series: the HBase schema is optimized for time series queries
• Scalable and resilient, thanks to HBase
• Easily extensible: writing a data collector is simple
• Simple to query
25.
Rationale for using OpenTSDB – Easy to query

    require 'net/http'
    require 'json'
    require 'uri'

    uri = URI.parse("http://0.rtsd.hpc.criteo.preprod:4242/api/query")
    http = Net::HTTP.start(uri.hostname, uri.port)
    http.read_timeout = 300 # seconds

    # /api/query expects 'queries' to be an array of sub-queries
    params = {
      'start' => '2016/04/21-10:00:00',
      'end' => '2016/04/21-12:00:00',
      'queries' => [{
        'aggregator' => 'min',
        'downsample' => '5m-min',
        'metric' => 'hadoop.resourcemanager.queuemetrics.root.AllocatedMB',
        'tags' => {
          'cluster' => 'ams',
          'host' => 'rm.hpc.criteo.prod'
        }
      }]
    }

    request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
    request.body = params.to_json
    response = http.request(request)
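A query like this one returns a JSON array with one object per resulting time series, each carrying its data points under "dps" as a timestamp-to-value map. The sketch below parses such a body; the sample values are invented for illustration and the helper name data_points is hypothetical.

```ruby
require 'json'

# Sample body shaped like an OpenTSDB /api/query response (values invented).
SAMPLE_BODY = <<~JSON
  [{"metric": "hadoop.resourcemanager.queuemetrics.root.AllocatedMB",
    "tags": {"cluster": "ams", "host": "rm.hpc.criteo.prod"},
    "aggregateTags": [],
    "dps": {"1461233100": 2048, "1461232800": 1024, "1461233400": 512}}]
JSON

# Flatten every series into [time, value] pairs, sorted by time.
def data_points(body)
  JSON.parse(body).flat_map do |series|
    series.fetch('dps').map { |ts, value| [Time.at(ts.to_i).utc, value] }
  end.sort_by(&:first)
end

data_points(SAMPLE_BODY).each do |time, value|
  puts "#{time.strftime('%H:%M')} #{value}"
end
```

In a real script, the argument would be response.body from the request above.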
30.
Rationale for using OpenTSDB – Practical UI
[Screenshot of the OpenTSDB UI, annotated: Time range, Metric, Tag keys/values, Aggregator]
32.
Rationale for using OpenTSDB – Design
• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors
• Some TSDs are used for writing, others for reading, while tcollectors collect metrics
• TSDs are stateless
• TSDs use asyncHBase to scale
• Quiz: what are the advantages?
1. Clients never interact with HBase directly
2. Simple protocol → easy to use & extend
3. No state, no synchronization → great scalability
34.
Rationale for using OpenTSDB – Metrics
• Metrics consist of:
  • metric name
  • UNIX timestamp
  • value (64-bit integer or single-precision floating-point value)
  • tags (key-value pairs) specific to that metric instance
• Tags are useful for aggregations on time series
  proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod
• Charts: 15-minute load average with the count aggregator (a proxy for machine count)
• Quiz: what is the chart below?
[Charts: proc.loadavg.15min; proc.loadavg.15min cluster=*]
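As a rough illustration of the data point format above, the Ruby sketch below parses the example line, rebuilds it as a line for the TSD's text protocol (which prefixes the data point with the put command), and mimics the count aggregator by counting series per timestamp. The names DataPoint, parse_point, to_put_line and count_by_timestamp are hypothetical helpers, not part of OpenTSDB or tcollector.

```ruby
DataPoint = Struct.new(:metric, :timestamp, :value, :tags)

# Parse "<metric> <timestamp> <value> <tagk=tagv> ..." into a DataPoint.
def parse_point(line)
  metric, timestamp, value, *tag_pairs = line.split
  number = value.match?(/\A-?\d+\z/) ? Integer(value, 10) : Float(value)
  tags = tag_pairs.map { |pair| pair.split('=', 2) }.to_h
  DataPoint.new(metric, Integer(timestamp, 10), number, tags)
end

# Build the line a collector would send to a TSD: "put" plus the data point.
def to_put_line(point)
  tags = point.tags.sort.map { |k, v| "#{k}=#{v}" }.join(' ')
  "put #{point.metric} #{point.timestamp} #{point.value} #{tags}"
end

# Counting series per timestamp approximates the number of reporting machines.
def count_by_timestamp(points)
  points.group_by(&:timestamp).transform_values(&:size)
end

point = parse_point('proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod')
puts to_put_line(point)
```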
35.
Rationale for using OpenTSDB – HBase schema
• A single data table (split in regions), named tsdb
• Row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
  • timestamp is rounded down to the hour
  • This schema helps group data from the same metric & time bucket close together (HBase sorts rows based on the row key)
  • Assumption: query first on time range, then metric, then tags, in that order of preference
  • Tag keys are sorted lexicographically
  • Tags should be limited, because they are in the row key. Usually fewer than 5 tags.
• Values are stored in columns
  • Column name: 2 or 4 bytes. For 2 bytes:
    • Encode an offset of up to 3 600 seconds → 2^12 = 4096 → 12 bits
    • 4 bits left for format/type
• Other tables, for metadata and name ↔ ID mappings
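The row key and 2-byte column name layout can be sketched in Ruby as follows. The 3-byte UID width and 4-byte timestamp are assumptions based on OpenTSDB's defaults, and the UID values themselves are made up; the helper names are hypothetical.

```ruby
UID_WIDTH = 3 # bytes per metric/tagk/tagv UID (assumed OpenTSDB default)

# Big-endian UID truncated to UID_WIDTH bytes (valid for uid < 2**24).
def uid_bytes(uid)
  [uid].pack('N')[1, UID_WIDTH]
end

# <metric_uid><hour-aligned timestamp><tagk><tagv>..., tags sorted by key UID.
def row_key(metric_uid, timestamp, tag_uids)
  base_time = timestamp - (timestamp % 3600)
  key = uid_bytes(metric_uid) + [base_time].pack('N')
  tag_uids.sort.each { |tagk, tagv| key << uid_bytes(tagk) << uid_bytes(tagv) }
  key
end

# 2-byte column name: 12 bits of offset within the hour, 4 bits of flags.
def qualifier(timestamp, flags)
  offset = timestamp % 3600
  [(offset << 4) | (flags & 0x0F)].pack('n')
end

key = row_key(1, 1_461_781_436, { 2 => 7, 1 => 5 })
puts key.unpack1('H*')
puts qualifier(1_461_781_436, 0).unpack1('H*')
```

All points of one metric within the same hour and with the same tags share a row key, and differ only by column name.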
36.
Rationale for using OpenTSDB – HBase schema
• Hexadecimal representation of a row key, with two tags
• Sorted row keys for the same metric: 000001
• Note: row key size varies across rows, because of tags
39.
Rationale for using OpenTSDB – Statistics
Quiz: what should we look for?
• 367 513 metrics
• 30 tag keys (!)
• 86 194 tag values
40.
Stabilizing & scaling OpenTSDB
42.
Scaling OpenTSDB
OpenTSDB was hard to scale at first. What problem can you see?
We’re missing data points
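Missing data points like these can also be detected programmatically rather than by eye. The following Ruby sketch flags gaps in a list of timestamps, assuming a known collection interval; the 60-second interval and the find_gaps helper are illustrative assumptions, not something described in the deck.

```ruby
# Return [previous, next] timestamp pairs that are further apart than the
# expected collection interval, i.e. where points are missing in between.
def find_gaps(timestamps, expected_interval)
  timestamps.sort.each_cons(2).select { |a, b| b - a > expected_interval }
end

timestamps = [0, 60, 120, 300, 360] # invented sample, one point per minute
find_gaps(timestamps, 60).each do |from, to|
  puts "missing points between #{from} and #{to}"
end
```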
43.
Scaling OpenTSDB – Lessons learned
• Analyze all the layers of the system
• Logs are your friends
• Change parameters one by one, not all at once
• Measure, change, deploy, measure. Rinse, repeat
44.
Scaling OpenTSDB – Nifty trick
Varnish & OpenResty save the day
[Diagram, three instances of each component: OpenResty (POST -> GET), Varnish (cache + LB), RTSD (read OpenTSDB)]
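The POST -> GET step presumably exists so that identical queries map to identical URLs that Varnish can cache. In the deck this translation is done by OpenResty; the Ruby sketch below only illustrates the mapping from a JSON query body to OpenTSDB's GET query string (m=aggregator:downsample:metric{tags}), and the helper names are hypothetical.

```ruby
require 'json'
require 'uri'

# Build the "m" parameter: <aggregator>:[<downsample>:]<metric>{tagk=tagv,...}
def m_param(query)
  parts = [query.fetch('aggregator')]
  parts << query['downsample'] if query['downsample']
  tags = query.fetch('tags', {}).sort.map { |k, v| "#{k}=#{v}" }.join(',')
  metric = query.fetch('metric')
  metric += "{#{tags}}" unless tags.empty?
  (parts << metric).join(':')
end

# Turn a POST body for /api/query into an equivalent GET path.
def post_body_to_get_path(body)
  params = JSON.parse(body)
  pairs = [['start', params.fetch('start')]]
  pairs << ['end', params['end']] if params['end']
  params.fetch('queries').each { |query| pairs << ['m', m_param(query)] }
  "/api/query?#{URI.encode_www_form(pairs)}"
end

body = {
  'start' => '2016/04/21-10:00:00',
  'end' => '2016/04/21-12:00:00',
  'queries' => [{ 'aggregator' => 'min', 'downsample' => '5m-min',
                  'metric' => 'hadoop.resourcemanager.queuemetrics.root.AllocatedMB',
                  'tags' => { 'host' => 'rm.hpc.criteo.prod', 'cluster' => 'ams' } }]
}.to_json
puts post_body_to_get_path(body)
```

Sorting the tags gives equivalent queries the same URL, which should improve the cache hit rate.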
46.
OpenTSDB to the rescue in practice
51.
OpenTSDB to the rescue in practice – Easier to use than logs
[Chart: hadoop.namenode.fsnamesystem.tag.HAState]
Two NameNode failovers in one night!
• Hard to spot: in the morning, nothing has changed
• Would be impossible to see with daily aggregation
• Trivia: we fixed the tcollector to get that metric
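Failovers of this kind show up as changes in the HAState series, so they can be listed by scanning for value transitions. The sketch below assumes the state is reported as a number (here 1 for active and 0 for standby, which is a hypothetical encoding) and uses invented sample points.

```ruby
# points: array of [timestamp, state] pairs.
# Returns one entry per change of state between consecutive points.
def transitions(points)
  points.sort_by(&:first).each_cons(2)
        .reject { |(_, before), (_, after)| before == after }
        .map { |(_, from), (ts, to)| { at: ts, from: from, to: to } }
end

# Invented sample: the node loses the active role, then gets it back.
states = [[0, 1], [60, 1], [120, 0], [180, 0], [240, 1]]
transitions(states).each do |change|
  puts "state #{change[:from]} -> #{change[:to]} at #{change[:at]}"
end
```

With daily aggregation, both transitions would collapse into a single value, which is why fine-grained resolution matters here.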
56.
OpenTSDB to the rescue in practice – Investigation
[Chart: hadoop.nodemanager.direct.TotalCapacity]
Annotations:
• Huge memory capacity spike
• Node not reporting points
• Another huge spike
• No data
57-60.
OpenTSDB to the rescue in practice – Superimpose charts
Charts: hadoop.nodemanager.direct.TotalCapacity, hadoop.nodemanager.jvmmetrics.GcTimeMillis
Chart annotations:
• Service restart – configuration change
• Service restart – OOM
Log extract: NodeManager configured with 192 GB physical memory allocated to containers, which is more than 80% of the total physical memory available (89 GB)
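The arithmetic behind the log extract can be checked in a few lines. This is only a sketch of the comparison the message describes, using the two figures from the log. The function name is hypothetical and this is not the NodeManager's own code.

```python
def container_memory_check(configured_gb, physical_gb, threshold=0.8):
    """Return (is_over_limit, limit_gb) for a container memory setting.

    threshold is the fraction of physical memory named in the log message.
    """
    limit_gb = threshold * physical_gb
    return configured_gb > limit_gb, limit_gb


# Figures taken from the log extract: 192 GB configured, 89 GB physical.
over_limit, limit_gb = container_memory_check(192, 89)
print(over_limit, round(limit_gb, 1))
```

The configured value is not just above the 80% mark, it is more than twice the physical memory of the machine, which is consistent with the OOM restart marked on the chart.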
61-63.
OpenTSDB to the rescue in practice – Hiccups
Charts: hadoop.nodemanager.direct.TotalCapacity, hadoop.nodemanager.jvmmetrics.GcTimeMillis
Chart annotations:
• OpenTSDB problem – not node-specific
• Node probably dead
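The distinction between the two kinds of hiccup can be sketched as a simple rule: a gap seen on every node at once points at the collection or storage side, while a gap on a single node points at that node. The host names and time buckets below are hypothetical, and real data would need some tolerance for late points.

```python
def classify_gaps(buckets_by_host, buckets):
    """Classify each time bucket in which at least one host has no data.

    buckets_by_host: dict mapping host -> set of buckets that contain points.
    buckets: ordered list of all expected buckets.
    """
    result = {}
    for bucket in buckets:
        missing = sorted(h for h, seen in buckets_by_host.items() if bucket not in seen)
        if not missing:
            continue
        if len(missing) == len(buckets_by_host):
            result[bucket] = ("global", missing)
        else:
            result[bucket] = ("node-specific", missing)
    return result


# Hypothetical data: every host misses bucket 120, only host "c" misses bucket 60.
seen = {
    "a": {0, 60, 180},
    "b": {0, 60, 180},
    "c": {0, 180},
}
gaps = classify_gaps(seen, [0, 60, 120, 180])
print(gaps)
```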
64-66.
OpenTSDB to the rescue in practice – NameNode rescue
Chart: hadoop.namenode.fsnamesystem.BlocksTotal
Chart annotations, in order:
• File deletion
• File deletion
• File creation
67-69.
OpenTSDB to the rescue in practice – NameNode rescue
Charts: hadoop.namenode.fsnamesystem.BlocksTotal, hadoop.namenode.fsnamesystem.FilesTotal
Chart annotation: Slope
Be careful about the scale!
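One reading of the scale warning is that two superimposed series can have very different absolute slopes while growing at the same relative rate. The sketch below uses made-up numbers for both metrics and a plain first-to-last slope, which is an assumption and not the method used for the slides.

```python
def slope_per_hour(points):
    """Average slope between the first and last point, in units per hour.

    points: list of (timestamp_seconds, value) tuples, sorted by timestamp.
    """
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / ((t1 - t0) / 3600.0)


# Hypothetical values, not taken from the charts.
blocks = [(0, 1_000_000), (3600, 1_020_000), (7200, 1_040_000)]
files = [(0, 500_000), (3600, 510_000), (7200, 520_000)]

blocks_slope = slope_per_hour(blocks)
files_slope = slope_per_hour(files)

# Relative growth per hour, normalised by the starting value of each series.
blocks_relative = blocks_slope / blocks[0][1]
files_relative = files_slope / files[0][1]
print(blocks_slope, files_slope, blocks_relative, files_relative)
```

Here the block count climbs twice as fast as the file count in absolute terms, yet both grow by the same fraction per hour, so comparing slopes by eye across two y-axes can mislead.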
70-74.
OpenTSDB to the rescue in practice – NameNode rescue
Chart: hadoop.namenode.fsnamesystemstate.NumLiveDataNodes
Quiz: what is this pattern?
• Answer: NameNode checkpoint
• Note: done at regular intervals
• Trivia: never do a failover during a checkpoint!
75-79.
OpenTSDB to the rescue in practice – NameNode rescue
Chart: hadoop.namenode.fsnamesystemstate.NumLiveDataNodes
Quiz: what is the problem?
• Answer: no NameNode checkpoint → no FS image!
• Follow-up: the standby NameNode could not start up after a failover, because its FS image was too old
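Since checkpoints happen at regular intervals, their absence can be turned into a simple staleness check. The sketch below assumes the checkpoint timestamps have already been extracted, from the chart pattern or from another source. The function name, the interval and the tolerance factor are all hypothetical parameters.

```python
def checkpoint_is_stale(checkpoint_times, now, expected_interval_s, tolerance=2.0):
    """Return True when no checkpoint has been seen for too long.

    checkpoint_times: timestamps (seconds) of observed checkpoints.
    tolerance: how many expected intervals may pass before raising a flag.
    """
    if not checkpoint_times:
        return True
    age_s = now - max(checkpoint_times)
    return age_s > tolerance * expected_interval_s


# Hypothetical timelines with an assumed one-hour checkpoint interval.
healthy = checkpoint_is_stale([0, 3600, 7200], now=9000, expected_interval_s=3600)
stale = checkpoint_is_stale([0, 3600], now=20000, expected_interval_s=3600)
print(healthy, stale)
```

A check like this would flag the missing checkpoints before a failover exposes an FS image that is too old.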
80.
Criteo ♥ BigData
- Very accessible: only 50 euros, which will be given to charity
- Speakers from leading organizations: Google, Spotify, Mesosphere, Criteo …
https://www.eventbrite.co.uk/e/nabdc-not-another-big-data-conference-registration-24415556587
81.
Criteo is hiring! http://labs.criteo.com/