Actionable Metrics
           Enabling Decision-Making in Netflix’s Decentralized Environment

                                      Cloud Tech III
                                     October 6, 2012
                                      Roy Rapoport
                               @royrapoport, rsr@netflix.com

Thursday, October 18, 12
Me

                     • Been in tech for about 20 years
                     • Systems engineering, networking, software development, QA, release management
                     • Time at Netflix: 1195 days (3y:3m:1w)
                     • (Current) job at Netflix: Make things better
                           (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)




Metrics Humor



                       % of instances with even public IP addresses




Technology Overview
                     • SoA, REST, Mostly Java
                     • Simple overall architecture:




Culture Overview
     • Freedom and Responsibility
     • Distributed Operations
     • Get out of the way of Developers



The Metric Lifecycle

                     •     Send
                     • Look
                     • Alert

Systems

                     • Flexible
                     • Scalable
                     • Self-Service


Telemetry
                             Flexible, Scalable, Self-Service
                   import netflix.metrics
                   [...]
                       self.nm = netflix.metrics.Metrics("core_cag")
                   [...]
                   def api(self):
                       self.nm.nfCounter("api")
                       [...]
                       self.nm.nfCounter("application_%s" % application)
                   [...]
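
To make the self-service idea concrete, here is a minimal sketch of a counter client in the same shape as the snippet above. The `Metrics` class below is a hypothetical in-memory stand-in, not the real `netflix.metrics` module, and the `amount` parameter is an assumption. Joining the namespace and counter name with an underscore is inferred from the `core_cag_api` metric name that shows up in the later slides.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for a counter client (not netflix.metrics)."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def nfCounter(self, name, amount=1):
        # A metric exists as soon as it is first incremented;
        # nothing has to be registered or requested up front.
        self.counts["%s_%s" % (self.namespace, name)] += amount


nm = Metrics("core_cag")
for application in ["www", "api", "www"]:
    nm.nfCounter("api")                            # one count per API call
    nm.nfCounter("application_%s" % application)   # one count per calling application

print(dict(nm.counts))
```

The point of the design is that a developer adds a new metric by adding one line of code, with no ticket and no central schema.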




Visualization
                           Flexible, Scalable, Self-Service




Alerting
                           Flexible, Scalable, Self-Service



     • Static vs Dynamic Thresholds
     • Compare to history
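
A rough sketch of the difference between the two threshold styles named above. The function names, the sample numbers, and the "mean plus or minus three standard deviations" rule are all assumptions chosen for illustration; the slides do not say how the comparison to history is actually computed.

```python
from statistics import mean, stdev


def static_alert(value, limit):
    """Static threshold: fire when the value exceeds a fixed, hand-picked limit."""
    return value > limit


def dynamic_alert(value, history, sigmas=3.0):
    """Dynamic threshold: fire when the value is far from what history suggests."""
    mu = mean(history)
    sd = stdev(history)
    return abs(value - mu) > sigmas * sd


# Hypothetical request counts seen at the same hour on previous days.
history = [100, 104, 98, 101, 97, 103, 99]

for value in (102, 150, 20):
    print(value, static_alert(value, limit=200), dynamic_alert(value, history))
```

With a static limit of 200, neither the jump to 150 nor the drop to 20 fires, while the history-based check flags both and stays quiet for 102.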




For Example ...
                           Last 3 hours’ core_tools.core_cag_api




                                         What the ...




For Example ...
                                  Visualization (Continued)

                           Last 4 days’ core_tools.core_cag_api




                                    even more questions!



For Example ...
                                   Visualization (Continued)

                           Last 10 days’ core_tools.core_cag_api




                                   What caused the spike?


For Example ...
                                 Visualization (Continued)

                           Show alert volume per application




                             Someone had a rough few days...


Don’t Like Surprises...
                 {
                           "alerts": [
                               {
                                   "applyTo": "cluster",
                                   "condition": {
                                       "minPercent": 90.0,
                                       "noise": 0.2,
                                       "maxPercent": 25.0,
                                       "type": "DoubleExponential"
                                   },
                                   "metricName": "core_cag_api",
                                   "severity": "major"
                               }
                           ],
                           "clusters": [
                               "core_tools"
                           ]
                 }
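
The `"DoubleExponential"` type in the config above presumably refers to double exponential smoothing, which tracks both a level and a trend and so can predict the next value of a drifting metric. Below is a minimal sketch of the textbook (Holt) form of that technique. The slides do not spell out what `minPercent`, `maxPercent`, or `noise` mean, so the sketch does not try to reproduce them; `alpha`, `beta`, and `tolerance` are hypothetical parameters of this example only.

```python
def holt_forecast(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecast using double exponential (Holt) smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two points to estimate a trend")
    level = series[0]
    trend = series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        # Blend the new observation with the previous prediction.
        level = alpha * x + (1 - alpha) * (level + trend)
        # Blend the newly observed slope with the previous trend estimate.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend


def deviates(observed, predicted, tolerance=0.25):
    """True when the observation is more than `tolerance` (a fraction) away from the prediction."""
    return abs(observed - predicted) > tolerance * abs(predicted)


series = [10, 12, 14, 16, 18]        # hypothetical per-minute counts
predicted = holt_forecast(series)
print(predicted, deviates(21, predicted), deviates(40, predicted))
```

Because the prediction follows the metric's own recent behaviour, nobody has to pick a fixed number per metric, which is what makes this style of alert workable as self-service.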




Threshold Tuning


                     • An Abbreviated History ...



Threshold Tuning
                                               (in the beginning)

                     • Systems owned by IT
                     • Want an alert? Submit a ticket
                     • Want to tune an alert? Submit a ticket
                    Some priests offer their prayers to alien creatures best left
                    forgotten. This ill-advised worship twists their minds in odd
                    ways. Overlords find these warped men useful due to the
                    unnatural powers they can channel. The dark priests most
                    favored by their strange gods have powerful protections, and
                    defeating one of them is sure to bring down a terrible curse
                    upon the victor.
                      - http://www.descentinthedark.com/_d_/dark_priests.php


Threshold Tuning
                                        (It gets better)

                     • You get to configure your own threshold
                     • Freedom!
                     • Also, you have to configure your own thresholds




Threshold Tuning
                                  (Are we there yet?)

                     • Play with historical data
                     • Huge difference
                     • Still falls short



Threshold Tuning
                               (Yeah, that’s the ticket)

                     • Computers can be good at this




If Time Allows ...



Events vs Metrics

                     • Irregular Interval
                     • Point in time
                     • Lack magnitude
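
One way to picture the three differences listed above is as two record types. This is only an illustrative sketch: the class names, field names, and sample data are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSample:
    name: str
    timestamp: float    # reported on a regular interval
    value: float        # has a magnitude


@dataclass(frozen=True)
class Event:
    source: str
    timestamp: float    # a single point in time, whenever it happens
    description: str    # says what happened; there is no magnitude


samples = [MetricSample("core_cag_api", t, v)
           for t, v in [(0, 41.0), (60, 39.0), (120, 44.0)]]
events = [Event("deploy", 17.0, "pushed new build"),
          Event("config", 95.5, "changed alert threshold")]

# Metric samples arrive evenly spaced; events do not.
sample_gaps = {b.timestamp - a.timestamp for a, b in zip(samples, samples[1:])}
print(sample_gaps, [e.timestamp for e in events])
```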


Why Build It?

                     • Change management
                           •   Vs Change control

                     • What Changed?
                     • Better Alerting

Chronos
                     • Rapidly Prototyped
                     • Adapters and reporters
                     • Easy querying
                     • Alarming
                           •   Something happened
                           •   ... X times in Y minutes
                           •   Something didn’t happen
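
The alarming conditions on this slide can be sketched over a plain list of event timestamps. This is a minimal illustration and not Chronos's actual query interface; the function names and the use of seconds for the window are assumptions.

```python
def count_in_window(timestamps, now, window_seconds):
    """Number of events in the half-open window (now - window_seconds, now]."""
    return sum(1 for t in timestamps if now - window_seconds < t <= now)


def happened_too_often(timestamps, now, x, window_seconds):
    """'Something happened X times in Y minutes.'"""
    return count_in_window(timestamps, now, window_seconds) >= x


def did_not_happen(timestamps, now, window_seconds):
    """'Something didn't happen': no matching event in the window."""
    return count_in_window(timestamps, now, window_seconds) == 0


# Hypothetical deploy events, as seconds since some starting point.
deploys = [0, 60, 120, 1000]

print(happened_too_often(deploys, now=180, x=3, window_seconds=300))
print(did_not_happen(deploys, now=900, window_seconds=300))
```

The absence check is the one that metrics-style thresholds handle poorly, since an event that never fires produces no data point to compare against.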




Chronos
                     •     Rapidly Prototyped
                     •     Adapters and reporters
                     •     Easy querying
                     •     Alarming
                     •     Medium volume
                     •     Recursive
                           •   Recursive



End Result
                     • Massive decrease in change control tickets
                      • Not talking about SOX or PCI
                     • Better visibility into changes
                     • Decreased TTR
                      • Especially for bad code deployments
                     • You should do this
I Didn’t Mention

                     • End-to-end testing and alerting
                     • External availability and performance
                     • Open Connect
                     • Jobs

Questions?




 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Dernier (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Cloud Tech III: Actionable Metrics

  • 1. Actionable Metrics Enabling Decision-Making in Netflix’s Decentralized Environment Cloud Tech III October 6, 2012 Roy Rapoport @royrapoport, rsr@netflix.com Thursday, October 18, 12
  • 2. Me • Been in tech for about 20 years • Systems engineering, networking, software development, QA, release management • Time at Netflix: 1195 days (3y:3m:1w) • (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)
  • 7. Metrics Humor • % of instances with even public IP addresses
  • 9-12. Technology Overview • SOA, REST, mostly Java • Simple overall architecture (diagram on the slides)
  • 14-16. Culture Overview • Freedom and Responsibility • Distributed Operations • Get out of the way of developers
  • 18-20. The Metric Lifecycle • Send • Look • Alert
  • 21. Systems • Flexible • Scalable • Self-Service
  • 22. Telemetry: Flexible, Scalable, Self-Service
        import netflix.metrics
        [...]
        self.nm = netflix.metrics.Metrics("core_cag")
        [...]
        def api(self):
            self.nm.nfCounter("api")
            [...]
            self.nm.nfCounter("application_%s" % application)
            [...]
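The `netflix.metrics` module on the slide is internal to Netflix, so the snippet cannot be run as shown. As a rough illustration of the self-service counter pattern it demonstrates, here is a minimal in-process sketch. `CounterRegistry`, `counter`, `snapshot`, and `handle_api_call` are hypothetical names, not the real client's API, and the `<namespace>_<name>` naming is an assumption inferred from the `core_cag_api` metric that appears on later slides.

```python
import threading
from collections import defaultdict


class CounterRegistry:
    """Hypothetical stand-in for a metrics client; not the netflix.metrics API."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def counter(self, name, amount=1):
        """Increment a named counter; the counter is created on first use."""
        with self._lock:
            self._counts[name] += amount

    def snapshot(self):
        """Return a copy of all counts, keyed as '<namespace>_<name>' (assumed naming)."""
        with self._lock:
            return {f"{self.namespace}_{name}": count
                    for name, count in self._counts.items()}


def handle_api_call(metrics, application):
    """Count every API call, plus one counter per calling application."""
    metrics.counter("api")
    metrics.counter("application_%s" % application)


metrics = CounterRegistry("core_cag")
for app in ["search", "search", "billing"]:
    handle_api_call(metrics, app)
counts = metrics.snapshot()
```

Because counters come into existence the first time they are incremented, a developer can add a new metric with one line of code and no central registration, which is the self-service property the slide emphasizes.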
  • 23-28. Visualization: Flexible, Scalable, Self-Service
  • 29-31. Alerting: Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds • Compare to history
  • 32. For Example ... • Last 3 hours’ core_tools.core_cag_api • What the ...
  • 33. For Example ... Visualization (Continued) • Last 4 days’ core_tools.core_cag_api • Even more questions!
  • 34. For Example ... Visualization (Continued) • Last 10 days’ core_tools.core_cag_api • What caused the spike?
  • 35. For Example ... Visualization (Continued) • Show alert volume per application • Someone had a rough few days...
  • 36. Don’t Like Surprises...
        {
          "alerts": [
            {
              "applyTo": "cluster",
              "condition": {
                "minPercent": 90.0,
                "noise": 0.2,
                "maxPercent": 25.0,
                "type": "DoubleExponential"
              },
              "metricName": "core_cag_api",
              "severity": "major"
            }
          ],
          "clusters": [ "core_tools" ]
        }
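The `"type": "DoubleExponential"` condition suggests that the alert compares the live metric with a forecast built from its own history. The slides do not explain how `minPercent`, `maxPercent`, or `noise` are interpreted, so the sketch below does not model them. It shows only the general technique of double exponential (Holt) smoothing with a single hypothetical `tolerance` band; the function names and the `alpha` and `beta` defaults are illustrative assumptions.

```python
def double_exponential_forecasts(values, alpha=0.5, beta=0.3):
    """One-step-ahead Holt forecasts.

    forecasts[i] predicts values[i] from values[:i] only. The first two
    entries are None because two points are needed to seed level and trend.
    """
    if len(values) < 2:
        return [None] * len(values)
    level = values[1]
    trend = values[1] - values[0]
    forecasts = [None, None]
    for observed in values[2:]:
        forecasts.append(level + trend)
        previous_level = level
        # Blend the new observation with the previous forecast.
        level = alpha * observed + (1 - alpha) * (level + trend)
        # Blend the newly observed slope with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return forecasts


def find_anomalies(values, tolerance=0.25, alpha=0.5, beta=0.3):
    """Indices whose value deviates from its forecast by more than `tolerance`
    (a fraction of the forecast; a hypothetical parameter)."""
    anomalies = []
    forecasts = double_exponential_forecasts(values, alpha, beta)
    for index, (observed, forecast) in enumerate(zip(values, forecasts)):
        if forecast is None or forecast == 0:
            continue
        if abs(observed - forecast) / abs(forecast) > tolerance:
            anomalies.append(index)
    return anomalies


series = [100, 110, 120, 130, 140, 300, 160, 170]
flagged = find_anomalies(series)
steady = find_anomalies([10, 20, 30, 40, 50])
```

In this sketch the steadily growing part of `series` raises no alert because the trend term tracks the growth, while the jump to 300 is flagged. A spike also pulls the smoothed level upward, so the points right after it may be flagged as well; a production system would need some policy for that.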
  • 37. Threshold Tuning • An Abbreviated History ...
  • 38-41. Threshold Tuning (in the beginning) • Systems owned by IT • Want an alert? Submit a ticket • Want to tune an alert? Submit a ticket • "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  • 42-45. Threshold Tuning (It gets better) • You get to configure your own threshold • Freedom! • Also, you have to configure your own thresholds
  • 46-49. Threshold Tuning (Are we there yet?) • Play with historical data • Huge difference • Still falls short
  • 50-56. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  • 57. If Time Allows ...
  • 58-61. Events vs Metrics • Irregular interval • Point in time • Lack magnitude
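To make the contrast concrete, the sketch below models an event as a timestamped record with no numeric value and shows one way to derive a regular-interval metric from a stream of events by counting them per bucket. `Event`, `MetricSample`, and `events_to_metric` are hypothetical names chosen for illustration; the slides do not describe how Netflix represents either type.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds; events arrive at irregular times
    kind: str         # what happened; note there is no magnitude field


@dataclass(frozen=True)
class MetricSample:
    interval_start: float  # start of a fixed-width reporting interval
    value: float           # magnitude measured over that interval


def events_to_metric(events, kind, interval_seconds=60):
    """Count events of one kind per fixed-width interval.

    Only intervals that contain at least one event are returned; a real
    metric stream would also report zeros for the empty intervals.
    """
    buckets = Counter()
    for event in events:
        if event.kind == kind:
            start = event.timestamp - (event.timestamp % interval_seconds)
            buckets[start] += 1
    return [MetricSample(start, float(count))
            for start, count in sorted(buckets.items())]


events = [
    Event(5, "deploy"),
    Event(42, "deploy"),
    Event(61, "config_change"),
    Event(130, "deploy"),
]
samples = events_to_metric(events, "deploy")
```

Counting gives events a magnitude only in aggregate. The individual record still carries just a time and a description, which is why the talk treats events as a separate kind of data.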
  • 62-65. Why Build It? • Change management (vs change control) • What changed? • Better alerting
  • 67-74. Chronos • Rapidly prototyped • Adapters and reporters • Easy querying • Alarming: something happened; ... X times in Y minutes; something didn’t happen • Medium volume • Recursive • Recursive
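The alarming conditions listed for Chronos ("X times in Y minutes" and "something didn’t happen") can be sketched as two small rule classes. This is an illustration of the idea only, not Chronos code: `FrequencyRule`, `AbsenceRule`, and their parameters are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class FrequencyRule:
    """Fires when at least `times` events fall within `window_seconds`."""

    def __init__(self, times, window_seconds):
        self.times = times
        self.window_seconds = window_seconds
        self._recent = deque()

    def observe(self, timestamp):
        """Record one event and report whether the rule fires now."""
        self._recent.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while timestamp - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        return len(self._recent) >= self.times


class AbsenceRule:
    """Fires when no event has been seen for longer than `max_silence_seconds`."""

    def __init__(self, max_silence_seconds, started_at):
        self.max_silence_seconds = max_silence_seconds
        self._last_seen = started_at  # treat the start time as the last sighting

    def observe(self, timestamp):
        """Record that the expected event happened."""
        self._last_seen = timestamp

    def is_overdue(self, now):
        """Report whether the expected event is late as of `now`."""
        return now - self._last_seen > self.max_silence_seconds


# "3 times in 5 minutes": only the last event completes a dense enough burst.
burst = FrequencyRule(times=3, window_seconds=300)
fired = [burst.observe(t) for t in (0, 100, 500, 550, 600)]

# "Something didn't happen": a heartbeat expected at least every 2 minutes.
heartbeat = AbsenceRule(max_silence_seconds=120, started_at=0)
heartbeat.observe(60)
on_time = heartbeat.is_overdue(150)
late = heartbeat.is_overdue(200)
```

The absence rule has to be polled on a timer, because by definition no event arrives to trigger the check.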
  • 76-81. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR • Especially for bad code deployments • You should do this
  • 82. I Didn’t Mention • End-to-end testing and alerting • External availability and performance • Open Connect • Jobs