METRICS-DRIVEN ENGINEERING at Etsy
Kellan Elliott-McCrea, VP of Eng.
kellan@etsy.com @kellan




Tuesday, June 5, 12

What is Etsy?

8.5+ million items in the marketplace

400,000+ active

$300+ million in sales in 2010
~$41 million/month

> $1000 / minute

> 1 billion page views / month

business in over 150 countries

deploy the site, every ~20 minutes

engineering team grew ~4x in 2010

Metrics?



Logs, Graphs, Trends, and Correlations

Metrics Driven?

Making Decisions

How many visitors are using this thing?

Can we deploy that to 100% of our visitors?

Did we make it faster?

Did I just break something?

Q. WHO MAKES THESE GRAPHS?
A. Well, the Ops team manages the racks, the network, the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...




but... Engineers build the application.

Dev + Ops

ACCESS

Yes!   No.

“Engineers are too busy!”

Here’s the BIG SECRET...

... MAKE IT EASY!

Simple, open source tools

Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)

Ganglia
★cluster oriented
★huge community contributed recipes
★2.0 released today (including several Flickr and Etsy patches!)
★gmetad makes it easy to track custom metrics


Graphite
★super flexible collection and display
★per metrics buckets
★single instance
★super easy to write and use custom display functions



Logging

Logger::log_error("User login failed. Reason: $msg for $username", "login");




web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...
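
Because every line follows the same bracketed layout, it is easy to split back into fields. A minimal Python sketch, assuming the layout is host, timestamp, level, namespace, one numeric token, then the message. The slide does not say what the numeric token is, so the field name `token` is a placeholder, and `parse_log_line` is a hypothetical helper, not part of Etsy's logger.

```python
import re
from datetime import datetime

# Layout assumed from the slide's example line:
# host [timestamp] [level] [namespace] [token] message
LOG_LINE = re.compile(
    r"^(?P<host>\S+) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<level>[^\]]+)\] "
    r"\[(?P<namespace>[^\]]+)\] "
    r"\[(?P<token>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Timestamp format as shown on the slide, e.g. "Fri Mar 04 16:27:48 2011".
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%a %b %d %H:%M:%S %Y"
    )
    return fields


example = (
    "web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] "
    "User login failed. Reason: wrong password for ..."
)
print(parse_log_line(example))
```

With the level and namespace pulled out as fields, counting "login errors per minute" becomes a grouping problem rather than a grep.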




web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...




Counting and Timing
http://code.flickr.com/blog/2008/10/27/counting-timing/




Logster
https://github.com/etsy/logster




Forked from ganglia-logtailer:
- Daemon mode (only cron mode)
+ Support for Graphite
+ Simplified parsing scripts




web0001        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Help me, Rhonda.
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
       web0001        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
       web0201        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0034        [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web1101        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0201        [04:28:54   2011]   [error] [client 10.101.x.x] You've been eaten by a grue.
       web0055        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!!!
       web0002        [04:28:54   2011]   [warning] [client 10.101.x.x] Sky is falling.
       web0089        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0020        [04:28:54   2011]   [error] [client 10.101.x.x] Sky is falling.
       web1101        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
       web0055        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0001        [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0034        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0087        [04:28:54   2011]   [fatal] [client 10.101.x.x] Sky is falling.
       web0002        [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
       web0201        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
       web0077        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0355        [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
       web0052        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0003        [04:28:54   2011]   [error] [client 10.101.x.x] You've been eaten by a grue.
       web0066        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!!!
       web0001        [04:28:54   2011]   [warning] [client 10.101.x.x] Sky is falling
Fatals   Errors   Warnings




★runs out of cron
★maintains a cursor into log files
★supports ganglia and graphite
★custom parsers much easier to write than gmetad
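
The idea behind the bullets above can be sketched in a few lines of Python: remember a byte offset into the log (the cursor), read only what was appended since the last run, count lines per severity, and emit one metric per severity. This is an illustration of the approach, not Logster's actual parser API. The function names are hypothetical, and the `webs.errorLog` prefix is borrowed from a metric name that appears later in the deck.

```python
import re
from collections import Counter

# Severity tags as they appear in the sample log lines on the earlier slide.
LEVEL = re.compile(r"\[(warning|error|fatal)\]")


def read_new_lines(path, offset):
    """Read lines appended since `offset`; return (lines, new_offset)."""
    with open(path, "r") as handle:
        handle.seek(offset)
        lines = handle.readlines()
        return lines, handle.tell()


def count_levels(lines):
    """Count how many lines carry each severity tag."""
    counts = Counter({"warning": 0, "error": 0, "fatal": 0})
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def to_graphite_lines(counts, timestamp, prefix="webs.errorLog"):
    """Format counts as Graphite plaintext lines: name value timestamp."""
    return [
        "%s.%s %d %d" % (prefix, level, counts[level], timestamp)
        for level in sorted(counts)
    ]


sample = [
    "web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!",
    "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.",
    "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!",
    "web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!",
]
for metric_line in to_graphite_lines(count_levels(sample), 1299256134):
    print(metric_line)
```

Run from cron once a minute with the offset saved between runs, this turns a wall of log text into three lines on a graph.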




Apache access logs

LogFormat "%h %l %u %t \"%r\" %>s %b" common
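
For reference, a line written in the common format can be split into fields with one regular expression. A minimal Python sketch. The sample line is the example from the Apache documentation, and `parse_common` is a hypothetical helper name.

```python
import re

# One group per directive in: %h %l %u %t "%r" %>s %b
COMMON_LOG = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_common(line):
    """Return the fields of a common-format access log line, or None."""
    match = COMMON_LOG.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # %b logs "-" when no body bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
print(parse_common(sample))
```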




LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined

%{etsy_ab_selections}n




%{etsy_uaid}n




Graphs


“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it.” - Erik Kastner

http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/




StatsD
https://github.com/etsy/statsd/




StatsD::increment("logins.success");
StatsD::timing("gearman.time", $msec);
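
Under the hood, each of these calls amounts to one small UDP datagram. A minimal Python sketch of the StatsD line format (`name:value|c` for counters, `name:value|ms` for timers). The helper names are hypothetical, and the host and port are assumptions (8125 is StatsD's usual default).

```python
import socket


def counter_packet(name, delta=1):
    """Format a counter increment, e.g. 'logins.success:1|c'."""
    return "%s:%d|c" % (name, delta)


def timing_packet(name, msec):
    """Format a timing sample in milliseconds, e.g. 'gearman.time:320|ms'."""
    return "%s:%d|ms" % (name, msec)


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP means the application never blocks on metrics."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    finally:
        sock.close()


print(counter_packet("logins.success"))
print(timing_packet("gearman.time", 320))
```

Because the transport is UDP, a missing or overloaded StatsD daemon costs the application nothing, which is a large part of what makes adding a metric a one-line change.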




[Graph of gearman.time: 90th pct, average, lower]
StatsD::timing("gearman.time", $msec);
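
The three lines on that graph are summaries of all timing samples received in one flush interval. A rough Python sketch of that kind of summary, assuming "90th pct" means the largest value among the fastest 90% of samples. The exact rounding and the set of samples averaged may differ between StatsD versions, and `summarize_timings` is a hypothetical name.

```python
def summarize_timings(samples_msec, percentile=90):
    """Summarize one interval's timing samples: lower, average, upper percentile."""
    ordered = sorted(samples_msec)
    if not ordered:
        return None
    # Number of samples that fall inside the percentile threshold (at least one).
    keep = max(1, int(round(len(ordered) * percentile / 100.0)))
    return {
        "lower": ordered[0],
        "average": sum(ordered) / float(len(ordered)),
        "upper_%d" % percentile: ordered[keep - 1],
    }


samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(summarize_timings(samples))
```

Tracking the upper percentile alongside the average keeps a handful of slow outliers from hiding inside an otherwise healthy mean.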




Ad hoc
name value timestamp




echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
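
The same one-liner can be written in any language that can open a TCP socket. A minimal Python sketch of Graphite's plaintext format (`name value timestamp`, newline-terminated, port 2003 as on the slide). The function names are hypothetical and the host passed to `send_metric` is left to the caller.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one plaintext-protocol line: 'name value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)


def send_metric(line, host, port=2003):
    """Open a TCP connection to the Graphite listener and send one line."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


# Record a deploy event at a fixed example timestamp (no network call here).
print(format_metric("events.deploy.site", 1, 1299256068), end="")
```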




Correlations



Trends + Events
         target=drawAsInfinite(events.deploy.site)




What Happened?


Holt-Winters


"Forecasting Sales by Exponentially Weighted Moving Averages". Peter

"Aberrant Behavior Detection in Time Series for Network Monitoring".

"Holt-Winters Forecasting Applied to Poisson Processes in Real-Time".



holtWintersConfidence(Upper|Lower)
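
For orientation, the additive Holt-Winters recurrences with a smoothed deviation term, in the style of the "Aberrant Behavior Detection" paper cited earlier, can be written as below. The symbols are chosen here for illustration: $y_t$ is the observed value, $m$ the season length, $a$, $b$, $c$ the level, trend and seasonal components, $d$ the smoothed absolute deviation, and $\delta$ a band-width multiplier. The parameter values Graphite itself uses are not stated in the slides.

```latex
\begin{aligned}
\hat{y}_t &= a_{t-1} + b_{t-1} + c_{t-m} \\
a_t &= \alpha\,(y_t - c_{t-m}) + (1-\alpha)\,(a_{t-1} + b_{t-1}) \\
b_t &= \beta\,(a_t - a_{t-1}) + (1-\beta)\,b_{t-1} \\
c_t &= \gamma\,(y_t - a_t) + (1-\gamma)\,c_{t-m} \\
d_t &= \gamma\,\lvert y_t - \hat{y}_t \rvert + (1-\gamma)\,d_{t-m} \\
\text{band}_t &= \bigl(\hat{y}_t - \delta\,d_{t-m},\ \hat{y}_t + \delta\,d_{t-m}\bigr)
\end{aligned}
```

The upper and lower confidence series are the two ends of $\text{band}_t$.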




holtWintersAberration
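
Conceptually, an aberration series is zero while the metric stays inside its confidence band and otherwise reports how far outside it went. A minimal Python sketch of that idea, assuming the band values are already computed. The function names are hypothetical and this is not Graphite's implementation.

```python
def aberration(value, lower, upper):
    """Signed distance outside the band; 0.0 while the value is inside it."""
    if value > upper:
        return value - upper
    if value < lower:
        return value - lower
    return 0.0


def aberrations(series, lowers, uppers):
    """Apply aberration() point by point across three aligned series."""
    return [aberration(v, lo, hi) for v, lo, hi in zip(series, lowers, uppers)]


# Illustrative numbers only: the middle point dips below the band, the last rises above.
print(aberrations([7, 3, 12], [5, 5, 5], [10, 10, 10]))
```

A series that is flat at zero under normal conditions is straightforward to alert on: any non-zero value is a candidate page.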




business metrics with confidence bands == alertable business metrics


16,000 metrics in GRAPHITE
(plus 32,000 metrics in GANGLIA)




Dashboards


Hard
       <a href="http://graphite.etsycorp.com/render?
       from=-1hours&width=800&height=600&title=File+or+Script+Not
       +Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite
       %28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production
       %29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite
       %28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,
       %23ff0000,%23006633,%23cc6600">
       
   <img src="http://graphite.etsycorp.com/render?
       from=-1hours&width=280&height=220&title=File+or+Script+Not
       +Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite
       %28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production
       %29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite
       %28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,
       %23ff0000,%23006633,%23cc6600">
       </a>




Easy!
     $g = new Graphite($time);
     $g->setTitle('File Not Found');
     $g->addMetric('webs.errorLog.notExist', '#00cc00');
     $g->showDeploys(true);
     echo $g->getDashboardHTML(280, 220);
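
A wrapper like this mostly hides URL construction. A Python sketch of the same idea, loosely mirroring the method names on the slide. The class is hypothetical, not Etsy's PHP `Graphite` class, and `graphite.example.com` is a placeholder host. The deploy metric names are taken from the "Hard" slide.

```python
from html import escape
from urllib.parse import urlencode


class GraphiteGraph:
    """Collects render parameters and emits a Graphite render URL."""

    def __init__(self, base_url, period="-1hours"):
        self.base_url = base_url
        self.period = period
        self.title = None
        self.targets = []
        self.colors = []

    def set_title(self, title):
        self.title = title

    def add_metric(self, target, color):
        self.targets.append(target)
        self.colors.append(color)

    def show_deploys(self, deploy_metrics, color="#0000ff"):
        # Each deploy event is drawn as a vertical line via drawAsInfinite().
        for metric in deploy_metrics:
            self.add_metric("drawAsInfinite(%s)" % metric, color)

    def render_url(self, width, height):
        params = [("from", self.period), ("width", width), ("height", height)]
        if self.title:
            params.append(("title", self.title))
        params.extend(("target", target) for target in self.targets)
        params.append(("colorList", ",".join(self.colors)))
        # urlencode handles the percent-escaping that is painful to write by hand.
        return self.base_url + "?" + urlencode(params)


graph = GraphiteGraph("http://graphite.example.com/render")
graph.set_title("File Not Found")
graph.add_metric("webs.errorLog.notExist", "#00cc00")
graph.show_deploys(["deploys.web.production", "deploys.config.production"])
url = graph.render_url(280, 220)
print('<img src="%s">' % escape(url, quote=True))
```

The point of the "Hard" versus "Easy!" contrast is that nobody hand-writes percent-encoded targets twice. Once the URL building lives in one place, adding a dashboard graph takes a few lines.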




48 dashboards by 32 engineers

Application health

High-level visibility

Low MTTD

Confidence

Make metrics

Not that much

codeascraft.etsy.com
github.com/etsy/statsd
github.com/etsy/logster
bitbucket.org/maplebed/ganglia-logtailer




Questions?




Metrics driven engineering (velocity 2011)
