Analytics driven operations - Steve Acreman - Dataloop

(From the LondonCD meetup on 20 Oct 2016 - http://www.meetup.com/London-Continuous-Delivery/events/231766686/)

Modern infrastructure is becoming increasingly complex and harder to operate. Trends like containerisation, micro-services and serverless architectures are making it difficult to work out exactly what is happening when problems occur. Most companies are building large distributed systems that were unthinkable only a few years ago. This talk will explain how an analytics monitoring stack will put developers and operations back in the driving seat and give them back control over their uptime.

  1. www.dataloop.io | @dataloopio | info@dataloop.io: Monitoring for Online Services
  2. What is Dataloop? Performance, Up / Down Alerts, Dev Env, Enterprise Stuff
  3. Architecture
  4. First Year
  5. First Year
  6. Measure
  7. Putting out the fire: rollup worker, metric worker
  8. Problems
     • NodeJS metrics workers not scaling
     • Memory management was an issue
     • Needed big caches to reduce database load
     • GC cycles too long
     • 8 x single processes on an 8 core server
  9. Metric worker re-write
     • Approximately 6 weeks from no Erlang experience to working version
     • No more crashes
     • Reduced servers needed from 16 to 8
     • Pushes metrics straight from Rabbit into DalmatinerDB (new database)
  10. Today
  11. Happy Ending
  12. Just the beginning!
  13. Initial Instrumentation
     › StatsD libraries in Node and Erlang code
     › Push UDP packets to a StatsD server for aggregation
  14. Pitfalls
     › Metrics increase as service usage increases
     › UDP isn’t great
     › Aggregates across a service (hard to spot an outlier)
     › Quite lossy
  15. Better Instrumentation
     › Prometheus HTTP metrics endpoints
     › 10 second scrape interval into Dataloop
     › Raw data (no loss)
     › Dimensions allow drill down into host
  16. Prometheus Output: curl http://localhost/metrics
  17. What to instrument?
     › Everything!
     › Feature usage
     › Throughput
     › Error rates
     › If it moves, instrument it
  18. Analytics
     › Simple things like API response times
  19. Analytics
     › Pretty useful to plot when a problem started
  20. Yesterday vs. Today
  21. SQL-Like Query Language
  22. Time Series Functions
     › Create a query to answer questions
  23. Future
     › Prediction algorithms
     › Search ‘similar’ metrics
     › Outlier algorithms
     › More functions!
  24. Summary
     › Code level metrics with Prometheus are extremely lightweight
     › Have a framework in place to quickly add more when issues arise
     › Don’t wait until your first fire to start
     › Start small and try to get both operations and developers on board
  25. Q&A
  26. www.dataloop.io @dataloopio
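
Slides 13 and 14 describe pushing UDP packets to a StatsD server and note that this path is lossy. The sketch below is not taken from the talk; it only illustrates the general StatsD line format (name:value|type, with an optional |@rate suffix) and a fire-and-forget UDP send. The metric names, the 127.0.0.1 host and port 8125 (the usual StatsD default) are assumptions made for illustration.

```python
import socket


def statsd_packet(name: str, value: float, metric_type: str = "c",
                  sample_rate: float = 1.0) -> bytes:
    """Format one metric in the StatsD line format: name:value|type[|@rate]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # A sample rate tells the server that only a fraction of events were sent.
        line += f"|@{sample_rate}"
    return line.encode("ascii")


def send_metric(packet: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one packet over UDP. There is no acknowledgement, so a dropped
    packet goes unnoticed by the sender (the lossiness noted on slide 14)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    send_metric(statsd_packet("api.requests", 1))
    send_metric(statsd_packet("api.response_time", 12.5, "ms"))
```

Because the server aggregates everything it receives under one metric name, per-host detail is lost unless the host is encoded in the name itself, which is one way to read the "hard to spot an outlier" pitfall.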
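
Slides 15 and 16 refer to Prometheus HTTP metrics endpoints and the output of curl http://localhost/metrics, but the output itself did not survive in this transcript. As a rough illustration of what such an endpoint returns, the sketch below renders a counter in the Prometheus text exposition format using only the standard library. The metric name, help text, labels and values are invented for the example; a real service would normally use an official Prometheus client library rather than hand-rolled formatting.

```python
from typing import Dict, Tuple

# A label set is a tuple of (label_name, label_value) pairs, e.g. (("host", "web1"),).
Labels = Tuple[Tuple[str, str], ...]


def render_counter(name: str, help_text: str, samples: Dict[Labels, float]) -> str:
    """Render one counter metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical per-host counters; the host label is what allows drill-down.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests handled.",
        {
            (("host", "web1"), ("status", "200")): 1027,
            (("host", "web2"), ("status", "500")): 3,
        },
    ))
```

Each labelled line is a separate time series, so a scraper keeps the raw per-host values instead of a single service-wide aggregate.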
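
Slides 20 to 22 mention a "Yesterday vs. Today" comparison built from time series functions, without showing the query itself. The sketch below is not Dataloop's query language; it is a plain Python illustration of the underlying idea, assuming samples are keyed by timestamp and that yesterday's samples fall on exactly the same clock times. The function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def day_over_day(series: Dict[datetime, float],
                 shift: timedelta = timedelta(days=1)) -> Dict[datetime, float]:
    """For each timestamp that also has a sample exactly `shift` earlier,
    return the current value minus the earlier value."""
    return {
        ts: value - series[ts - shift]
        for ts, value in series.items()
        if ts - shift in series
    }


if __name__ == "__main__":
    # Hypothetical API response times in milliseconds.
    response_ms = {
        datetime(2016, 9, 19, 10, 0): 120.0,
        datetime(2016, 9, 19, 10, 1): 125.0,
        datetime(2016, 9, 20, 10, 0): 180.0,
        datetime(2016, 9, 20, 10, 1): 124.0,
    }
    for ts, delta in sorted(day_over_day(response_ms).items()):
        print(ts.isoformat(), delta)
```

A large positive difference at a given time of day suggests that something changed since yesterday, which is the kind of question slide 22 says a query should answer.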