Skynet Project
Monitor, analyze, scale, and maintain an infrastructure in the Cloud
Or everything humans should not do
@SylvainKalache
What about me?
•Operations Engineer @SlideShare since 2011
•Fan of system automation
•@SylvainKalache on Twitter
The Cloud
The cloud is now everywhere. It's elastic, dynamic, and far from the rigid infrastructures we had before.
Most modern architectures use "The Cloud" because it offers so many possibilities that no one can ignore it.
What about our tools?
But what about our tools? Are they designed to deal with the cloud?
Nagios and Ganglia, for example, were clearly not designed for it. What are the alternatives? Not many, heh?
Of course we could have used CloudWatch because we are on EC2, but the possibilities are still limited in terms of granularity and customization, and you hit a limit at some point.
Even when these tools do their job, something breaks, Nagios complains, Ganglia shows the problem in its graphs...
And then what? Nothing, we need human intervention. Why?
Automation?
Why do humans still do repetitive things and have to react to events that the system could handle?
Why aren't we automating the management of systems?
What brought Skynet to life?
SlideShare has an infrastructure hosted in EC2 for document conversion.
We have a lot of "factories" that convert different types of documents.
This infrastructure scales up and down on demand.
Old scaling script
Our whole scaling process was divided into 3 parts:
1. A Bash script to launch and configure instances.
2. Monit and Ruby code to keep the code running.
3. Another Ruby script to shut down the server.
These 3 pieces of code did not communicate with each other and were not fail-safe.
•Idle
•Immortal
•Wasting money
Zombie instances
Meaning that we ended up with instances that were not running any code.
They would not shut themselves down and were doing nothing but wasting money.
•No metrics - blind
•Wasting time fixing
Mummy Ops
For Ops, the issue was that we had no visibility into what was going on in our infrastructure (we were only monitoring the SQS queue).
We ended up wasting a lot of time investigating and then eventually fixing problems.
There was also a lack of feedback for the developers working on the conversion code: are we converting faster? Better?
With Skynet Controller
The idea was to create a Controller that would manage the whole instance lifetime, and more.
It would scale intelligently based on the current state of the system, but also on trends that we can generate from historical data.
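As an illustration of scaling on both current state and trend, here is a minimal Python sketch (Skynet itself is written in Ruby). It assumes the signal is the depth of the job queue sampled at regular intervals; the function names, the `jobs_per_instance` ratio, the `lookahead` window, and the instance bounds are all hypothetical, and the trend is a plain least-squares slope rather than whatever model Skynet actually uses.

```python
import math
from typing import Sequence


def linear_trend(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def desired_instances(queue_history: Sequence[int],
                      jobs_per_instance: int = 10,
                      lookahead: int = 3,
                      min_instances: int = 1,
                      max_instances: int = 50) -> int:
    """Size the fleet for the projected queue depth, not only the current one."""
    if not queue_history:
        return min_instances
    current = queue_history[-1]
    projected = current + linear_trend(queue_history) * lookahead
    # Never plan for less than what is already waiting in the queue.
    target_depth = max(current, projected, 0)
    wanted = math.ceil(target_depth / jobs_per_instance)
    return max(min_instances, min(max_instances, wanted))
```

With a rising history such as `[10, 20, 30, 40]` the projection runs ahead of the current depth, so the fleet grows before the backlog does. With a falling history the current depth still sets the floor, so jobs already queued are not stranded.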
Skynet architecture
The idea of Skynet was to make something flexible so that we could use it elsewhere.
It should not be architected specifically for our problem but for any system.
Why not open source it at some point?
Collectors: Ruby daemons for collecting system metrics and a gem for application logs.
Log collection: Fluentd
Datastore: MongoDB
Query API: Ruby + Sinatra
Controller: Ruby + MCollective + Puppet + Fog
Dashboard: Investigation and monitoring tools
Collect system metrics
Collect application logs
Collectors
We collect system metrics via a Ruby daemon running on each machine; any metric can be collected via plugins.
We collect application logs via a gem.
Those 2 components send data to a local Fluentd.
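To make the plugin idea concrete, here is a small sketch in Python rather than the Ruby used by the real daemon. It assumes each plugin is a function returning a dictionary of values, and that delivery to the local Fluentd is hidden behind a `send(tag, record)` callable. The registry, the `skynet.metrics.*` tag prefix, and the record layout are hypothetical.

```python
import os
import shutil
import socket
import time
from typing import Callable, Dict

# Registry of metric plugins: name -> function returning a dict of values.
PLUGINS: Dict[str, Callable[[], dict]] = {}


def plugin(name: str):
    """Decorator that registers a metric plugin under the given name."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


@plugin("load")
def load_average() -> dict:
    # os.getloadavg() exists on Unix-like systems only.
    one, five, fifteen = os.getloadavg()
    return {"1m": one, "5m": five, "15m": fifteen}


@plugin("disk")
def root_disk() -> dict:
    usage = shutil.disk_usage("/")
    return {"total_bytes": usage.total, "free_bytes": usage.free}


def collect_once(send: Callable[[str, dict], None],
                 now: Callable[[], float] = time.time) -> int:
    """Run every plugin once and hand each record to `send`; return the count."""
    sent = 0
    for name, func in PLUGINS.items():
        try:
            values = func()
        except Exception as exc:  # one broken plugin must not stop the others
            values = {"error": str(exc)}
        record = {"host": socket.gethostname(), "time": now(), "values": values}
        send(f"skynet.metrics.{name}", record)
        sent += 1
    return sent
```

Calling `collect_once(print)` prints one tagged record per plugin. In a deployment like the one described here, `send` would wrap a Fluentd client, and adding a metric would only mean registering another function.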
Fluentd
•Light
•Written in Ruby
•Handles failure
•Plugins
•Local forwarder
•Aggregation & routing
•Stream processing
Why? What?
Fluentd is in charge of collecting the logs, routing them, and carrying them to their endpoint.
It handles failure properly (fails over to another node, and will back up logs if the next hop is down).
Written in Ruby, super light.
Another advantage is the plugin design: any input/output is managed via a plugin that you create or customize without having to mess with the Fluentd core.
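The failure handling described above (fail over to another node, keep logs locally while the next hop is down) can be sketched as follows. This is not Fluentd's code or configuration, only a Python illustration of the behaviour. The class name, the `Sender` callables, and the in-memory buffer are assumptions; Fluentd itself can also buffer to disk.

```python
from collections import deque
from typing import Callable, Deque, Sequence, Tuple

Record = Tuple[str, dict]          # (tag, payload)
Sender = Callable[[Record], None]  # raises ConnectionError when the node is down


class BufferedForwarder:
    """Try each downstream node in order; keep records locally if all fail."""

    def __init__(self, senders: Sequence[Sender], max_buffer: int = 10_000):
        self.senders = list(senders)
        # Oldest records are dropped first once the buffer is full.
        self.buffer: Deque[Record] = deque(maxlen=max_buffer)

    def _deliver(self, record: Record) -> bool:
        for send in self.senders:
            try:
                send(record)
                return True
            except ConnectionError:
                continue  # fail over to the next node
        return False

    def flush(self) -> int:
        """Resend buffered records in order; stop at the first failure."""
        flushed = 0
        while self.buffer:
            if not self._deliver(self.buffer[0]):
                break
            self.buffer.popleft()
            flushed += 1
        return flushed

    def emit(self, tag: str, payload: dict) -> bool:
        """Queue the record, then try to drain; True if nothing is left buffered."""
        self.buffer.append((tag, payload))
        self.flush()
        return not self.buffer
```

Because every new record goes through the buffer first, records that were held back during an outage are delivered before newer ones once a node answers again, so ordering is preserved.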
•Schema less
•Key/value match log format
•Store system metrics & application logs
•Jobs metadata
Why? What?
MongoDB is schemaless, which lets our data schema evolve very easily.
MongoDB fits the bill (more or less) for the format: 1 log entry = 1 Mongo document.
We store system metrics and information (CPU, memory, top output, disk, Nginx active connections...) as well as application logs.
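To illustrate the "1 log entry = 1 document" mapping, here is a Python sketch that turns a key/value log line into a dictionary ready to be stored. The `key=value` line format, the field names (`job_id`, `duration_ms`, `pages`), and the added `host` and `logged_at` fields are assumptions for the example, not Skynet's actual log format.

```python
from datetime import datetime, timezone


def to_document(line: str, host: str) -> dict:
    """Turn one 'key=value key=value' log line into one document."""
    fields: dict = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    # Cast the fields we know to be numeric; new keys need no schema change.
    for key in ("duration_ms", "pages"):
        if key in fields:
            fields[key] = int(fields[key])
    fields["host"] = host
    fields["logged_at"] = datetime.now(timezone.utc).isoformat()
    return fields
```

With a driver such as PyMongo, the resulting dictionary could be stored with `collection.insert_one(doc)`. If a later version of the application logs an extra key, it simply shows up as an extra field in the new documents, which is the schema flexibility mentioned above.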
Abstraction of the datastore
Easy REST interface
Keep control over requests processed
Post-request computing via plugin
API
The abstraction layer/REST API gives us the possibility to use any datastore in the backend. We are using MongoDB now, but we could put Elasticsearch in front of it later.
Also, depending on the type of data, we store it in different MongoDB clusters, but this is totally transparent to the data consumer.
We keep control over what's possible so that a single data consumer cannot crash the whole datastore.
We offer computing plugins that process the data right on the API server before returning it to the requester. It is better to process data as close as possible to the source to avoid transporting big chunks of data over the network.
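The three ideas in these notes (datastore abstraction, a cap on what a request may ask for, and post-request computing plugins) can be sketched in a few lines of Python. This is framework-free on purpose and is not the Sinatra API. `handle_query`, the `fetch` callable, `MAX_RESULTS`, and the plugin names are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List, Optional

MAX_RESULTS = 1000  # hard cap so one consumer cannot exhaust the datastore

# Post-request computing plugins: name -> function over the fetched rows.
COMPUTE_PLUGINS: Dict[str, Callable[[List[dict], str], object]] = {
    "avg": lambda rows, field: mean(r[field] for r in rows) if rows else None,
    "max": lambda rows, field: max((r[field] for r in rows), default=None),
    "count": lambda rows, field: len(rows),
}


def handle_query(fetch: Callable[[dict, int], Iterable[dict]],
                 filters: dict,
                 limit: int = 100,
                 compute: Optional[str] = None,
                 field: str = "value") -> dict:
    """Answer one API request; `fetch` hides which datastore is behind it."""
    if compute is not None and compute not in COMPUTE_PLUGINS:
        return {"status": 400, "error": f"unknown compute plugin: {compute}"}
    limit = max(1, min(limit, MAX_RESULTS))
    rows = list(fetch(filters, limit))[:limit]
    if compute is None:
        return {"status": 200, "rows": rows}
    # Reduce on the API server so only the result crosses the network.
    return {"status": 200, "result": COMPUTE_PLUGINS[compute](rows, field)}
```

Swapping MongoDB for another backend would only change the `fetch` callable. A consumer asking for an average gets a single number back instead of every matching row.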
Monitor all the things
Granular & global view
Investigation tool
Dashboard
We can monitor all kinds of things; the possibilities are endless with the combination of the Ruby daemon and the gem.
We can use all this data to get a global overview of our infrastructure, but also to drill down and see that at this particular second there was a CPU spike for a job on this machine converting this document.
It's also a useful debugging tool.
Automate
Scale
Fix
Alert
Controller
It can use all the collected data to scale, based on the current state but also on trends.
It can take action if a system is in an abnormal state and try to fix it via possibility trees.
And finally, the Skynet Controller can also alert, in case the system reaches its limit and finally needs a human, again...!
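The slides do not detail what a possibility tree looks like. One plausible reading is a decision tree whose inner nodes are yes/no questions about the observed state and whose leaves are actions, with alerting a human as the last resort. The Python sketch below follows that reading; the state keys, thresholds, and action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Action:
    """Leaf of the tree: what the controller should do."""
    name: str


@dataclass
class Check:
    """Inner node: a yes/no question about the observed state."""
    question: Callable[[dict], bool]
    if_yes: Union["Check", Action]
    if_no: Union["Check", Action]


def decide(node: Union[Check, Action], state: dict) -> Action:
    """Walk from the root to a leaf, following the answers."""
    while isinstance(node, Check):
        node = node.if_yes if node.question(state) else node.if_no
    return node


# Hypothetical tree for a conversion instance.
TREE = Check(
    question=lambda s: s["worker_running"],
    if_no=Action("restart_worker"),
    if_yes=Check(
        question=lambda s: s["idle_minutes"] > 30,
        if_yes=Action("terminate_instance"),
        if_no=Check(
            question=lambda s: s["cpu_percent"] > 95,
            if_yes=Action("alert_human"),
            if_no=Action("do_nothing"),
        ),
    ),
)
```

For example, `decide(TREE, {"worker_running": True, "idle_minutes": 45, "cpu_percent": 0})` returns the `terminate_instance` action, which is the kind of zombie instance the old scripts left running.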
Want to read more about it?
Check out the blog post!
http://engineering.slideshare.net/2014/04/skynet-project-monitor-scale-and-auto-heal-a-system-in-the-cloud/
Thank you!
Let Skynet take over the world to make it a better place.
@SylvainKalache
If you are interested in Skynet, please don't hesitate to reach out to me!

Contenu connexe

Tendances

ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and DesignITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and DesignErginBilgin3
 
강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019
강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019
강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019devCAT Studio, NEXON
 
"Simple Made Easy" Made Easy
"Simple Made Easy" Made Easy"Simple Made Easy" Made Easy
"Simple Made Easy" Made EasyKent Ohashi
 
40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenext
40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenext40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenext
40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenextYahoo!デベロッパーネットワーク
 
アプリケーションコードにおける技術的負債について考える
アプリケーションコードにおける技術的負債について考えるアプリケーションコードにおける技術的負債について考える
アプリケーションコードにおける技術的負債について考えるpospome
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?Sadayuki Furuhashi
 
아파치 카프카 입문과 활용 강의자료
아파치 카프카 입문과 활용 강의자료아파치 카프카 입문과 활용 강의자료
아파치 카프카 입문과 활용 강의자료원영 최
 
Data Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPData Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPJiangjun Huang
 
Modern C# Programming 現代的なC#の書き方、ライブラリの選び方
Modern C# Programming 現代的なC#の書き方、ライブラリの選び方Modern C# Programming 現代的なC#の書き方、ライブラリの選び方
Modern C# Programming 現代的なC#の書き方、ライブラリの選び方Yoshifumi Kawai
 
OpenAI の音声認識 AI「Whisper」をテストしてみた
OpenAI の音声認識 AI「Whisper」をテストしてみたOpenAI の音声認識 AI「Whisper」をテストしてみた
OpenAI の音声認識 AI「Whisper」をテストしてみたHide Koba
 
개발자 이승우 이력서 (2016)
개발자 이승우 이력서 (2016)개발자 이승우 이력서 (2016)
개발자 이승우 이력서 (2016)SeungWoo Lee
 
Yahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれから
Yahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれからYahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれから
Yahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれからYahoo!デベロッパーネットワーク
 
Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Seongyun Byeon
 
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Preferred Networks
 
Facebook은 React를 왜 만들었을까?
Facebook은 React를 왜 만들었을까? Facebook은 React를 왜 만들었을까?
Facebook은 React를 왜 만들었을까? Kim Hunmin
 
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축Juhong Park
 
RESTful API (JAX-RS) 書くだけで仕様書も 自動で作られていく話 with MicroProfile Open API
RESTful API (JAX-RS) 書くだけで仕様書も自動で作られていく話 with MicroProfile Open APIRESTful API (JAX-RS) 書くだけで仕様書も自動で作られていく話 with MicroProfile Open API
RESTful API (JAX-RS) 書くだけで仕様書も 自動で作られていく話 with MicroProfile Open APIKohei Saito
 
PHPの戻り値型宣言でselfを使ってみよう
PHPの戻り値型宣言でselfを使ってみようPHPの戻り値型宣言でselfを使ってみよう
PHPの戻り値型宣言でselfを使ってみようDQNEO
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)NAVER D2
 
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편Seongyun Byeon
 

Tendances (20)

ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and DesignITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
 
강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019
강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019
강성훈, 실버바인 대기열 서버 설계 리뷰, NDC2019
 
"Simple Made Easy" Made Easy
"Simple Made Easy" Made Easy"Simple Made Easy" Made Easy
"Simple Made Easy" Made Easy
 
40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenext
40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenext40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenext
40000 コンテナを動かす SRE チームに至るまでの道 1/25(土) SRE NEXT 2020 発表資料 #srenext
 
アプリケーションコードにおける技術的負債について考える
アプリケーションコードにおける技術的負債について考えるアプリケーションコードにおける技術的負債について考える
アプリケーションコードにおける技術的負債について考える
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
 
아파치 카프카 입문과 활용 강의자료
아파치 카프카 입문과 활용 강의자료아파치 카프카 입문과 활용 강의자료
아파치 카프카 입문과 활용 강의자료
 
Data Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPData Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCP
 
Modern C# Programming 現代的なC#の書き方、ライブラリの選び方
Modern C# Programming 現代的なC#の書き方、ライブラリの選び方Modern C# Programming 現代的なC#の書き方、ライブラリの選び方
Modern C# Programming 現代的なC#の書き方、ライブラリの選び方
 
OpenAI の音声認識 AI「Whisper」をテストしてみた
OpenAI の音声認識 AI「Whisper」をテストしてみたOpenAI の音声認識 AI「Whisper」をテストしてみた
OpenAI の音声認識 AI「Whisper」をテストしてみた
 
개발자 이승우 이력서 (2016)
개발자 이승우 이력서 (2016)개발자 이승우 이력서 (2016)
개발자 이승우 이력서 (2016)
 
Yahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれから
Yahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれからYahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれから
Yahoo! JAPANのサービス開発を10倍早くした社内PaaS構築の今とこれから
 
Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라
 
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
 
Facebook은 React를 왜 만들었을까?
Facebook은 React를 왜 만들었을까? Facebook은 React를 왜 만들었을까?
Facebook은 React를 왜 만들었을까?
 
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
 
RESTful API (JAX-RS) 書くだけで仕様書も 自動で作られていく話 with MicroProfile Open API
RESTful API (JAX-RS) 書くだけで仕様書も自動で作られていく話 with MicroProfile Open APIRESTful API (JAX-RS) 書くだけで仕様書も自動で作られていく話 with MicroProfile Open API
RESTful API (JAX-RS) 書くだけで仕様書も 自動で作られていく話 with MicroProfile Open API
 
PHPの戻り値型宣言でselfを使ってみよう
PHPの戻り値型宣言でselfを使ってみようPHPの戻り値型宣言でselfを使ってみよう
PHPの戻り値型宣言でselfを使ってみよう
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)
 
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
BigQuery의 모든 것(기획자, 마케터, 신입 데이터 분석가를 위한) 입문편
 

En vedette

CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
CloudCamp Chicago - Big Data & Cloud May 2015 - All SlidesCloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
CloudCamp Chicago - Big Data & Cloud May 2015 - All SlidesCloudCamp Chicago
 
CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...
CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...
CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...CloudCamp Chicago
 
Cashing in on logging and exception data
Cashing in on logging and exception dataCashing in on logging and exception data
Cashing in on logging and exception dataStackify
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoringRohit Jnagal
 
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...Adrian Cockcroft
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudDatadog
 
Lifting the Blinds: Monitoring Windows Server 2012
Lifting the Blinds: Monitoring Windows Server 2012Lifting the Blinds: Monitoring Windows Server 2012
Lifting the Blinds: Monitoring Windows Server 2012Datadog
 
Deep-Dive to Application Insights
Deep-Dive to Application Insights Deep-Dive to Application Insights
Deep-Dive to Application Insights Gunnar Peipman
 
Intro to open source telemetry linux con 2016
Intro to open source telemetry   linux con 2016Intro to open source telemetry   linux con 2016
Intro to open source telemetry linux con 2016Matthew Broberg
 
Sysdig Monitorama Slides
Sysdig Monitorama SlidesSysdig Monitorama Slides
Sysdig Monitorama SlidesLoris Degioanni
 
RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...
RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...
RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...Amazon Web Services
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
Intel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWSIntel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWSAmazon Web Services
 
Volta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceVolta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceLN Renganarayana
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to RootsBrendan Gregg
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAdrian Cockcroft
 

En vedette (20)

Fluentd at SlideShare
Fluentd at SlideShareFluentd at SlideShare
Fluentd at SlideShare
 
CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
CloudCamp Chicago - Big Data & Cloud May 2015 - All SlidesCloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
 
CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...
CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...
CloudCamp Chicago lightning talk "Connecting Vehicles on Google Cloud Platfor...
 
Cashing in on logging and exception data
Cashing in on logging and exception dataCashing in on logging and exception data
Cashing in on logging and exception data
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
Lifting the Blinds: Monitoring Windows Server 2012
Lifting the Blinds: Monitoring Windows Server 2012Lifting the Blinds: Monitoring Windows Server 2012
Lifting the Blinds: Monitoring Windows Server 2012
 
Potentio lab report
Potentio lab reportPotentio lab report
Potentio lab report
 
Data Logging and Telemetry
Data Logging and TelemetryData Logging and Telemetry
Data Logging and Telemetry
 
Deep-Dive to Application Insights
Deep-Dive to Application Insights Deep-Dive to Application Insights
Deep-Dive to Application Insights
 
Intro to open source telemetry linux con 2016
Intro to open source telemetry   linux con 2016Intro to open source telemetry   linux con 2016
Intro to open source telemetry linux con 2016
 
Sysdig Monitorama Slides
Sysdig Monitorama SlidesSysdig Monitorama Slides
Sysdig Monitorama Slides
 
RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...
RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...
RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Intel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWSIntel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWS
 
Skynet Week 9 H4D Stanford 2016
Skynet Week 9 H4D Stanford 2016Skynet Week 9 H4D Stanford 2016
Skynet Week 9 H4D Stanford 2016
 
Volta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceVolta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a Service
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to Roots
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at Netflix
 

Similaire à Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud

Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Brian Brazil
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018Christophe Rochefolle
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
Evolving your api architecture with the strangler pattern
Evolving your api architecture with the strangler patternEvolving your api architecture with the strangler pattern
Evolving your api architecture with the strangler patterndwcarter74
 
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowakiStreaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowakijavier ramirez
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Brian Brazil
 
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...Cloud Native Day Tel Aviv
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️Ori Pekelman
 
Security for AWS : Journey to Least Privilege (update)
Security for AWS : Journey to Least Privilege (update)Security for AWS : Journey to Least Privilege (update)
Security for AWS : Journey to Least Privilege (update)dhubbard858
 
Security for AWS: Journey to Least Privilege
Security for AWS: Journey to Least PrivilegeSecurity for AWS: Journey to Least Privilege
Security for AWS: Journey to Least PrivilegeLacework
 
A Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOpsA Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOpsBigPanda
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)Eran Levy
 
Has serverless adoption hit a roadblock?
Has serverless adoption hit a roadblock?Has serverless adoption hit a roadblock?
Has serverless adoption hit a roadblock?Veselin Pizurica
 
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Sergey Platonov
 
Infrastructure as Code, Theory Crash Course
Infrastructure as Code, Theory Crash CourseInfrastructure as Code, Theory Crash Course
Infrastructure as Code, Theory Crash CourseDr. Sven Balnojan
 
Serverless microservices in the wild
Serverless microservices in the wildServerless microservices in the wild
Serverless microservices in the wildRotem Tamir
 
Moving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed TracesMoving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed TracesKP Kaiser
 

Similaire à Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud (20)

Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
Evolving your api architecture with the strangler pattern
Evolving your api architecture with the strangler patternEvolving your api architecture with the strangler pattern
Evolving your api architecture with the strangler pattern
 
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowakiStreaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
 
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️
 
Security for AWS : Journey to Least Privilege (update)
Security for AWS : Journey to Least Privilege (update)Security for AWS : Journey to Least Privilege (update)
Security for AWS : Journey to Least Privilege (update)
 
Security for AWS: Journey to Least Privilege
Security for AWS: Journey to Least PrivilegeSecurity for AWS: Journey to Least Privilege
Security for AWS: Journey to Least Privilege
 
A Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOpsA Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOps
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
 
Has serverless adoption hit a roadblock?
Has serverless adoption hit a roadblock?Has serverless adoption hit a roadblock?
Has serverless adoption hit a roadblock?
 
Dev Ops without the Ops
Dev Ops without the OpsDev Ops without the Ops
Dev Ops without the Ops
 
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...
 
Infrastructure as Code, Theory Crash Course
Infrastructure as Code, Theory Crash CourseInfrastructure as Code, Theory Crash Course
Infrastructure as Code, Theory Crash Course
 
Serverless microservices in the wild
Serverless microservices in the wildServerless microservices in the wild
Serverless microservices in the wild
 
Moving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed TracesMoving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed Traces
 

Plus de Sylvain Kalache

Will 2018 be the year a robot takes your job? It's not clear whether AI and a...
Will 2018 be the year a robot takes your job? It's not clear whether AI and a...Will 2018 be the year a robot takes your job? It's not clear whether AI and a...
Will 2018 be the year a robot takes your job? It's not clear whether AI and a...Sylvain Kalache
 
Why more tech companies are looking for workers equipped with 'soft skills'
Why more tech companies are looking for workers equipped with 'soft skills'Why more tech companies are looking for workers equipped with 'soft skills'
Why more tech companies are looking for workers equipped with 'soft skills'Sylvain Kalache
 
Speakers for the Pathways to Youth Employment conference
Speakers for the Pathways to Youth Employment conferenceSpeakers for the Pathways to Youth Employment conference
Speakers for the Pathways to Youth Employment conferenceSylvain Kalache
 
Are you ready for the 4th industrial revolution?
Are you ready for the 4th industrial revolution?Are you ready for the 4th industrial revolution?
Are you ready for the 4th industrial revolution?Sylvain Kalache
 
Company culture difference between France & USA
Company culture difference between France & USACompany culture difference between France & USA
Company culture difference between France & USASylvain Kalache
 
while42 Saint-Émilion - La Fleur Picon
while42 Saint-Émilion - La Fleur Piconwhile42 Saint-Émilion - La Fleur Picon
while42 Saint-Émilion - La Fleur PiconSylvain Kalache
 
while42 the untold story
while42 the untold storywhile42 the untold story
while42 the untold storySylvain Kalache
 
Guillaume & Maurice invitation
Guillaume & Maurice invitationGuillaume & Maurice invitation
Guillaume & Maurice invitationSylvain Kalache
 




Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud

  • 1. Skynet Project Monitor, analyze, scale, and maintain an infrastructure in the Cloud Or everything humans should not do @SylvainKalache
  • 2. What about me? •Operations Engineer @SlideShare since 2011 •Fan of system automation •@SylvainKalache on Twitter
  • 3. The Cloud The cloud is now everywhere. It's elastic, dynamic, and far from the rigid infrastructures we had before. Most modern architectures use "The Cloud" because it offers so many possibilities that no one can ignore it.
  • 4. What about our tools? But what about our tools? Are they designed to deal with the cloud? Nagios and Ganglia, for example, are obviously not designed for it. What are the alternatives? Not many, huh? Of course we could have used CloudWatch, since we are on EC2, but the possibilities are still limited in terms of granularity and customization, and you hit a limit at some point.
  • 5. When these same tools do their job and something breaks, Nagios complains and Ganglia shows the problem in its graphs... And then what? Nothing: we need human intervention. Why?
  • 6. Automation? Why do humans still do repetitive things and have to react to events that a system could handle? Why aren't we automating the management of systems?
  • 7. What brought Skynet to life? SlideShare has an infrastructure hosted in EC2 for document conversion. We have a lot of "factories" that convert different types of documents. This infrastructure scales up and down on demand.
  • 8. Old scaling script Our whole scaling process was divided into 3 parts: 1. A Bash script to launch and configure instances. 2. Monit and Ruby code to keep the code running. 3. Another Ruby script to shut down the server. These 3 pieces of code did not communicate with each other and were not fail-safe.
  • 9. Zombie instances •Idle •Immortal •Wasting money We ended up with instances that were not running code, would not shut themselves down, and did nothing but waste money.
  • 10. Mummy Ops •No metrics - blind •Wasting time fixing For Ops, the issue was having no visibility into what was going on with our infrastructure (we were only monitoring the SQS queue). We ended up wasting a lot of time investigating and then eventually fixing problems. There was also a lack of feedback for the developers working on the conversion code: are we converting faster? Better?
  • 11. With Skynet Controller The idea was to create a Controller that would manage the whole instance lifetime, and more. It would scale intelligently based on the current state of the system, but also based on trends derived from historical data.
  • 12. Skynet architecture The idea of Skynet was to make something flexible so that we could use it elsewhere. It should not be architected specifically for our problem but for any system. Why not open source it at some point?
  • 13. •Collectors: Ruby daemons for collecting system metrics and a gem for application logs •Log collection: Fluentd •Datastore: MongoDB •Query API: Ruby + Sinatra •Controller: Ruby + MCollective + Puppet + Fog •Dashboard: Investigation and monitoring tools
  • 14. Collectors •Collect system metrics •Collect application logs System metrics are collected by a Ruby daemon running on each machine; any metric can be collected via plugins. Application logs are collected via a gem. Both components send data to Fluentd locally.
  • 15. Fluentd •Light •Written in Ruby •Handles failure •Plugins •Local forwarder •Aggregation & routing •Stream processing Why? What? Fluentd is in charge of collecting the logs, routing them, and carrying them to their endpoint. It handles failure properly (it fails over to another node and buffers logs if the next hop is down). It is written in Ruby and very light. Another advantage is its structure: every input/output is managed by a plugin that you create or customize without having to touch the Fluentd core.
  • 16. •Schemaless •Key/value matches log format •Stores system metrics & application logs •Jobs metadata Why? What? MongoDB is schemaless, which lets our data schema evolve very easily. MongoDB more or less fits the bill for the format: one log entry, one Mongo document. We store system metrics and information (CPU, memory, top output, disk, Nginx active connections...) as well as application logs.
  • 17. API •Abstraction of the datastore •Easy REST interface •Keep control over requests processed •Post-request computing via plugin The abstraction layer/REST API lets us use any datastore in the backend; we are using MongoDB now, but we could put Elasticsearch in front of it later. Also, depending on the type of data, we store it in different MongoDB clusters, but this is totally transparent to the data consumer. We keep control over what's possible so that a single data consumer cannot crash the whole datastore. We offer computing plugins that process the data right on the API server before returning it to the requester. It is better to process data as close to the source as possible, to avoid transporting big chunks of data over the network.
  • 18. Dashboard •Monitor all the things •Granular & global view •Investigation tool We can monitor all kinds of things; the possibilities are endless with the combination of the Ruby daemon and the gem. We can use all this data to get a global overview of our infrastructure, but also to get granularity and see that at a particular second there was a CPU spike for a job on a given machine converting a given document. It's also a useful debugging tool.
  • 19. Controller •Automate •Scale •Fix •Alert It can use all the data collected to scale, based on the current state but also on trends. It can take action if a system is in an abnormal state and try to fix it via possibility trees. And finally, the Skynet Controller can also alert, in case the system reaches its limit and finally needs a human, again...!
  • 20. Want to read more about it? Check out the blog post! http://engineering.slideshare.net/2014/04/skynet-project-monitor-scale-and-auto-heal-a-system-in-the-cloud/
  • 21. Thank you! Let Skynet take over the world to make it a better place. @SylvainKalache If you are interested in Skynet, please don't hesitate to reach out to me!
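
Slides 11 and 19 say the Controller scales on the current state of the system and on trends from historical data. The slides do not show how that decision is made, so what follows is only a minimal sketch of one way it could work, assuming the signal is a history of queue depths (the deck mentions an SQS queue) sampled at a fixed interval. Every name here (`trend_slope`, `desired_instances`, `scaling_action`, `jobs_per_instance`, `horizon`) is hypothetical and not taken from Skynet.

```ruby
# Hypothetical sketch: these names and parameters are illustrative
# assumptions, not part of Skynet.

# Least-squares slope of evenly spaced samples
# (change in value per sample interval).
def trend_slope(samples)
  n = samples.size
  return 0.0 if n < 2

  mean_x = (n - 1) / 2.0
  mean_y = samples.sum / n.to_f
  num = samples.each_with_index.map { |y, x| (x - mean_x) * (y - mean_y) }.sum
  den = (0...n).map { |x| (x - mean_x)**2 }.sum
  num / den
end

# Number of instances wanted, given the queue-depth history.
# jobs_per_instance: assumed backlog one instance can absorb.
# horizon: how many sample intervals ahead the trend is projected.
def desired_instances(queue_history, jobs_per_instance:, horizon: 3, min: 1, max: 50)
  current = queue_history.last || 0
  projected = current + trend_slope(queue_history) * horizon
  # Use the larger of current and projected load, so a falling trend
  # never sizes the fleet below what the current backlog needs.
  target = [current, projected].max
  (target / jobs_per_instance.to_f).ceil.clamp(min, max)
end

# Turn the gap between running and desired instances into an action.
def scaling_action(running, desired)
  delta = desired - running
  return [:hold, 0] if delta.zero?

  delta.positive? ? [:scale_up, delta] : [:scale_down, -delta]
end
```

For example, with a queue growing as `[10, 20, 30, 40]` and `jobs_per_instance: 10`, the projected backlog three intervals ahead is 70, so `desired_instances` asks for 7 instances rather than the 4 that the current backlog alone would suggest. The `min`/`max` clamp stands in for the limit at which, per slide 19, the Controller would alert a human instead of scaling further.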