Django Application Monitoring
with Sentry, ELK and
Prometheus
By Ridwan Fadjar Septian
Cloud Infrastructure Engineer at NiceDay Nederland B.V.
PyCon ID 2021
Introduction
- My name is Ridwan Fadjar Septian
- Living in Bandung, Indonesia
- My career journey:
- 2014 - 2016, Web Programmer using PHP
- 2016 - 2017, Backend Engineer using Django
- 2017, Backend Engineer on a big data project using AWS Lambda, AWS Kinesis, AWS
EMR + PySpark and AWS S3 as the data lake. Also Cloud Infrastructure Engineer
- 2017 - 2018, Backend Engineer using Django. Also Cloud Infrastructure Engineer
at NiceDay Nederland B.V.
- 2018 - Current, Cloud Infrastructure Engineer at NiceDay Nederland B.V., mostly
working with Google Cloud Platform
- My favorites
- Programming languages: Python and JavaScript
- Web frameworks: Django
- Operating system: Linux
- My interests: Open Source Projects, AI, DevOps, Cloud Infrastructure, Software Engineering,
IT Governance, IT Security, Computer Networking, etc.
Overview
A. Company Background
- NiceDay Nederland B.V.
- Online mental healthcare provider since 2014
- Covers the national market in the Netherlands
- Planning to expand into the international market
- Aiming to become a leader in mental healthcare services, competing with
other companies in the national sector
- Based in Rotterdam, NL
- Branch office in Bandung, ID
- +/- 50 employees, Rotterdam and Bandung combined
- Coming from diverse nationalities and backgrounds
- Find out more here -> https://nicedaynederland.nl/en/home-en/
B. Problems
● How to provide secure services?
● How to ensure availability of our services?
● How to build a better security practice?
● How to give a better experience to our users (therapists and clients)?
C. Goals
● Why do we need monitoring and logging systems?
○ We are trying to give our users a secure mental healthcare service
○ Highly available service for our users
○ Compliance with national, regional and international security standards
■ NEN 7510-2:2017 (the Netherlands’ national standard for health
information system security)
■ GDPR (regional data protection regulation of the European Union)
■ ISO 27001:2013 (international standard for information security
management systems)
○ Better user experience for our users (therapists and clients)
Architectures
D. Architectures of Our Application - An Overview
E. Monitoring and Logging Architectures Overview
E. Architectures - Sentry 10
E. Architectures - Elasticsearch and Kibana
E. Architectures - Prometheus, Alertmanager and OpsGenie
E. Architectures - Prometheus and Grafana
Current Implementation
F. Current Implementation - Elasticsearch + Kibana
● Elasticsearch + Kibana
○ Functions
■ Managing logs from Docker containers and hosts
■ Weekly log inspection
● Measure the performance of our services (e.g. Apdex)
● Find errors in Docker container logs or system logs
■ Root cause analysis on system or application logs per incident
■ Service endpoint deprecation
■ etc.
○ Ability
■ Retain all logs for years (long term)
■ Fast queries across various logs over a wide time range
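The Apdex measurement mentioned above can be illustrated with a small calculation. This is a minimal sketch of the standard Apdex formula, assuming response times in seconds have already been extracted from the logs; the function name, the sample values and the 0.5 s threshold are illustrative assumptions, not values from the talk.

```python
def apdex(response_times, threshold):
    """Return the Apdex score for a list of response times (seconds).

    Requests at or below the threshold T count as satisfied, requests
    between T and 4T count as tolerating (half weight), the rest as
    frustrated (zero weight).
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times for one endpoint, in seconds.
samples = [0.12, 0.30, 0.45, 0.80, 1.90, 2.50]
score = apdex(samples, threshold=0.5)
```

A score close to 1.0 means most requests were served within the target time; a weekly inspection could compare this value per endpoint.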
F. Current Implementation - Elasticsearch + Kibana (2)
● Deployment
○ Managed services at Elastic Cloud
○ Previously, we used Logstash to ingest Filebeat logs. Now Filebeat
sends logs to Elasticsearch directly
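One way a Django service can make its container logs easier to index is to write one JSON object per line to stdout. The sketch below assumes Filebeat is configured to decode JSON lines; the `JsonLineFormatter` class and the field names are hypothetical and are not taken from the talk.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback in one field so the log line stays one JSON object.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# Django passes the LOGGING setting to logging.config.dictConfig.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": JsonLineFormatter}},
    "handlers": {"console": {"class": "logging.StreamHandler", "formatter": "json"}},
    "root": {"handlers": ["console"], "level": "INFO"},
}
```

Structured fields such as `level` and `logger` can then be filtered in Kibana without parsing free-form text.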
F. Current Implementation - Sentry 10
● Sentry 10
○ Functions
■ Manage bugs and exceptions from our Django, Python, React.js and
React Native projects
● Bug management for every release
■ Performance analytics tools for developers
■ Root cause analysis at the application code level
● Bug tracing
○ Ability
■ Retain caught exceptions for years (long term)
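For a Django project, exception and performance reporting of this kind is typically wired up through the `sentry-sdk` package. The sketch below is an assumption about how such a setup might look, not the configuration used at NiceDay; the helper names, the release string format and the 10% trace sample rate are hypothetical.

```python
def build_sentry_options(dsn, release, environment, traces_sample_rate=0.1):
    """Collect the keyword arguments passed to sentry_sdk.init."""
    if not 0.0 <= traces_sample_rate <= 1.0:
        raise ValueError("traces_sample_rate must be between 0.0 and 1.0")
    return {
        "dsn": dsn,                                # project DSN from the Sentry UI
        "release": release,                        # lets Sentry group errors per release
        "environment": environment,                # e.g. "staging" or "production"
        "traces_sample_rate": traces_sample_rate,  # share of requests traced for APM
        "send_default_pii": False,                 # avoid sending user data by default
    }


def init_sentry(options):
    """Initialise the SDK; imported lazily so this module loads without sentry-sdk."""
    import sentry_sdk
    from sentry_sdk.integrations.django import DjangoIntegration

    sentry_sdk.init(integrations=[DjangoIntegration()], **options)


# Hypothetical values; in settings.py these would come from environment variables.
options = build_sentry_options(dsn="", release="backend@1.0.0", environment="staging")
```

Calling `init_sentry(options)` from `settings.py` would enable reporting; tagging each event with `release` is what makes per-release bug management possible.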
F. Current Implementation - Sentry 10 (2)
● Deployment
○ Self-hosted on Google Cloud Platform
■ 3 VM instances hosting the Sentry 10 containers, managed by container
orchestration
● e2-standard-4: 4 vCPUs, 16 GB of RAM
■ Cloud SQL for the Sentry 10 database, which stores its event records
■ Cloud Storage to host Sentry 10 data
○ Sentry 10 is quite complex. It requires Apache Kafka and ClickHouse
as its new data stores.
F. Current Implementation - Prometheus
● Prometheus + Grafana
○ Function
■ OKR evaluation
● Weekly
● Every 6 months
■ Root cause analysis using server and application metrics
○ Ability
■ Retain resource and application metrics for a month (short term)
F. Current Implementation - Prometheus (2)
● Prometheus + Alertmanager + OpsGenie
○ Function
■ Service uptime monitoring
● Whether service performance is getting slower
■ VMs status monitoring
● Memory
● CPU
● Disk/IO
● Uptime
● etc.
○ Ability
■ Faster alerting to the Infrastructure Team
● An alert may arrive in just under 1 minute or 5 minutes
○ SMS
○ Push Notification
○ Phone Call
● OpsGenie will keep your phone ringing until you respond to the
alert.
F. Current Implementation - Prometheus (3)
● Deployment
○ Self-hosted on Google Cloud Platform
■ Single VM instance hosting Prometheus and Alertmanager
● e2-standard-2: 2 vCPUs, 8 GB of RAM
■ Grafana is deployed on our container orchestration, co-hosted with other
services used by the infrastructure team.
F. Current Implementation - Security
We secure the deployment of Prometheus, Elasticsearch + Kibana and Sentry by
applying these measures:
- Deploy those tools in a private network
- Only the Infrastructure team has access to those tools for management purposes
- Every user of those tools has least privileges.
- Only a few people are superadmins for administration purposes.
- Access to the private network requires 2FA
F. Current Implementation vs The History Behind it
- Back in 2017, we used New Relic as our monitoring tool.
- But its capability for storing logs from our servers and Docker containers wasn’t
satisfying. Therefore, we built a self-hosted Elasticsearch cluster
- The alerting system wasn’t satisfying either. So we built our own alerting system using
self-hosted Prometheus
- Finally, we found that Sentry 9 was simpler than New Relic for managing exceptions
from our application. So we built our bug management on Sentry 9
- 2019, Sentry and Prometheus moved to Google Cloud Platform, still self-hosted
- We faced networking issues at our local cloud provider, so our infrastructure
was running in an unstable environment.
- 2019, Elasticsearch + Kibana upgraded
- We moved Elasticsearch and Kibana to Elastic Cloud because the log volume we managed
was nearly 1 TB and it was really hard to scale. Moreover, networking issues were one of the
main problems with that local cloud provider
- 2020, Sentry upgraded from version 9 to 10
- We moved to Sentry 10 because we wanted to use the APM provided by this new
version. But we still self-host it on Google Cloud Platform. Sentry
Cloud is quite expensive, as it is charged per developer in our company.
Usage Examples
G. Usage examples - Prometheus
G. Usage examples - Prometheus + OpsGenie
G. Usage examples - Elasticsearch + Kibana
G. Usage examples - Sentry 10
Impacts
H. Impacts
● Those tools help us to provide secure services
○ Prometheus + OpsGenie
■ Warn us when SSL certificates are about to expire.
○ Elasticsearch + Kibana
■ Weekly log inspection
● Anomalies in HTTP requests coming to our services
○ Calls to unknown endpoints
○ Unusual numbers of requests exceeding the
normal requests per second.
● Find suspicious SSH activity by anyone other than our
whitelisted users
● Find suspicious scripts being executed by cron
● Find commands executed by whitelisted users which might
put our services in danger
○ Sentry
■ Find any parts of the application that might lead to bugs
○ etc.
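The certificate expiry warning listed above comes down to a date comparison. In the talk this alert is raised through Prometheus + OpsGenie; the standalone sketch below only illustrates the underlying check with the Python standard library. The function names and the 14-day warning window are assumptions.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the days left before a certificate expires.

    `not_after` uses the format of the "notAfter" field returned by
    ssl.SSLSocket.getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def should_warn(not_after, warn_days=14, now=None):
    """True when the certificate expires in fewer than `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days
```

The `now` parameter exists only to make the check deterministic in tests; in normal use it is left out and the current time is used.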
H. Impacts (2)
● Those tools help us to ensure availability of our services
○ Prometheus + OpsGenie
■ Faster response to incidents in our infrastructure, 24/7
■ Improve our infrastructure by keeping it optimized and efficient
● Reduce the cost of underperforming VMs
■ Detect unapplied migration scripts in the backend service
● These might crash the backend service if we can’t detect them early
○ Elasticsearch + Kibana
■ Highly available log inspection to help root cause analysis when an incident
happens
● Find error output in Docker container logs across our
Docker-based services
● Find error output in system logs across our servers
■ We don’t have to SSH into our servers to find system error logs
■ We don’t have to check Docker logs to find service error logs
○ Sentry
■ We can configure Sentry to send OpsGenie alerts, triggered when an
exception is caught from our services.
○ etc.
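Detecting unapplied migrations, as mentioned above, can be done by asking Django's migration executor for its pending plan. This is a sketch under assumptions: the talk does not say how the check is implemented, and the function name is hypothetical. The function accepts the executor as a parameter so it can be exercised without a database.

```python
def count_unapplied_migrations(executor):
    """Return the number of migrations that still need to be applied.

    `executor` is expected to behave like Django's MigrationExecutor:
    the plan for the graph's leaf nodes lists every pending migration.
    """
    targets = executor.loader.graph.leaf_nodes()
    return len(executor.migration_plan(targets))


# Inside a Django process the executor would be built like this:
#   from django.db import connection
#   from django.db.migrations.executor import MigrationExecutor
#   pending = count_unapplied_migrations(MigrationExecutor(connection))
```

The resulting count could be exported as a metric and alerted on when it stays above zero after a deployment.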
H. Impacts (3)
● Those tools help us to build a better security practice
○ Elasticsearch + Kibana
■ Highly available log inspection for further root cause analysis
after an incident that happened last week or last month
○ Prometheus + Grafana
■ Monitor incident response performance through various
sources
● MTTA, mean time to acknowledge
● MTTR, mean time to resolve
● MTBF, mean time between failures
● 99PTA, 99th percentile time to acknowledge
● 99PTR, 99th percentile time to resolve
■ Decide on better strategies every new OKR period.
● For example, the infrastructure team maintains its workflows
related to NiceDay security practice
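The incident response metrics listed above can be computed from incident timestamps. The sketch below is a minimal illustration with hypothetical field names and sample data. It assumes MTBF is measured from the resolution of one incident to the start of the next (definitions vary) and uses the nearest-rank method for percentiles.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def incident_metrics(incidents):
    """Compute MTTA, MTTR, MTBF and 99th percentiles, all in seconds.

    Each incident is a dict with "opened", "acknowledged" and "resolved"
    timestamps expressed as epoch seconds.
    """
    ordered = sorted(incidents, key=lambda i: i["opened"])
    tta = [i["acknowledged"] - i["opened"] for i in ordered]
    ttr = [i["resolved"] - i["opened"] for i in ordered]
    # Gap between one incident being resolved and the next one opening.
    gaps = [b["opened"] - a["resolved"] for a, b in zip(ordered, ordered[1:])]
    return {
        "mtta": mean(tta),
        "mttr": mean(ttr),
        "mtbf": mean(gaps) if gaps else None,
        "p99_tta": percentile(tta, 99),
        "p99_ttr": percentile(ttr, 99),
    }


# Hypothetical incidents for one evaluation period.
sample = [
    {"opened": 0, "acknowledged": 60, "resolved": 600},
    {"opened": 4200, "acknowledged": 4320, "resolved": 5100},
    {"opened": 8700, "acknowledged": 8880, "resolved": 9000},
]
metrics = incident_metrics(sample)
```

With only a few incidents per period the 99th percentile equals the worst case, so it becomes more informative as the incident history grows.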
H. Impacts (4)
● Those tools help us to give a better experience to our users
○ Sentry
■ Faster debugging of their codebases for developers
● They can see how an exception was produced through the amazing stack trace
visualization
● They can see which release an exception was caught in
● They can find the line where an exception was raised
● For example, the backend team can debug the Django and Celery codebases
more easily and quickly
● Etc.
○ Elasticsearch + Kibana
■ Improve the backend service through performance analysis
■ Backend service endpoint deprecation
■ Help developers find performance bottlenecks in the service
H. Impacts (5)
● Other impacts
○ Stay compliant with security standards, as assurance to
clients.
○ Management can see an overview of service status when they
need it
○ Management can see that in-house teams and products are
improving
○ etc.
I. Best Practices
● Prometheus + OpsGenie: refine your alerting rules periodically to better suit
your team’s needs
● Whichever tool you use
○ Enforce a least-privilege setup
■ Give people only what they need. Don’t give them roles that their
tasks don’t require
○ Enable two-factor authentication whenever possible
○ Set up a process in your team for managing all of your credentials
■ You might use password managers (e.g. 1Password, Dashlane,
Bitwarden, LogMeOnce, etc.)
■ Manage secret key and password rotation to keep your monitoring
infrastructure secure
○ Evaluate the security-related processes in your team
■ Threats might also come from inside. For example:
● Bugs from the development team
● Human error when performing a particular task on the infrastructure
○ Connect to your logging infrastructure over a private connection
■ Use a secure approach to connect to your third-party logging services
○ Deploy and manage your logging infrastructure in a private network
■ For example, separate the monitoring and logging private network
from the warehouse, staging and production private networks.
Let’s wrap up
By enabling monitoring and logging systems, we might be able to:
● provide secure services
● ensure availability of our services
● build a better security practice
● give a better experience to our users
References
● Sentry
○ https://develop.sentry.dev/self-hosted/
○ https://docs.sentry.io/product/
● Elastic Cloud
○ https://www.elastic.co/guide/index.html
○ https://www.elastic.co/guide/en/kibana/current/index.html
● Prometheus
○ https://prometheus.io/docs/prometheus/latest/getting_started/
○ https://prometheus.io/docs/alerting/latest/alertmanager/
○ https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-prometheus/
● Security Practices, especially for Monitoring and Logging
○ https://sre.google/sre-book/table-of-contents/
○ NEN 7510-2:2017 - 12.4 Reporting and monitoring ->
https://www.webtoolmanagementsystemen.nl/en/ViewDocumentSection/d873e9df-44ae-413b-8564-7ca7df60bde1/d873e9df-44ae-413b-8564-7ca7df60bde1/255021a3-1c42-4700-98f6-7f04eb16274f#8f13d102-3e26-4580-a20c-f4ae375725cb
○ ISO 27001:2013 - Annex A - A.12 Operations Security - A.12.4 Logging and Monitoring
Special Thanks!
● PyCon Indonesia 2021, who made this possible!
● Kurnia Jaya Eliazar, Team Manager at NiceDay, for reviewing my slides and
giving amazing feedback
● NiceDay Infrastructure Team, who gave me unlimited chances to implement
and improve NiceDay infrastructure
● Former Ebizu Data Team, who gave me a lot of chances to explore
AWS and Python application development on a big data project.
● Bramandityo Prabowo, who taught me Python, Linux, Django and
many other things at college
Keep in touch
● Reach me at
○ E-mail: ridwanbejo@gmail.com
○ LinkedIn: https://www.linkedin.com/in/ridwan-fadjar-79781756/
○ Github: https://github.com/ridwanbejo
○ Google Scholar: https://scholar.google.com/citations?hl=en&user=edU-dL8AAAAJ
Q & A

Contenu connexe

Tendances

Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Flink Forward
 
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Bharath Sudharsan
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Spark Summit
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Spark Summit
 

Tendances (20)

Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 
Introduction to High Performance Computing
Introduction to High Performance ComputingIntroduction to High Performance Computing
Introduction to High Performance Computing
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
Optimized placement in Openstack for NFV
Optimized placement in Openstack for NFVOptimized placement in Openstack for NFV
Optimized placement in Openstack for NFV
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
 
My Dissertation 2016
My Dissertation 2016My Dissertation 2016
My Dissertation 2016
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
Web application
Web applicationWeb application
Web application
 
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
 
High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...
 
Connecting kafka message systems with scylla
Connecting kafka message systems with scylla   Connecting kafka message systems with scylla
Connecting kafka message systems with scylla
 
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
Globe2Train: A Framework for Distributed ML Model Training using IoT Devices ...
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
Hadoop summit 2016
Hadoop summit 2016Hadoop summit 2016
Hadoop summit 2016
 
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
 
Simulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightningSimulating Heterogeneous Resources in CloudLightning
Simulating Heterogeneous Resources in CloudLightning
 

Similaire à Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitoring with sentry, elk and prometheus

Similaire à Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitoring with sentry, elk and prometheus (20)

Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
 
Building a Distributed & Automated Open Source Program at Netflix
Building a Distributed & Automated Open Source Program at NetflixBuilding a Distributed & Automated Open Source Program at Netflix
Building a Distributed & Automated Open Source Program at Netflix
 
#RADC4L16: An API-First Archives Approach at NPR
#RADC4L16: An API-First Archives Approach at NPR#RADC4L16: An API-First Archives Approach at NPR
#RADC4L16: An API-First Archives Approach at NPR
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Google Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data editionGoogle Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data edition
 
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open Source
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
vinay-mittal-new
vinay-mittal-newvinay-mittal-new
vinay-mittal-new
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and Heroku
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
 
NGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEA
NGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEANGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEA
NGINX Microservices Reference Architecture: What’s in Store for 2019 – EMEA
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Designing for operability and managability
Designing for operability and managabilityDesigning for operability and managability
Designing for operability and managability
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
Controlled Evolution with Puppet and AWS
Controlled Evolution with Puppet and AWSControlled Evolution with Puppet and AWS
Controlled Evolution with Puppet and AWS
 

Plus de Ridwan Fadjar

Plus de Ridwan Fadjar (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
PyCon ID 2023 - Ridwan Fadjar Septian.pdf
PyCon ID 2023 - Ridwan Fadjar Septian.pdfPyCon ID 2023 - Ridwan Fadjar Septian.pdf
PyCon ID 2023 - Ridwan Fadjar Septian.pdf
 
Cloud Infrastructure automation with Python-3.pdf
Cloud Infrastructure automation with Python-3.pdfCloud Infrastructure automation with Python-3.pdf
Cloud Infrastructure automation with Python-3.pdf
 
GraphQL- Presentation
GraphQL- PresentationGraphQL- Presentation
GraphQL- Presentation
 
Bugs and Where to Find Them (Study Case_ Backend).pdf
Bugs and Where to Find Them (Study Case_ Backend).pdfBugs and Where to Find Them (Study Case_ Backend).pdf
Bugs and Where to Find Them (Study Case_ Backend).pdf
 
Introduction to Elixir and Phoenix.pdf
Introduction to Elixir and Phoenix.pdfIntroduction to Elixir and Phoenix.pdf
Introduction to Elixir and Phoenix.pdf
 
CS meetup 2020 - Introduction to DevOps
CS meetup 2020 - Introduction to DevOpsCS meetup 2020 - Introduction to DevOps
CS meetup 2020 - Introduction to DevOps
 
Why Serverless?
Why Serverless?Why Serverless?
Why Serverless?
 
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
SenseHealth Indonesia Sharing Session - Do we really need growth mindset (1)
 
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
Risk Analysis of Dutch Healthcare Company Information System using ISO 27001:...
 
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseA Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
 
Mongodb intro-2-asbasdat-2018-v2
Mongodb intro-2-asbasdat-2018-v2Mongodb intro-2-asbasdat-2018-v2
Mongodb intro-2-asbasdat-2018-v2
 
Mongodb intro-2-asbasdat-2018
Mongodb intro-2-asbasdat-2018Mongodb intro-2-asbasdat-2018
Mongodb intro-2-asbasdat-2018
 
Mongodb intro-1-asbasdat-2018
Mongodb intro-1-asbasdat-2018Mongodb intro-1-asbasdat-2018
Mongodb intro-1-asbasdat-2018
 
Resftul API Web Development with Django Rest Framework & Celery
Resftul API Web Development with Django Rest Framework & CeleryResftul API Web Development with Django Rest Framework & Celery
Resftul API Web Development with Django Rest Framework & Celery
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
 
Kisah Dua Sejoli: Arduino & Python
Kisah Dua Sejoli: Arduino & PythonKisah Dua Sejoli: Arduino & Python
Kisah Dua Sejoli: Arduino & Python
 
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
 
Modul pelatihan-django-dasar-possupi-v1
Modul pelatihan-django-dasar-possupi-v1Modul pelatihan-django-dasar-possupi-v1
Modul pelatihan-django-dasar-possupi-v1
 
Membuat game-shooting-dengan-pygame
Membuat game-shooting-dengan-pygameMembuat game-shooting-dengan-pygame
Membuat game-shooting-dengan-pygame
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...

Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitoring with sentry, elk and prometheus

  • 1. Django Application Monitoring with Sentry, ELK and Prometheus By Ridwan Fadjar Septian Cloud Infrastructure Engineer at NiceDay Nederland B.V. PyCon ID 2021
  • 2. Introduction - My name is Ridwan Fadjar Septian - Living in Bandung, Indonesia - My career journey: - 2014 - 2016, Web Programmer using PHP - 2016 - 2017, Backend Engineer using Django - 2017, Backend Engineer for a big data project using AWS Lambda, AWS Kinesis, AWS EMR + PySpark and AWS S3 for the data lake. Also Cloud Infrastructure Engineer - 2017 - 2018, Backend Engineer using Django. Also Cloud Infrastructure Engineer at NiceDay Nederland B.V. - 2018 - Current, Cloud Infrastructure Engineer at NiceDay Nederland B.V., mostly working with Google Cloud Platform - My favorites - Programming languages: Python and JavaScript - Web frameworks: Django - Operating system: Linux - My interests: Open Source Projects, AI, DevOps, Cloud Infrastructure, Software Engineering, IT Governance, IT Security, Computer Networking, etc.
  • 4. A. Company Background - NiceDay Nederland B.V. - Online mental healthcare provider since 2014 - Covers the national market in the Netherlands - Planning to expand into the international market - Aiming to become a leader in mental healthcare services, competing with other companies in the national sector - Based in Rotterdam, NL - Branch office in Bandung, ID - +/- 50 employees, Rotterdam and Bandung combined - From diverse nationalities and backgrounds - Visit us here -> https://nicedaynederland.nl/en/home-en/
  • 5. B. Problems ● How to provide secure services? ● How to ensure availability of our services? ● How to build a better security practice? ● How to give a better experience to our users (therapists and clients)?
  • 6. C. Goals ● Why do we need monitoring and logging systems? ○ We are trying to give our users a secure mental healthcare service ○ Highly available service for our users ○ Compliance with national, regional and international security standards ■ NEN 7510-2:2017 (the Netherlands’ national standard for health information system security) ■ GDPR (regional data protection regulation of the European Union) ■ ISO 27001:2013 (international standard for information security management systems) ○ Better user experience for our users (therapists and clients)
  • 8. D. Architectures of Our Application - An Overview
  • 9. E. Monitoring and Logging Architectures Overview
  • 10. E. Architectures - Sentry 10
  • 11. E. Architectures - Elasticsearch and Kibana
  • 12. E. Architectures - Prometheus, AlertManager and OpsGenie
  • 13. E. Architectures - Prometheus and Grafana
  • 15. F. Current Implementation - Elasticsearch + Kibana ● Elasticsearch + Kibana ○ Functions ■ Managing logs from Docker containers and hosts ■ Weekly log inspection ● Measure the performance of our services (e.g. Apdex) ● Find errors in Docker container logs or system logs ■ Root cause analysis on system or application logs per incident ■ Service endpoint deprecation ■ etc. ○ Abilities ■ Retain all logs for years (long term) ■ Fast queries on various logs over wide time ranges
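The slide above names Apdex as one of the performance measures read from the logs. As a minimal sketch of how that score is computed: a request counts as satisfied if it finishes within a threshold T, as tolerating if it finishes within 4T, and as frustrated otherwise. The 0.5 second threshold and the sample response times below are illustrative assumptions, not values from the talk.

```python
from typing import Iterable


def apdex(response_times_s: Iterable[float], threshold_s: float = 0.5) -> float:
    """Return the Apdex score for a sample of response times in seconds.

    satisfied:  t <= T
    tolerating: T < t <= 4T (counts as half)
    frustrated: t > 4T (counts as zero)
    """
    satisfied = tolerating = total = 0
    for t in response_times_s:
        total += 1
        if t <= threshold_s:
            satisfied += 1
        elif t <= 4 * threshold_s:
            tolerating += 1
    if total == 0:
        raise ValueError("need at least one response time")
    return (satisfied + tolerating / 2) / total


# Hypothetical response times, e.g. extracted from access logs.
sample = [0.12, 0.30, 0.45, 0.90, 1.80, 2.50]
print(round(apdex(sample, threshold_s=0.5), 3))
```

A score close to 1.0 means most requests met the threshold. In practice the response times would come from a Kibana query rather than a hard-coded list.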
  • 16. F. Current Implementation - Elasticsearch + Kibana (2) ● Deployment ○ Managed service at Elastic Cloud ○ Previously, we used Logstash to ingest Filebeat logs. Now Filebeat sends logs to Elasticsearch directly
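For container logs shipped by Filebeat, one common approach is to have the Django process write one JSON object per line to stdout, so each log event arrives in Elasticsearch as structured fields. The sketch below is an assumption about how that could look, not the speaker's configuration. The `JsonFormatter` class and the field names are hypothetical, and the `LOGGING` dictionary follows the standard `logging.config.dictConfig` schema that Django's `LOGGING` setting uses.

```python
import json
import logging
import logging.config


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (hypothetical field names)."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# In a Django project this dictionary would live in settings.py.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": JsonFormatter}},
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "json"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

logging.config.dictConfig(LOGGING)
logging.getLogger("demo").info("user %s logged in", "alice")
```

Writing to the console keeps the application unaware of the log pipeline. The container runtime captures the stream and the shipper decides where it goes.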
  • 17. F. Current Implementation - Sentry 10 ● Sentry 10 ○ Functions ■ Manage bugs / exceptions from our Django, Python, React.js and React Native projects ● Bug management for every release ■ Performance analytics tools for developers ■ Root cause analysis at the application code level ● Bug tracing ○ Abilities ■ Retain caught exceptions for years (long term)
  • 18. F. Current Implementation - Sentry 10 (2) ● Deployment ○ On-premises at Google Cloud Platform ■ 3 VM instances to host Sentry 10 containers managed by container orchestration ● E2-standard-4: 4 vCPUs, 16 GB of RAM ■ Cloud SQL for the Sentry 10 database that stores its event records ■ Cloud Storage to host Sentry 10 data ○ Sentry 10 is quite complex. It requires Apache Kafka and ClickHouse as its new data stores.
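On the application side, a Django project reports to a self-hosted Sentry through the `sentry-sdk` package. The following is a minimal sketch under stated assumptions: the environment variable names (`SENTRY_DSN`, `APP_RELEASE`, `APP_ENV`, `SENTRY_TRACES_SAMPLE_RATE`) and the default sample rate are hypothetical, and the talk does not show the actual settings. Setting `release` is what lets Sentry group exceptions per release, and `traces_sample_rate` enables the performance (APM) data mentioned in the slides.

```python
import os


def build_sentry_options(environ=os.environ):
    """Collect Sentry options from environment variables (names are assumptions)."""
    return {
        "dsn": environ.get("SENTRY_DSN", ""),  # an empty DSN disables reporting
        "release": environ.get("APP_RELEASE"),
        "environment": environ.get("APP_ENV", "development"),
        # Fraction of requests traced for performance monitoring.
        "traces_sample_rate": float(environ.get("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
        # Do not attach user details to events by default.
        "send_default_pii": False,
    }


def init_sentry(environ=os.environ):
    """Call once from settings.py; requires the sentry-sdk package."""
    # Imported lazily so this module can be loaded without sentry-sdk installed.
    import sentry_sdk
    from sentry_sdk.integrations.celery import CeleryIntegration
    from sentry_sdk.integrations.django import DjangoIntegration

    sentry_sdk.init(
        integrations=[DjangoIntegration(), CeleryIntegration()],
        **build_sentry_options(environ),
    )
```

Keeping `send_default_pii` off is a conservative choice for a healthcare product. Whether it matches the team's real configuration is not stated in the talk.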
  • 19. F. Current Implementation - Prometheus ● Prometheus + Grafana ○ Functions ■ OKR evaluation ● Weekly ● Every 6 months ■ Root cause analysis using server and application metrics ○ Abilities ■ Retain resource and application metrics for a month (short term)
  • 20. F. Current Implementation - Prometheus (2) ● Prometheus + Alert Manager + OpsGenie ○ Functions ■ Service uptime monitoring ● Service performance, whether it’s getting slower ■ VM status monitoring ● Memory ● CPU ● Disk/IO ● Uptime ● etc. ○ Abilities ■ Faster alerting to the Infrastructure Team ● Alerts might arrive in under 1 minute or under 5 minutes ○ SMS ○ Push notification ○ Phone call ● OpsGenie will keep your phone ringing until you respond to the alert.
  • 21. F. Current Implementation - Prometheus (3) ● Deployment ○ On-premise at Google Cloud Platform ■ Single VM instance to host Prometheus and Alert Manager ● E2-standard-2: 2 vCPUs, 8 GB of RAM ■ Grafana is deployed on our container orchestration platform, co-hosted with other infrastructure team services.
  • 22. F. Current Implementation - Security We secure the deployment of Prometheus, Elasticsearch + Kibana and Sentry by applying these actions: - Deploy those tools in a private network - Only the Infrastructure team has access to those tools for management purposes - Every user of those tools has least privileges. - Only a few people are superadmins for administration purposes. - Access to the private network requires 2FA
  • 23. F. Current Implementation vs The History Behind It - Back in 2017, we used New Relic as our monitoring tool. - But its capability for storing logs from our servers and Docker containers wasn’t satisfying. Therefore, we built an on-premise Elasticsearch cluster - The alerting system wasn’t satisfying either. So we built our own alerting system using on-premise Prometheus - Finally, we found that Sentry 9 was simpler than New Relic for managing exceptions from our application. So we built our bug management on Sentry 9 - 2019, Sentry and Prometheus moved to Google Cloud Platform, still on-premise - We faced networking issues with our local cloud provider, which left our infrastructure deployed in an unstable situation. - 2019, Elasticsearch + Kibana upgraded - We moved Elasticsearch and Kibana to Elastic Cloud because the log volume we managed was nearly 1 TB and it was really hard to scale. Moreover, networking issues were one of the main problems with that local cloud provider - 2020, Sentry upgraded from version 9 to 10 - We moved to Sentry 10 because we wanted to use the APM provided by this new version. But we still deploy it on-premise at Google Cloud Platform. Sentry Cloud is quite expensive, as it is charged per number of developers in our company.
  • 25. G. Usage examples - Prometheus
  • 26. G. Usage examples - Prometheus
  • 27. G. Usage examples - Prometheus
  • 28. G. Usage examples - Prometheus + OpsGenie
  • 29. G. Usage examples - Elasticsearch + Kibana
  • 30. G. Usage examples - Sentry 10
  • 31. G. Usage examples - Sentry 10
  • 32. G. Usage examples - Sentry 10
  • 33. G. Usage examples - Sentry 10
  • 34. G. Usage examples - Sentry 10
  • 35. G. Usage examples - Sentry 10
  • 37. H. Impacts ● Those tools help us to provide secure services ○ Prometheus + OpsGenie ■ Warn us if SSL certificates are about to expire. ○ Elasticsearch + Kibana ■ Weekly log inspection ● Anomalies in HTTP requests coming to our services ○ Calls to unknown endpoints ○ Unusual request volumes exceeding the normal requests per second. ● Find suspicious SSH logins by anyone other than our whitelisted users ● Find suspicious scripts being executed by cron ● Find commands executed by whitelisted users that might put our services in danger ○ Sentry ■ Find any parts of the application that might lead to bugs ○ etc.
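The certificate expiry warning mentioned above comes down to comparing a certificate's `notAfter` date with the current time. The sketch below shows that arithmetic with the Python standard library only. It is an illustration, not the Prometheus rule the team uses, and the function names and the 14-day threshold in the usage comment are assumptions.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan 31 00:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example usage (needs network access, so it is left commented out):
# if days_until_expiry(fetch_not_after("example.com")) < 14:
#     print("certificate expires in less than two weeks")
```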
  • 38. H. Impacts (2) ● Those tools help us to ensure availability of our services ○ Prometheus + OpsGenie ■ Faster response to incidents in our infrastructure, 24/7 ■ Improve our infrastructure by keeping it optimized and efficient ● Reduce cost for underperforming VMs ■ Detect unapplied migration scripts from the backend service ● They might crash the backend service if we can’t detect them early ○ Elasticsearch + Kibana ■ Highly available log inspection to help root cause analysis when an incident happens ● Find error output in Docker container logs across our Docker-based services ● Find error output in system logs across our servers ■ We don’t have to SSH into our servers to find system error logs ■ We don’t have to check Docker logs to find service error logs ○ Sentry ■ We can configure Sentry to send OpsGenie alerts, triggered when an exception is caught from our services. ○ etc.
  • 39. H. Impacts (3) ● Those tools help us to build a better security practice ○ Elasticsearch + Kibana ■ Highly available log inspection to perform further root cause analysis after an incident that happened last week or last month ○ Prometheus + Grafana ■ Monitor incident response performance through various sources ● MTTA, mean time to acknowledge ● MTTR, mean time to resolve ● MTBF, mean time between failures ● 99PTA, 99th percentile time to acknowledge ● 99PTR, 99th percentile time to resolve ■ Decide on better strategies every new OKR period. ● For example, the infrastructure team maintains its workflows related to NiceDay security practice
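The incident response metrics listed above can be computed from three timestamps per incident: when it opened, when it was acknowledged, and when it was resolved. The sketch below is one possible set of definitions and should be read as an assumption. It measures both MTTA and MTTR from the moment the incident opened, takes MTBF as the gap between one incident's resolution and the next incident's start, and uses a nearest-rank percentile. The sample incidents are made up.

```python
import math
from datetime import datetime, timedelta
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(incidents):
    """incidents: list of (opened, acknowledged, resolved) datetimes."""
    ordered = sorted(incidents)
    tta = [(ack - opened).total_seconds() / 60 for opened, ack, _ in ordered]
    ttr = [(res - opened).total_seconds() / 60 for opened, _, res in ordered]
    # Time between one incident's resolution and the next incident's start.
    gaps = [
        (nxt[0] - prev[2]).total_seconds() / 3600
        for prev, nxt in zip(ordered, ordered[1:])
    ]
    return {
        "mtta_min": mean(tta),
        "mttr_min": mean(ttr),
        "mtbf_hours": mean(gaps) if gaps else None,
        "p99_tta_min": percentile(tta, 99),
        "p99_ttr_min": percentile(ttr, 99),
    }


# Hypothetical incident history.
t0 = datetime(2021, 1, 1)
h, m = timedelta(hours=1), timedelta(minutes=1)
incidents = [
    (t0, t0 + 2 * m, t0 + 30 * m),
    (t0 + 10 * h, t0 + 10 * h + 4 * m, t0 + 10 * h + 60 * m),
    (t0 + 30 * h, t0 + 30 * h + 6 * m, t0 + 30 * h + 90 * m),
]
print(summarize(incidents))
```

With only a handful of incidents the 99th percentile is simply the worst case. It becomes informative once an OKR period contains many alerts.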
  • 40. H. Impacts (4) ● Those tools help us to give a better experience to our users ○ Sentry ■ Faster debugging for developers in their codebases ● They can see how an exception was produced through the amazing stack trace visualization ● They can see which release an exception was caught in ● They can find the line where an exception was raised ● For example, the backend team can debug the Django and Celery codebase more easily and quickly ● Etc. ○ Elasticsearch + Kibana ■ Improve the backend service based on performance analysis ■ Backend service endpoint deprecation ■ Help developers find performance bottlenecks in the service
  • 41. H. Impacts (5) ● Other impacts ○ Stay compliant with security standards as assurance to clients. ○ Management can see an overview of service status when they need it ○ Management can see in-house teams and products improving ○ etc.
  • 42. I. Best Practices ● Prometheus + OpsGenie: refine your alerting rules periodically to better suit your team’s needs ● Whichever the tool ○ Enforce a least privilege setup ■ Assign people only what they need. Don’t give them roles beyond what their tasks require ○ Enable two-factor authentication when it’s possible ○ Set up a process in your team to manage all credentials that you manage ■ You might utilize password managers (e.g. 1Password, Dashlane, Bitwarden, LogMeOnce, etc.) ■ Manage secret key and password rotation to keep your monitoring infrastructure secure ○ Evaluate your security-related processes in the team ■ Threats might come from inside too. For example: ● Bugs from the development team ● Human error when performing a particular task on the infrastructure ○ Connect to your logging infrastructure over a private connection ■ Use a secure approach to connect to your third-party logging services ○ Deploy and manage your logging infrastructure in a private network ■ For example, separate the monitoring and logging private network from the warehouse, staging, and production private networks.
  • 43. Let’s wrap up By enabling monitoring and logging systems, we might be able to: ● provide secure services ● ensure availability of our services ● build a better security practice ● give better experience for our users
  • 44. References ● Sentry ○ https://develop.sentry.dev/self-hosted/ ○ https://docs.sentry.io/product/ ● Elastic Cloud ○ https://www.elastic.co/guide/index.html ○ https://www.elastic.co/guide/en/kibana/current/index.html ● Prometheus ○ https://prometheus.io/docs/prometheus/latest/getting_started/ ○ https://prometheus.io/docs/alerting/latest/alertmanager/ ○ https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-prometheus/ ● Security Practices, especially for Monitoring and Logging ○ https://sre.google/sre-book/table-of-contents/ ○ NEN 7510-2:2017 - 12.4 Reporting and monitoring -> https://www.webtoolmanagementsystemen.nl/en/ViewDocumentSection/d873e9df-44ae-413b-8564-7ca7df60bde1/d873e9df-44ae-413b-8564-7ca7df60bde1/255021a3-1c42-4700-98f6-7f04eb16274f#8f13d102-3e26-4580-a20c-f4ae375725cb ○ ISO 27001:2013 - Annex A - A.12 Operations Security - A.12.4 Logging and Monitoring
  • 45. Special Thanks! ● PyCon Indonesia 2021, who made this possible! ● Kurnia Jaya Eliazar, Team Manager at NiceDay, for reviewing my slides and giving amazing feedback ● NiceDay Infrastructure Team, who gave me unlimited chances to implement and improve NiceDay infrastructure ● Former Ebizu Data Team, who gave me a lot of chances to explore AWS and Python application development on a big data project. ● Bramandityo Prabowo, who taught me Python, Linux, Django and many other things at college
  • 46. Keep in touch ● Reach me at ○ E-mail: ridwanbejo@gmail.com ○ LinkedIn: https://www.linkedin.com/in/ridwan-fadjar-79781756/ ○ Github: https://github.com/ridwanbejo ○ Google Scholar: https://scholar.google.com/citations?hl=en&user=edU-dL8AAAAJ
  • 47. Q & A