Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitoring with sentry, elk and prometheus

Django Application Monitoring
with Sentry, ELK and
Prometheus
By Ridwan Fadjar Septian
Cloud Infrastructure Engineer at NiceDay Nederland B.V.
PyCon ID 2021

Introduction
- My name is Ridwan Fadjar Septian
- Living in Bandung, Indonesia
- My career journey:
- 2014 - 2016, Web Programmer using PHP
- 2016 - 2017, Backend Engineer using Django
- 2017, Backend Engineer on a big data project using AWS Lambda, AWS Kinesis, AWS
EMR + PySpark and AWS S3 as the data lake. Also Cloud Infrastructure Engineer
- 2017 - 2018, Backend Engineer using Django. Also Cloud Infrastructure Engineer
at NiceDay Nederland B.V.
- 2018 - present, Cloud Infrastructure Engineer at NiceDay Nederland B.V., mostly
working with Google Cloud Platform
- My favorites
- Programming languages: Python and JavaScript
- Web frameworks: Django
- Operating system: Linux
- My interests: Open Source Projects, AI, DevOps, Cloud Infrastructure, Software Engineering,
IT Governance, IT Security, Computer Networking, etc.

A. Company Background
- NiceDay Nederland B.V.
- Online mental healthcare provider since 2014
- Covers the national market in the Netherlands
- Planning to expand into the international market
- Aiming to become a leader in mental healthcare services, competing with
other companies in the national sector
- Based in Rotterdam, NL
- Branch office in Bandung, ID
- +/- 50 employees in Rotterdam and Bandung combined
- Employees come from diverse nationalities and backgrounds
- Learn more about us -> https://nicedaynederland.nl/en/home-en/

B. Problems
● How to provide secure services?
● How to ensure availability of our services?
● How to build a better security practice?
● How to give a better experience to our users (therapists and clients)?

C. Goals
● Why do we need monitoring and logging systems?
○ We are trying to give our users a secure mental healthcare service
○ Highly available service for our users
○ Compliance with national, regional and international security standards
■ NEN 7510-2:2017 (the Netherlands’ national standard for health
information system security)
■ GDPR (regional data protection regulation of the European Union)
■ ISO 27001:2013 (international standard for information security
management systems)
○ Better user experience for our users (therapists and clients)

D. Architectures of Our Application - An Overview

E. Monitoring and Logging Architectures Overview

E. Architectures - Elasticsearch and Kibana

E. Architectures - Prometheus, Alertmanager and OpsGenie

E. Architectures - Prometheus and Grafana

F. Current Implementation - Elasticsearch + Kibana
● Elasticsearch + Kibana
○ Functions
■ Managing logs from Docker containers and hosts
■ Weekly log inspection
● Measure the performance of our services (e.g. Apdex)
● Find errors in Docker container logs or system logs
■ Root cause analysis on system or application logs per incident
■ Service endpoint deprecation
■ etc.
○ Ability
■ Retain all logs for years (long term)
■ Fast queries across various logs over a wide time range
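
The Apdex score mentioned above follows a standard formula: (satisfied + tolerating / 2) / total, where a request counts as satisfied when its latency is at most a threshold T and as tolerating when it is at most 4T. The talk does not show how the score is computed from the logs, so the following is only a minimal sketch; the 0.5 second threshold and the sample latencies are assumptions for illustration.

```python
from typing import Iterable


def apdex(latencies_s: Iterable[float], threshold_s: float = 0.5) -> float:
    """Return the Apdex score for a set of response times in seconds.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    frustrated: latency > 4T (contributes nothing to the score)
    """
    samples = list(latencies_s)
    if not samples:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in samples if t <= threshold_s)
    tolerating = sum(1 for t in samples if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(samples)


# Hypothetical response times, e.g. extracted from one week of access logs.
weekly = [0.12, 0.30, 0.45, 0.80, 1.90, 2.50]
score = apdex(weekly)
```

With these sample values, three requests are satisfied, two are tolerating and one is frustrated.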

F. Current Implementation - Elasticsearch + Kibana (2)
● Deployment
○ Managed services at Elastic Cloud
○ Previously, we used Logstash to ingest logs from Filebeat. Now Filebeat
sends logs to Elasticsearch directly
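
Because Filebeat ships whatever the containers write, emitting one JSON object per log line tends to make the logs easier to query in Kibana. The talk does not describe the log format used, so this is a sketch under assumptions: the `JsonFormatter` class, the field names and the logger name are all hypothetical. In a Django project the same formatter could be referenced from the `LOGGING` setting.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Attach the JSON formatter to a stream handler (stdout in a container)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("backend.example")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for sys.stdout so the output can be inspected
log = build_logger(buffer)
log.info("request %s finished", "abc123")
entry = json.loads(buffer.getvalue())
```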

F. Current Implementation - Sentry 10
● Sentry 10
○ Functions
■ Manage bugs / exceptions from our Django, Python, React.js and
React Native projects
● Bug management for every release
■ Performance analytics tools for developers
■ Root cause analysis at the application code level
● Bug tracing
○ Ability
■ Retain caught exceptions for years (long term)
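
For a Django and Celery backend, reporting to Sentry usually starts with a call to `sentry_sdk.init` in `settings.py`. The talk does not show its configuration, so the sketch below is an assumption throughout: the helper names, the environment variable names other than `SENTRY_DSN`, and the sample rate are illustrative only. The SDK is imported lazily so the option-building part can be read and run without it installed.

```python
from typing import Any, Dict


def build_sentry_options(env: Dict[str, str]) -> Dict[str, Any]:
    """Collect keyword arguments for sentry_sdk.init from environment values."""
    return {
        "dsn": env.get("SENTRY_DSN"),  # without a DSN nothing is sent
        "environment": env.get("SENTRY_ENVIRONMENT", "development"),
        "release": env.get("SENTRY_RELEASE"),  # ties events to a release
        "traces_sample_rate": float(env.get("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
        "send_default_pii": False,  # avoid sending user details
    }


def init_sentry(env: Dict[str, str]) -> None:
    """Call from settings.py; requires the sentry-sdk package."""
    import sentry_sdk
    from sentry_sdk.integrations.celery import CeleryIntegration
    from sentry_sdk.integrations.django import DjangoIntegration

    sentry_sdk.init(
        integrations=[DjangoIntegration(), CeleryIntegration()],
        **build_sentry_options(env),
    )


# Hypothetical release identifier, for illustration only.
options = build_sentry_options({"SENTRY_RELEASE": "backend@1.2.3"})
```

In a real project, `init_sentry(dict(os.environ))` would be called once at startup. Setting `release` is what lets Sentry group exceptions per release, and `traces_sample_rate` enables the performance (APM) data.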

F. Current Implementation - Sentry 10 (2)
● Deployment
○ On-premise (self-hosted) on Google Cloud Platform
■ 3 VM instances hosting the Sentry 10 containers, managed by container
orchestration
● e2-standard-4: 4 vCPUs, 16 GB of RAM
■ Cloud SQL as the Sentry 10 database, storing its event records
■ Cloud Storage to host Sentry 10 data
○ Sentry 10 is quite complex: it requires Apache Kafka and ClickHouse
as its new data stores.

F. Current Implementation - Prometheus
● Prometheus + Grafana
○ Function
■ OKR evaluation
● Weekly
● Every 6 months
■ Root cause analysis using server and application metrics
○ Ability
■ Retain resource and application metrics for a month (short term)

F. Current Implementation - Prometheus (2)
● Prometheus + Alertmanager + OpsGenie
○ Function
■ Service uptime monitoring
● Whether service performance is getting slower
■ VM status monitoring
● Memory
● CPU
● Disk/IO
● Uptime
● etc.
○ Ability
■ Faster alerting for the Infrastructure Team
● Alerts may arrive in under 1 minute or within 5 minutes
○ SMS
○ Push Notification
○ Phone Call
● OpsGenie will keep your phone ringing until you respond to the
alert.

F. Current Implementation - Prometheus (3)
● Deployment
○ On-premise (self-hosted) on Google Cloud Platform
■ Single VM instance hosting Prometheus and Alertmanager
● e2-standard-2: 2 vCPUs, 8 GB of RAM
■ Grafana is deployed on our container orchestration platform, co-hosted with other
services used by the infrastructure team.

F. Current Implementation - Security
We secure the deployment of Prometheus, Elasticsearch + Kibana and Sentry by
applying these measures:
- Deploy those tools in a private network
- Only the Infrastructure team has access to those tools for management purposes
- Every user of those tools has least privileges.
- Only a few people are superadmins for administration purposes.
- Access to the private network requires 2FA

F. Current Implementation vs The History Behind it
- Back in 2017, we used New Relic as our monitoring tool.
- But its capability for storing logs from our servers and Docker containers wasn’t
satisfying. Therefore, we built an on-premise Elasticsearch cluster
- The alerting system wasn’t satisfying either. So we built our own alerting system using
on-premise Prometheus
- Finally, we found that Sentry 9 was simpler than New Relic for managing exceptions
from our application. So we built our bug management on Sentry 9
- 2019, Sentry and Prometheus moved to Google Cloud Platform, still self-hosted (on-premise)
- We faced networking issues with our local cloud provider, so we could not keep
deploying our infrastructure in such an unstable situation.
- 2019, Elasticsearch + Kibana upgraded
- We moved Elasticsearch and Kibana to Elastic Cloud because the log volume we managed
was nearly 1 TB and it was really hard to scale. Moreover, the networking issue was one of the
main problems with that local cloud provider
- 2020, Sentry upgraded from version 9 to 10
- We moved to Sentry 10 because we wanted to use the APM provided by this new
version. But we still deploy it on-premise at Google Cloud Platform. Sentry
Cloud is quite expensive, as it is charged per number of developers in our company.

G. Usage examples - Prometheus

G. Usage examples - Prometheus + OpsGenie

G. Usage examples - Elasticsearch + Kibana

H. Impacts
● Those tools help us to provide secure services
○ Prometheus + OpsGenie
■ Warn us when SSL certificates are about to expire.
○ Elasticsearch + Kibana
■ Weekly log inspection
● Anomalies in HTTP requests coming to our services
○ Calls to unknown endpoints
○ Unusual numbers of requests exceeding the
normal requests per second.
● Find suspicious SSH access by anyone other than our
whitelisted users
● Find suspicious scripts being executed by cron
● Find commands executed by whitelisted users which might
put our services in danger
○ Sentry
■ Find any parts of the application that might lead to bugs
○ etc.
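
The certificate expiry warning above comes down to comparing a certificate's `notAfter` date with the current time. The talk does not say how the check is implemented in Prometheus, so this is just a sketch of the underlying arithmetic using the standard library; the 14-day threshold and the sample dates are assumptions.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter field.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jun  1 12:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def should_warn(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True when the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


# Hypothetical dates: the certificate expires one week after `now`.
now = datetime(2030, 5, 25, 12, 0, 0, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun  1 12:00:00 2030 GMT", now)
```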

H. Impacts (2)
● Those tools help us to ensure availability of our services
○ Prometheus + OpsGenie
■ Faster response to incidents in our infrastructure, 24/7
■ Improve our infrastructure by keeping it optimized and efficient
● Reduce cost for underperforming VMs
■ Detect unapplied migration scripts in the backend service
● They might crash the backend service if we can’t detect them early
■ Highly available log inspection to help root cause analysis when an incident
happens
● Find error output in Docker container logs across our
Docker-based services
● Find error output in system logs across our servers
■ We don’t have to SSH into our servers to find system error logs
■ We don’t have to check Docker logs to find service error logs
○ Sentry
■ We can configure Sentry to send OpsGenie alerts, triggered when an
exception is caught from our services.
○ etc.
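
One way to let Prometheus detect unapplied migrations is to expose their count as a gauge and alert when it is above zero. The talk does not describe how this is done at NiceDay, so the sketch below is an assumption: the metric name and helper functions are hypothetical, and the Django part needs a configured project to run. It uses Django's `MigrationExecutor` to list pending migrations and renders the count in the Prometheus text exposition format.

```python
from typing import Iterable, List, Tuple


def render_unapplied_gauge(
    plan: Iterable[Tuple[str, str]], database: str = "default"
) -> str:
    """Render the number of pending migrations in Prometheus text format.

    `plan` is a list of (app_label, migration_name) pairs still to be applied.
    """
    count = len(list(plan))
    name = "backend_unapplied_migrations"  # hypothetical metric name
    return (
        f"# HELP {name} Migrations not yet applied to the database.\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{database="{database}"}} {count}\n'
    )


def pending_migrations(database: str = "default") -> List[Tuple[str, str]]:
    """Ask Django which migrations are pending (requires a configured project)."""
    from django.db import connections
    from django.db.migrations.executor import MigrationExecutor

    executor = MigrationExecutor(connections[database])
    targets = executor.loader.graph.leaf_nodes()
    # migration_plan returns (Migration, backwards) pairs for unapplied migrations
    return [(m.app_label, m.name) for m, _backwards in executor.migration_plan(targets)]


# Hypothetical pending migration, for illustration only.
text = render_unapplied_gauge([("accounts", "0007_add_field")])
```

A metrics view could return `render_unapplied_gauge(pending_migrations())`, and an alerting rule could then fire whenever the gauge stays above zero after a deployment.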

H. Impacts (3)
● Those tools help us to build a better security practice
■ Highly available log inspection to perform further root cause analysis
after an incident that happened last week or last month
○ Prometheus + Grafana
■ Monitor incident response performance through various
sources
● MTTA, mean time to acknowledge
● MTTR, mean time to resolve
● MTBF, mean time between failures
● 99PTA, 99th percentile time to acknowledge
● 99PTR, 99th percentile time to resolve
■ Decide on better strategies every new OKR period.
● For example, the infrastructure team maintains its workflows
related to NiceDay security practice
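
The incident response metrics listed above can be derived from three timestamps per incident. The sketch below shows one way to compute them; it is not the team's actual implementation. The `Incident` fields, the sample data, the nearest-rank percentile method and the choice to measure time between failures from one incident's resolution to the next one's start are all assumptions, since definitions vary between teams.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class Incident:
    opened: float        # minutes since the start of the period
    acknowledged: float
    resolved: float


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(incidents: List[Incident]) -> dict:
    """Compute MTTA, MTTR, MTBF and the 99th percentiles for a list of incidents."""
    tta = [i.acknowledged - i.opened for i in incidents]
    ttr = [i.resolved - i.opened for i in incidents]
    ordered = sorted(incidents, key=lambda i: i.opened)
    # time between failures: from one incident's resolution to the next one's start
    gaps = [b.opened - a.resolved for a, b in zip(ordered, ordered[1:])]
    return {
        "mtta": mean(tta),
        "mttr": mean(ttr),
        "mtbf": mean(gaps) if gaps else None,
        "p99_tta": percentile(tta, 99),
        "p99_ttr": percentile(ttr, 99),
    }


# Hypothetical incidents, all times in minutes.
incidents = [
    Incident(opened=0, acknowledged=2, resolved=30),
    Incident(opened=500, acknowledged=504, resolved=560),
    Incident(opened=2000, acknowledged=2003, resolved=2015),
]
report = summarize(incidents)
```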

H. Impacts (4)
● Those tools help us to give better experience for our users
○ Sentry
■ Faster debugging in their codebases for developers
● They can see how an exception was produced through the amazing stack trace
visualization
● They can see which release an exception was caught in
● They can find the line where an exception was caught
● For example, the backend team can debug the Django and Celery codebase
more easily and faster
● Etc.
■ Improve the backend service through performance analysis
■ Backend service endpoint deprecation
■ Help developers find performance bottlenecks in the service

H. Impacts (5)
● Other impacts
○ Stay compliant with security standards as assurance to
clients.
○ Management can see an overview of service status when they
need it
○ Management can see that in-house teams and products are
improving
○ etc.

I. Best Practices
● Prometheus + OpsGenie: refine your alerting rules periodically so they better suit
your team’s needs
● Whichever tool you use
○ Enforce a least-privilege setup
■ Assign people only what they need. Don’t give them roles that are not
necessary for their tasks
○ Enable two-factor authentication whenever possible
○ Set up a process in your team to manage all the credentials you hold
■ You might use password managers (e.g. 1Password, Dashlane,
Bitwarden, LogMeOnce, etc.)
■ Manage secret key and password rotation to keep your monitoring
infrastructure secure
○ Evaluate your team’s security-related processes
■ Threats might also come from inside. For example:
● Bugs from the development team
● Human error when performing particular tasks on the infrastructure
○ Connect to your logging infrastructure over a private connection
■ Use a secure approach to connect to your third-party logging services
○ Deploy and manage your logging infrastructure in a private network
■ For example, separate the monitoring and logging private network
from the warehouse, staging and production private networks.

Let’s wrap up
By enabling monitoring and logging systems, we might be able to:
● provide secure services
● ensure availability of our services
● build a better security practice
● give better experience for our users

References
● Sentry
○ https://develop.sentry.dev/self-hosted/
○ https://docs.sentry.io/product/
● Elastic Cloud
○ https://www.elastic.co/guide/index.html
○ https://www.elastic.co/guide/en/kibana/current/index.html
● Prometheus
○ https://prometheus.io/docs/prometheus/latest/getting_started/
○ https://prometheus.io/docs/alerting/latest/alertmanager/
○ https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-prometheus/
● Security Practices, especially for Monitoring and Logging
○ https://sre.google/sre-book/table-of-contents/
○ NEN 7510-2:2017 - 12.4 Reporting and monitoring ->
https://www.webtoolmanagementsystemen.nl/en/ViewDocumentSection/d873e9df-44ae-413b-8564-7ca7df60bde1/d873e9df-44ae-413b-8564-7ca7df60bde1/255021a3-1c42-4700-98f6-7f04eb16274f#8f13d102-3e26-4580-a20c-f4ae375725cb
○ ISO 27001:2013 - Annex A - A.12 Operations Security - A.12.4 Logging and Monitoring

Special Thanks!
● PyCon Indonesia 2021, who made this possible!
● Kurnia Jaya Eliazar, Team Manager at NiceDay, for reviewing my slides and
giving amazing feedback
● NiceDay Infrastructure Team, who gave me unlimited chances to implement
and improve NiceDay infrastructure
● Former Ebizu Data Team, who gave me many chances to explore
AWS and Python application development on big data projects.
● Bramandityo Prabowo, who taught me Python, Linux, Django and
many other things in college

Keep in touch
● Reach me at
○ E-mail: ridwanbejo@gmail.com
○ LinkedIn: https://www.linkedin.com/in/ridwan-fadjar-79781756/
○ GitHub: https://github.com/ridwanbejo
○ Google Scholar: https://scholar.google.com/citations?hl=en&user=edU-dL8AAAAJ
