[Global logic] media high availability service

Confidential
Media High Availability Service
November 2020

Nazariy Mamrokha - Engineering Director, Media
● More than 13 years of experience in software development and the media domain, including 10 years of software development and 6 years of management.
● Strong experience in the media and broadcasting domain.
● Architecting solutions for:
○ Media OTT applications (mobile, living room, gaming consoles, Smart TVs, Web),
○ Backend services (CMS/CDN, streaming services, subscription management, Ad-Tech solutions, billing and monetization, analytics, application store).
● Leads the Media program of 150+ engineers working on 25+ projects.

Intro to Media

End-to-End Video Content Lifecycle
Lifecycle stages:
● Produce content
● Get your content and data into the system
● Perform all necessary content manipulations
● Deliver to end-user
● Play on any device
● Monetize
Capabilities shown across the lifecycle (grouped as on the slide):
● Capture/Edit, Effects, Workflows, Finishing
● Manage, Extract, Archive, Search, Workflows
● Ingest, Metadata
● Encode/Transcode, ABR, Codecs
● Store, Host, Organize, Scale, Backup
● CDN, ABR, Packaging
● Encrypt, DRM, CAS, Algorithms
● Apps, Any platform, Any device
● Subscription management, Ad Exchange, Billing
● Analytics
Industry:
● Content providers (TV networks, studios, video bloggers, etc.)
● Service providers & technology vendors (telecom, broadband, CDN, ISVs, etc.)
● OEM (connected devices, consumer electronics, etc.)
GlobalLogic Engineering Services: Engineering, QA & Automation, DevOps, Migration, Design, Architecture, Support

Cloud video streaming platform
OTT-service-like cloud platform for video content delivery and monetization
Platform components: VOD, Live, MAM, Metadata management/CMS, cDVR, SSAI, User management, API, CDN, Clients
Functions and data flows labeled on the diagram:
● Ingest, Transcode, Playout, Package
● VoD Library, Scheduling, EPG
● Timeshifted TV, Personal Recordings
● Ad management, Ad Insertion, Ad Tracking, Ad Decisions
● Content with ads
● VoD, Linear (HLS)
● Metadata
● User profiles, Auth

Serhiy Onanchenko - NOC Team Leader
● Over 18 years of professional experience in the IT industry
● Full-stack developer, DBA, Linux/Windows system administrator, network engineer
● Supported production-grade ecosystems in the telecom domain
● Managed support groups (30 members) providing 24/7 administration and monitoring services for 350+ customers
● Currently manages a NOC team of 12 engineers monitoring multiple high-load environments (up to 250K RPS, 3000+ instances)

NOC from scratch

Agenda
1. NOC - who are we?
- team structure
- scope
2. Incident management
3. Monitoring toolset
4. Monitoring challenges and best practices

NOC - who are we?

Current Team structure
1 Team Leader
12 NOC Engineers (2 people per shift)
● Linux, Windows systems
administration, automation
scripting
● Cloud computing and networks
● Web applications and servers
architecture, HTTP, REST API
● Monitoring tools and principles
● Strong troubleshooting and
problem-solving skills
● Good English language skills

Questions to audience
Poll #1
What is the largest environment you have supported?

Scope
● 5+ products, 1000+ B2B customers
● 9+ AWS production environments
● Microservices, Kubernetes clusters
● 3000+ running instances
● Up to 250K RPS
Availability target: up to 99.995% =
max 30.2 sec of downtime weekly
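The weekly figure follows directly from the target percentage. A minimal sketch of the arithmetic, assuming a 7-day window in which every second of unavailability counts against the budget; the function name `downtime_budget_seconds` is illustrative and not part of any tool named in this deck.

```python
from datetime import timedelta


def downtime_budget_seconds(availability_pct: float, window: timedelta) -> float:
    """Return the maximum downtime (seconds) an availability target allows in a window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return window.total_seconds() * unavailable_fraction


# 99.995% over one week: 604800 s * 0.00005
weekly_budget = downtime_budget_seconds(99.995, timedelta(weeks=1))
print(f"{weekly_budget:.1f} s per week")  # 30.2 s per week
```

The same function can be called with a different window, for example `timedelta(days=30)`, to express the budget per month.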
MTTA
(Mean Time to Acknowledge)
Target - 1 minute
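MTTA can be tracked as the average delay between an alert being raised and an engineer acknowledging it. The sketch below assumes alerts are available as hypothetical `(raised_at, acknowledged_at)` pairs; the export format of a real alerting tool is not shown in this deck, and the sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

MTTA_TARGET = timedelta(minutes=1)  # target stated on the slide


def mean_time_to_acknowledge(alerts: list[tuple[datetime, datetime]]) -> timedelta:
    """Average delay between an alert being raised and being acknowledged."""
    if not alerts:
        raise ValueError("need at least one acknowledged alert")
    total = sum((acked - raised for raised, acked in alerts), timedelta())
    return total / len(alerts)


# Hypothetical sample: (raised_at, acknowledged_at)
sample = [
    (datetime(2020, 11, 1, 10, 0, 0), datetime(2020, 11, 1, 10, 0, 40)),
    (datetime(2020, 11, 1, 12, 30, 0), datetime(2020, 11, 1, 12, 31, 10)),
    (datetime(2020, 11, 2, 3, 15, 0), datetime(2020, 11, 2, 3, 15, 25)),
]
mtta = mean_time_to_acknowledge(sample)
print(mtta, "within target" if mtta <= MTTA_TARGET else "over target")
```

Note that a mean can hide a single slow acknowledgement (70 seconds in the sample above), so it may be worth reviewing the worst case alongside the average.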

Responsibilities
● Monitoring infrastructure and services
● Managing and documenting incidents
● Maintaining monitoring systems and checks; implementing new metrics and monitoring scenarios
● Keeping an up-to-date directory of all third parties
● Acting as the single focal point that always knows the service level and issue status
● Defining reliable and preventive monitoring requirements as part of the product development life cycle
● Communication, coordination, collaboration

Incident management

Incident management process

Poll #2
Which incident management tools have you worked with?

Monitoring toolset

Monitoring toolset
● OpsGenie
● Prometheus
● Grafana
● Amazon CloudWatch
● PRTG
● Dotcom-Monitor
● Foglight
● Witbe robot
● Youbora
● Logz.io
● Plus multiple custom scripts/sensors

Poll #3
Which monitoring tools do you use for production environment monitoring?

Youbora Analytics

Witbe Robots
Witbe robots run end-to-end test scenarios on any device (PC, smartphone, STB) and monitor Quality of Experience (QoE).

Monitoring challenges and best practices

Monitoring challenges
● Mix of infrastructure setups and products
● Black-box monitoring
● Noise and false positives
● Anomaly detection
● Multiple communication channels
● Complicated and long runbooks
● Human in the middle of real-time operations

SRE Golden Signals to monitor
There are three common methodologies:
● Four Golden Signals (from the Google SRE book): Latency, Traffic, Errors, and Saturation
● USE Method (from Brendan Gregg): Utilization, Saturation, and
Errors
● RED Method (from Tom Wilkie): Rate, Errors, and Duration

The USE Method
Methodology for analyzing the performance of any system
A summary of USE is
“For every resource, check utilization, saturation, and errors.”
Resource: all physical server functional components (CPUs, disks,...)
● Utilization: the average time the resource was busy servicing work
● Saturation: the degree to which the resource has extra work
which it can’t service, often queued
● Errors: the count of error events
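The three USE checks can be sketched for a single resource as follows. This is an illustration under stated assumptions, not a monitoring integration: the `ResourceSample` fields, the 80% utilization limit, and the sample disk values are all hypothetical, and real numbers would come from the operating system or a metrics agent.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation window for a single resource (field names are illustrative)."""
    name: str
    busy_seconds: float      # time the resource spent servicing work
    interval_seconds: float  # length of the observation window
    queued_work: float       # work that could not be serviced yet
    error_count: int         # error events seen in the window


def use_metrics(sample: ResourceSample) -> dict[str, float]:
    """Compute utilization, saturation, and errors for one resource."""
    if sample.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        "utilization": sample.busy_seconds / sample.interval_seconds,
        "saturation": float(sample.queued_work),
        "errors": float(sample.error_count),
    }


def use_findings(sample: ResourceSample, utilization_limit: float = 0.8) -> list[str]:
    """Return human-readable findings; the 0.8 limit is an assumed threshold."""
    m = use_metrics(sample)
    findings = []
    if m["utilization"] > utilization_limit:
        findings.append(f"{sample.name}: utilization {m['utilization']:.0%}")
    if m["saturation"] > 0:
        findings.append(f"{sample.name}: {m['saturation']:g} items queued")
    if m["errors"] > 0:
        findings.append(f"{sample.name}: {m['errors']:g} errors")
    return findings


disk = ResourceSample("disk0", busy_seconds=54.0, interval_seconds=60.0,
                      queued_work=3, error_count=0)
for line in use_findings(disk):
    print(line)
```

Running the same function over every resource of a host (CPUs, disks, network interfaces) gives the "for every resource" checklist that the method describes.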

The RED Method
Methodology for services analysis
A summary of RED is
“For every service, check rate, errors, and duration.”
● Rate: the number of requests per second
● Errors: the number of those requests that are failing
● Duration: the amount of time those requests take
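The RED definitions translate into a small summary over a window of request records. A minimal sketch, assuming that a request counts as failing when its HTTP status is 5xx and using the mean as the duration statistic; both are assumptions made here for brevity (production dashboards often show duration percentiles instead), and the `Request` type is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    duration_ms: float
    status: int


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for requests seen in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    failed = [r for r in requests if r.status >= 500]  # assumption: 5xx means failure
    return {
        "rate_rps": len(requests) / window_seconds,
        "errors_rps": len(failed) / window_seconds,
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        "mean_duration_ms": mean(r.duration_ms for r in requests) if requests else 0.0,
    }


# Hypothetical two-second window with four requests, two of them failing
window = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 503)]
print(red_summary(window, window_seconds=2.0))
```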

Anton Bil - Senior Software Engineer
● Over 8 years of professional experience in the IT industry
● Strong experience in Linux/Windows system administration, DevOps, and SRE
● As an SRE, supported high-load infrastructures of 7,000+ servers for media and CDN services
● Currently works as an SRE providing support, optimization, and automation for high-load environments (up to 250K RPS, 3000+ instances)

SRE - who is a Site Reliability Engineer?

Poll #4
Does your organization formally use
Site Reliability Engineering?

Poll #5
How many of your incidents happen during changes?

How to achieve stability in media products?

“Day in the life of SRE”
1. Monitoring, Alerts management
2. Deployments
3. Automation
4. Processes/Documentation
5. Incident management

Official information sources by Google:
books
online course

Summary
● The media/OTT streaming industry keeps growing and has surged because of COVID
● Reliability is key to sustaining millions of hours of streaming every day
● Requires constant Quality of Service monitoring
● Requires 24x7 support across the world
