1. Automated Profiling of Virtualized Media Processing
Functions Using Telemetry and Machine Learning
Rufael Mekuria (Unified Streaming), Michael J. McGrath (Intel), Victor
Bayon-Molino (Intel), Vincenzo Riccobene (Intel), Christos Tselios (Citrix),
Artem Dobrodub (Nokia), John Thomson (OnApp)
ACM Multimedia Systems Conference 2018, June 12-15 Amsterdam the Netherlands: 5G Multimedia
2. Summary/Overview
- Context: Emerging 5G and 5G Multimedia technologies
- Research challenge
- Contribution
- TALE based profiling approach
- Virtualized media processing function
- Profiling using TALE and telemetry
- KPI Mapping using telemetry and machine learning
- Conclusion/Future work
3. Context (1): Emerging 5G Network Technologies
[Figure labels]
5G RAN Cloudification
Distribution
Virtualized Network Functions
Network Function Virtualization Management & Orchestration
Massive MIMO
Network slicing
Millimetre Wave technologies (26+ GHz)
Multi-Access Edge Computing
Virtualisation technologies
5G Radio Access
5G Cloud/NFV virtualized core
Small Cells
WLAN + LTE
Cloud Operating System, e.g. hypervisor
4. Context (2): Advanced Media Services
[Figure labels]
Content security: content encryption, watermarked streams
Personalized video streaming: personalized advertisement, personalized captions (language)
Emerging formats (smarter pixels): 360 degree, High Dynamic Range, Point Cloud, Light Field
Many pixels: Ultra HD (4K, 8K), High Frame Rate
Increased bandwidth of 5G radio link -> more data
Virtualized 5G cloud native core -> smarter network, smarter streaming
Computer Vision
5G Transcode/transmux
AR/MR
5. 5G Context (3): Example edge converged cloud native 5G architecture
[Figure labels: User equipment; Access Network; 4G LTE; LTE Broadcast; 5G Wireless; Virtualized Network; Cloud-RAN; Radio network information; orchestration; Edge cloud; MEC Cloud; Aggregation Network; Regional cloud; vEPC; vIMS; Core Cloud; Scalable data center cloud infrastructure; Virtualized network infrastructure]
Source: Giuseppe Bianchi et al., Superfluidity: a flexible functional architecture for 5G networks,
Transactions on Emerging Telecommunications Technologies, Volume 27, Issue 9,
Special Issue: 5GPPP Feature Issue, September 2016, Pages 1178-1186
6. Context (4): Design goals of Superfluid 5G network
- Hardware independence (x86, ARM, GPU, FPGA)
- Time independence
- Scale independence: scale from 1 to many users (millions)
- Location independence
- Reduce costly overprovisioning
Broad range of technologies needed (NFV, MEC, SDN, etc.):
a core of advances in cloud/NFV technologies (hence our work)
https://www.unified-streaming.com/blog/5g-superfluidity-and-future-streaming-video
7. Context (5): Some NFV pointers
+ Rashid Mijumbi et al., Network Function Virtualization: State-of-the-Art and Research Challenges,
IEEE Communications Surveys & Tutorials, vol. 18, no. 1, First Quarter 2016
+ http://www.etsi.org/technologies-clusters/technologies/nfv
Difference NFV vs. cloud: service/function abstraction (VNF); per-function optimization is possible
8. Context (6): Virtualisation Stack
[Figure: Hypervisor type 1 | Hypervisor type 2 | Operating system level]
Our focus is type 1 virtualisation
* Mark Croes, Performance analysis of virtualized video streaming service,
Bachelor Thesis, University of Amsterdam, June 2017
9. Research Challenges
R1: Performance modelling/workload characterization
-> virtual box in the cloud vs. physical box; the underlying hardware is
heterogeneous; cloud operating system stack
R2: Reduce overprovisioning of the underlying physical/virtual infrastructure
R3: KPI mapping for more efficient telemetry/monitoring, identifying the most
important metrics relating to the service quality defined in the SLA
Goal: “efficient carrier grade cloud native processing functions”;
this work is a step in this direction
10. Prior art
- Cloud video streaming [1-5]
- Mathematical modelling (Jackson queue etc.) uses QoS constraints to
enable more efficient scale-in/out, etc.
- Not sufficient for real deployment (anomalies, unexpected behavior)
- Not sufficient for NFV -> heterogeneous underlying hardware (entropy)
- Not sufficient for MPEG-DASH, where client behavior is not standardized,
hence modelling user load is difficult
- Google golden signals, CPU thread state, off-CPU analysis -> no automated
step mapping to a specific workload (domain knowledge needed); automated
mapping is useful for function abstraction in NFV
[1] Wu et al., "CloudMedia: When Cloud on Demand Meets Video on Demand," in IEEE ICDCS, Minneapolis, 2011, pp. 268-277.
[2] Nan et al., "Optimal allocation of virtual machines for cloud-based multimedia applications," in IEEE MMSP Workshop, 2012, pp. 175-180.
[3] D. Niu et al., "Quality-Assured Cloud Bandwidth Auto-Scaling for Video-on-Demand Applications," in IEEE INFOCOM, 2012.
[4] Y. Jin, Y. Wen, C. Westphal, "Optimal Transcoding and Caching for Adaptive Streaming in Media Cloud: An Analytical Approach,"
IEEE TCSVT, vol. 25, no. 12, pp. 1914-1925, December 2015.
[5] J. He, Y. Wen, J. Huang, D. Wu, "On the Cost-QoE Trade-off for Cloud-based Video Streaming under Amazon EC2's Pricing Models,"
IEEE TCSVT, vol. 24, no. 4, pp. 669-680, September 2013.
11. Contribution
- Automated profiling of virtualized media processing functions
to reduce overprovisioning in an NFV/cloud deployment
- Automated KPI mapping of virtualized media processing functions, enabling
more efficient, targeted telemetry-based monitoring (this can
enable more efficient scale-in/out, etc.)
- Empirical approach for cloud video streaming (we only focus on profiling
and KPI mapping)
12. TALE metric collection summary
Metrics are obtained from the cloud hardware using, e.g., Intel Snap,
OpenStack Ceilometer, Amazon CloudWatch, etc. Metrics are
obtained from OS/hypervisors, compute, storage, networks, etc.
TALE: Throughput, Anomaly, Latency, Entropy
Full stack monitoring: collect as many metrics as possible from each of the
layers; statistics/machine learning are then used to
identify the key metrics -> we discuss the analytics pipeline later
13. Virtualized Media Processing Function
- Media (audio/video) is a large percentage of online traffic, which is expected to continue in 5G
- Compute capabilities in the network edge can improve video distribution [6], [7]
- We consider a streaming function that delivers content via adaptive streaming
(HLS/DASH) with captions and content encryption, all from a single media source, based on
Unified Origin. NOTE: this is an Apache plugin
- In this paper the media processing function is deployed in the central cloud,
where it serves as origin server.
[6] Rufael Mekuria, Jelte Fennema, and Dirk Griffioen. 2016. Multi-Protocol Video
Delivery with Late Trans-Muxing. In Proceedings of the 2016 ACM on Multimedia
Conference (MM '16). ACM, New York, NY, USA, 92-96. DOI:
https://doi.org/10.1145/2964284.2967189
[7] https://www.unified-streaming.com/news/finnish-telecom-leader-elisa-teams-unified-streaming-late-transmuxing
14. Performance of Live vs. VoD
- Performance analysis and planning model for live and VoD
- Basic setup: source -> origin -> CDN -> client
[Figure: load on origin for increasing number of users]
https://www.unified-streaming.com/blog/scaling-video-streaming-live-versus-vod
Does not converge for VoD!
15. Experimental Setup
[Figure labels]
Load generator by Citrix: simusers requesting different contents
Origin VM (Apache): 2 vCPU, 4 GB RAM
OpenStack Liberty release (KVM based)
Telemetry agent
Storage & Visualisation
KPI Mapping Framework (TBD later)
17. Profiling(2): Behavior tAle
- Fix/configure the Apache MPM (multi-processing module) configuration
- Result -> larger number of connections and higher throughput: 7 Gbps (up from 3.6 Gbps)
18. Profiling(3): Throughput tests (Tale)
- Operational range leading to a breaking point
- Previous results only showed saturation
- How to identify/predict the breaking point?
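One simple way to tell a breaking point from mere saturation in a load sweep is to look for the first load level at which throughput falls well below the best value seen so far. The sketch below is illustrative only: the function name, the 30% drop threshold, and the sample numbers are assumptions, not the method or measurements of this work.

```python
from typing import Optional, Sequence, Tuple


def find_breaking_point(
    samples: Sequence[Tuple[int, float]],
    drop_fraction: float = 0.3,
) -> Optional[int]:
    """Return the first load level at which throughput collapses.

    samples: (concurrent_users, throughput) pairs, ordered by increasing load.
    drop_fraction: hypothetical threshold; a drop of more than this fraction
    below the best throughput seen so far counts as a collapse.
    A flat plateau (saturation) never triggers; only a real drop does.
    """
    best = 0.0
    for users, throughput in samples:
        if best > 0 and throughput < (1.0 - drop_fraction) * best:
            return users
        best = max(best, throughput)
    return None


# Made-up sweep: throughput rises, plateaus, then collapses.
sweep = [(100, 1.0), (200, 2.0), (400, 3.5), (800, 3.6), (1600, 1.2)]
print(find_breaking_point(sweep))
```

A detector like this only identifies the breaking point after the fact; predicting it ahead of time would need the telemetry-based mapping discussed later in the deck.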
19. Profiling(4): Anomalous Behavior tAle
- Interrupt storm: dealing with interrupts from I/O (NIC/storage)
- VM exits (control transfers between the VM and the hypervisor): at least one exit and
one entry per interrupt
- Overwhelms the CPU
- This is an example of virtualization overhead in hypervisor-based virtualization, which
should be accounted for in practical systems
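The cost of an interrupt storm can be estimated with back-of-the-envelope arithmetic: if every interrupt costs at least one VM exit and one VM entry, the share of a core spent on these transitions grows linearly with the interrupt rate. The sketch below assumes a hypothetical cost per exit/entry round trip and a hypothetical interrupt rate; neither number is a measurement from this work.

```python
def vm_exit_cpu_fraction(
    interrupts_per_sec: float,
    exit_entry_cost_us: float,
    exits_per_interrupt: int = 1,
) -> float:
    """Fraction of one CPU core consumed by VM exit/entry round trips.

    exits_per_interrupt defaults to 1, the lower bound of at least one
    exit and one entry per interrupt.
    exit_entry_cost_us is an assumed cost, in microseconds, for one exit
    plus the matching entry, including hypervisor handling.
    """
    busy_us_per_sec = interrupts_per_sec * exits_per_interrupt * exit_entry_cost_us
    return busy_us_per_sec / 1_000_000


# Hypothetical numbers for illustration only.
fraction = vm_exit_cpu_fraction(interrupts_per_sec=200_000, exit_entry_cost_us=2.0)
print(f"{fraction:.0%} of one core spent on VM transitions")
```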
20. Profiling(5): VM Exits caused by interrupts
Steps in dealing with interrupts when the guest VM is not interruptible
[Figure: Guest VM / Hypervisor flow: Interruptible? No (RFLAGS.IF=0); Set interrupt window exit; Wants to send interrupt; Exit: reason interrupt window]
- Record the cause of the exit in the VM exit information
- Save processor state (control registers, debug registers, pending exceptions)
in the guest state area
- Save MSRs (model-specific registers)
- Load processor state from the host state area
- After the hypervisor has finished handling the exit, a similar
VM entry operation will happen
21. Profiling (7): Latency and Entropy summary (taLE)
Single hardware/OS configuration -> hardware entropy not studied
Other entropic behavior not observed in test setup -> e.g. time of day, temperature
Analysis of subsystem metrics e.g. CPU utilization, memory etc. under
normal operational conditions did not reveal any unexpected behaviours
Undefined behavior after breaking point
22. KPI(1): Automated KPI Mapping
[Figure: KPI Mapping framework; Analytics Pipeline]
An ensemble approach is taken, combining the different algorithms, metrics with > 75%
Machine learning approach with ensemble approach
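As a rough illustration of ensemble-style KPI mapping, the sketch below scores each telemetry metric against a KPI with two scorers and keeps metrics whose averaged score passes a cut-off. The slide does not list the algorithms in the pipeline, so Pearson and Spearman correlation are stand-ins chosen here; the 0.75 cut-off merely echoes the 75% figure above, and the metric names and data are made up.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def map_kpi(
    kpi: Sequence[float],
    metrics: Dict[str, Sequence[float]],
    threshold: float = 0.75,  # assumed cut-off, see lead-in
) -> Dict[str, float]:
    """Keep metrics whose mean absolute score across scorers exceeds threshold."""
    scorers = (pearson, spearman)
    selected = {}
    for name, series in metrics.items():
        score = sum(abs(f(series, kpi)) for f in scorers) / len(scorers)
        if score > threshold:
            selected[name] = score
    return selected


# Made-up samples: one metric tracks the KPI, the other does not.
throughput = [1.0, 2.0, 3.0, 4.0, 5.0]
telemetry = {
    "net_tx_bytes": [10.0, 21.0, 29.0, 41.0, 50.0],
    "disk_io_wait": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = map_kpi(throughput, telemetry)
print(sorted(selected))
```

Averaging several scorers makes the selection less sensitive to the weaknesses of any single one, which is the general motivation for an ensemble.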
23. KPI(2): KPI Mapping bootstrap
Client side metrics (measured by load generator):
KPI | Definition | Measured at
Throughput per client | Received bytes per client (MB/s) | Client
Latency per user request | Delay to first byte received | Client
Transaction failure rate | Number of requests resulting in an error | Client
Requests / second | Requests handled per second per user | Client
Idea is to find the cloud metrics that can be measured by telemetry
and that correlate most with these client side KPIs
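The four client-side KPIs can be derived from a per-request log. The sketch below assumes a hypothetical record schema and measurement window; it is not the format used by the load generator in this work, and the failure rate is expressed here as a fraction of requests.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    """One request as a load generator might log it (hypothetical schema)."""
    client_id: str
    bytes_received: int
    time_to_first_byte_s: float
    ok: bool


def client_kpis(records: List[RequestRecord], window_s: float) -> Dict[str, float]:
    """Aggregate the four client-side KPIs over one measurement window."""
    if not records or window_s <= 0:
        raise ValueError("need at least one record and a positive window")
    n_clients = len({r.client_id for r in records})
    n_requests = len(records)
    total_mb = sum(r.bytes_received for r in records) / 1e6
    return {
        "throughput_mb_s_per_client": total_mb / window_s / n_clients,
        "mean_latency_s": sum(r.time_to_first_byte_s for r in records) / n_requests,
        "failure_rate": sum(not r.ok for r in records) / n_requests,
        "requests_per_s_per_client": n_requests / window_s / n_clients,
    }


# Made-up log: two clients, four requests, one failure, 2-second window.
records = [
    RequestRecord("a", 2_000_000, 0.05, True),
    RequestRecord("a", 2_000_000, 0.07, True),
    RequestRecord("b", 4_000_000, 0.06, True),
    RequestRecord("b", 0, 0.30, False),
]
kpis = client_kpis(records, window_s=2.0)
for name, value in kpis.items():
    print(f"{name}: {value:.3f}")
```

Time series of these values, sampled per window, are what the cloud-side telemetry metrics would be correlated against.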
24. KPI(3): KPI Mapping results
Non-optimized network media processing function:
throughput and latency correlate with memory and network metrics
- throughput : network, 98%
- latency : memory, 80%
Optimized processing function: throughput and latency correlate with scheduler metrics
- proc_schedstat_wake_up_local : throughput (local wakeup calls)
- proc_schedstat_running : latency (time running processes)
25. Discussion highlights KPI/profiling
p1. Massive telemetry combined with machine learning
is useful for understanding virtualized media processing function behavior
p2. Operational range identified; behavior is predictable within it but collapses beyond the
breaking point (not saturation); it will be important
to detect catastrophic failure before it happens
K1. For the non-optimized function, throughput/latency map to subsystem metrics (memory, network)
K2. For the optimized function, throughput and latency map to scheduler metrics
K3. Steady state and non-steady state should be distinguished and detected
K4. In both states the respective KPI mapping can then be used to avoid catastrophic failure
K5. An 85% reduction of metrics to be collected was achieved by not collecting non-relevant metrics
K6. 80% to 95% accuracy for throughput
26. Conclusion / Future Work
1. Systematic automated approach for KPI mapping and profiling
2. Not based on pre-defined analytical models
3. Based on real production technologies (OpenStack, Unified Origin),
results can be useful to enable KPI driven scaling, which can reduce overprovisioning
4. A small step towards the goal of “efficient carrier grade cloud native processing functions”
27. Thank You!
Contact me: Rufael Mekuria rufael@unified-streaming.com
Contact Vincenzo: vincenzo.m.riccobene@intel.com
This work is supported by the
European Union (H2020 RIA, GA No. 671566) Superfluidity.