This document discusses how to architect WebRTC applications for scalability. It begins by outlining the challenges in building scalable WebRTC apps. It then presents four approaches to building apps: 1) to the WebRTC standard, 2) unbundled WebRTC, 3) using open-source media servers, and 4) using a communications platform as a service (CPaaS). Each approach has tradeoffs around cost, difficulty, and included features. The document also discusses using selective forwarding units or multipoint control units to scale apps and considers architectures using orchestration and containers. It concludes with recommendations around optimizations, load testing, and future technologies.
2. @WebRTCventures
Alberto Gonzalez, CTO
Arin Sime, CEO
How to Architect your WebRTC Application for Scalability
Agenda
• Why it’s not easy to build scalable apps with WebRTC
• Open Source vs CPaaS
• So I just need an SFU or MCU to scale my app?
• RTC orchestration and containers
• Stickiness, persistence and load testing
• Optimizations: app optimizations, media optimizations (e.g. codecs)
3. WebRTC is not quite this simple…
• STUN/TURN servers
• Application Signaling
• Video codecs
• Browser/Mobile Support
• Recording
• Group chat/scaling
• Broadcasting
@WebRTCventures
4. 4 Ways to build your app…
1. To the standard, i.e. “build your own stack”
2. Unbundled WebRTC
3. Open source media servers
4. CPaaS – Communications
Platform as a Service
@WebRTCventures
5. #1 – Building to the WebRTC Standard
• Compiling webrtc lib
• STUN/TURN servers
• Application Signaling
• Video/audio codecs
• Group chat/scaling
• Browser/Mobile Support
• Recording/Other Add-on features
• Can better utilize capabilities like
WebCodecs, WebTransport and
control low level details for specific
use cases
You must build and handle all of the following –
with great power comes great responsibility!
https://webrtc.org
@WebRTCventures
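When building to the standard, even application signaling is entirely yours to design. As a rough illustration (not from the slides; room and peer identifiers are hypothetical), the core of a signaling service is just relaying SDP offers/answers and ICE candidates between peers in the same room:

```javascript
// Minimal in-memory signaling relay: peers in the same room exchange
// SDP offers/answers and ICE candidates through it. In production this
// would sit behind WebSockets with auth; here send() is injected.
class SignalingRelay {
  constructor() {
    this.rooms = new Map(); // roomId -> Map(peerId -> send callback)
  }
  join(roomId, peerId, send) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Map());
    this.rooms.get(roomId).set(peerId, send);
  }
  leave(roomId, peerId) {
    this.rooms.get(roomId)?.delete(peerId);
  }
  // Forward a message (offer/answer/candidate) to everyone else in the room.
  relay(roomId, fromPeer, message) {
    const peers = this.rooms.get(roomId);
    if (!peers) return 0;
    let delivered = 0;
    for (const [peerId, send] of peers) {
      if (peerId !== fromPeer) {
        send({ from: fromPeer, ...message });
        delivered++;
      }
    }
    return delivered;
  }
}
```

The relay logic stays this small; the hard parts are everything around it (reconnection, auth, presence), which is exactly the responsibility this approach takes on.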
6. #2 – Unbundling WebRTC
May be appropriate when you find yourself saying “I wish WebRTC would do this instead…”
@WebRTCventures
Typical WebRTC pipeline:
Capture → Encode → Send → [media server] → Receive → Decode → Play
Unbundled pipeline (WebAssembly processing in the middle):
Capture → WebCodecs → WebTransport → [media server] → WebTransport → WebCodecs → Play
Diagram adapted from a presentation by Tsahi Levent-Levi on WebRTC Live, bloggeek.me
Unbundled WebRTC allows the use of other standards to gain more control over the codecs and transport, as well as to add insertable streams
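The staged pipeline above can be sketched with the WHATWG Streams API (a global in Node 18+ and in browsers). This is only a shape sketch: the "encode" and "encrypt" stages are toy stand-ins for a real WebCodecs encoder and an insertable-streams transform, not actual media handling.

```javascript
// Sketch of an unbundled media pipeline using the WHATWG Streams API.
// Each stage is a TransformStream, mirroring how insertable streams let
// you splice processing (e.g. E2E encryption) between capture and send.
function stage(fn) {
  return new TransformStream({
    transform(frame, controller) { controller.enqueue(fn(frame)); },
  });
}

async function runPipeline(frames) {
  // "Capture": a stream of raw frames.
  const source = new ReadableStream({
    start(controller) {
      for (const f of frames) controller.enqueue(f);
      controller.close();
    },
  });
  const encode = stage(f => ({ ...f, encoded: true })); // stand-in for WebCodecs
  const encrypt = stage(f => ({ ...f, payload: [...f.payload].reverse().join('') })); // toy transform
  const out = [];
  const sink = new WritableStream({ write(f) { out.push(f); } }); // stand-in for WebTransport send
  await source.pipeThrough(encode).pipeThrough(encrypt).pipeTo(sink);
  return out;
}
```

The point of unbundling is exactly this composability: any stage can be swapped or augmented without waiting for the WebRTC stack to expose a knob for it.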
7. #3 – Open Source Media Servers
Media Servers will handle:
• video/audio details
• part or all of the signaling
• Possibly STUN, TURN
• Scaling capabilities
• Could be SFUs or MCUs
• Browser/Mobile support
But you host/manage:
• All infrastructure and updates
(Diagram: media servers running on your own cloud servers)
@WebRTCventures
8. #3 – Open Source Media Servers
(Diagram: media servers running on your own cloud servers)
@WebRTCventures
Popular examples: Janus (janus.conf.meetecho.com), Jitsi (jitsi.org), Pion (pion.ly), mediasoup (mediasoup.org), LiveKit (livekit.io)
9. #4 – CPaaS – Communications Platforms
(Diagram: your application servers connected to the CPaaS)
A CPaaS will handle:
• All WebRTC support / updates
• Media Servers
• STUN/TURN
• Web/Mobile Support
• Additional features like Recording, SMS,
Voice/VOIP, Transcription, etc
But you pay according to usage
@WebRTCventures
10. #4 – CPaaS – Communications Platforms
@WebRTCventures
Popular examples: (shown as logos on the slide)
11. It’s all about tradeoffs…

                      WebRTC     Unbundled    Open Source     CPaaS
                      Standard   WebRTC       Media Servers
Up front cost         High       High         Medium          Low
Ongoing cost          Low        Low          Low             High
Technical difficulty  High       Medium-High  Medium          Low
Features included     Low        High*        Medium          High

*Not really included, but you have flexibility to build your own on top of the underlying APIs
@WebRTCventures
And what about intellectual property? Also, what works for you today does not have to be your long-term choice.
13. SFUs or MCUs can help scale WebRTC
MCU – Multipoint Control Unit
• Handles mixing of video/audio streams in a central server
so each participant only has one stream to deal with
SFU – Selective Forwarding Unit
• Each participant connects only to the SFU, but receives a unique stream from each other participant
Either can add features beyond scaling
• Recording
• Broadcasting
• Interface to other services like transcription or VoIP legacy
systems
@WebRTCventures
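One way to see the tradeoff between the two is to count streams: with an MCU every client sends one stream and receives one mixed stream, while with an SFU every client receives one forwarded stream per other participant, so downstream load grows with group size. A back-of-the-envelope sketch (counts only, ignoring simulcast and the audio/video split):

```javascript
// Stream counts for an N-way call.
// MCU: clients send 1 and receive 1 mixed stream; the server decodes
// and re-encodes everything (high CPU, predictable bandwidth).
// SFU: clients send 1 and receive N-1 forwarded streams; the server
// only routes packets (low CPU, bandwidth grows with group size).
function streamCounts(n) {
  return {
    mcu: { clientUp: 1, clientDown: 1, serverOut: n },
    sfu: { clientUp: 1, clientDown: n - 1, serverOut: n * (n - 1) },
  };
}
```

For an 8-person call the SFU forwards 8 × 7 = 56 outbound streams while the MCU sends only 8, at the cost of the MCU decoding and re-encoding everything centrally.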
14. MCU example
• Multipoint Control Unit
• Central server mixes all audio and video
• Participants each download only one stream for audio and one for video
• MCU controls a composited layout of that video for
everyone, which can be nice but also introduces
latency
• Heavy processing is required on a MCU, but offers
more predictable bandwidth requirements
Media Servers offering MCU capability (not a comprehensive list; shown as logos on the slide)
@WebRTCventures
15. SFU example
• Selective Forwarding Unit
• Routes the correct stream to each user
• Still unique streams for each participant
(allows for layout changes on user side)
• A more powerful and more modern option, but a more complicated implementation
• Lower server CPU required but more variable
bandwidth (based on # of users)
• Possible to do end-to-end encryption
Media Servers offering SFU capability (not a comprehensive list; shown as logos on the slide)
@WebRTCventures
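The "selective" part of an SFU is a routing decision, not media processing. A minimal sketch of that decision (the muted flag here is an illustrative stand-in for richer policies such as active-speaker or simulcast layer selection):

```javascript
// Minimal SFU forwarding decision: each subscriber receives every other
// participant's stream, except streams that are muted. Never echoes a
// participant's own stream back to them.
function buildForwardingMap(participants) {
  const active = participants.filter(p => !p.muted).map(p => p.id);
  const map = {};
  for (const p of participants) {
    map[p.id] = active.filter(id => id !== p.id);
  }
  return map;
}
```

Because the SFU only routes, end-to-end encryption remains possible: the server can forward payloads it cannot decrypt, which an MCU (which must decode to mix) cannot do.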
16. Scaling beyond single media server applications
Depends on the use case… What happens if we have 1000+ viewers?
For large broadcasting applications:
@WebRTCventures
(Diagram: multiple cascaded SFUs fanning the broadcast out to viewers)
17. Scaling beyond single media server applications
For large multiparty video conferencing applications:
@WebRTCventures
(Diagram: users spread across multiple interconnected SFUs)
18. Video group calls with telephony integration
@WebRTCventures
(Diagram: web users publish to an SFU/MCU over WebRTC, while phone callers join the same call through an IP-PBX over SIP/RTP)
19. Large Video Conferencing Architecture considerations
• Multiparty video conferencing support?
• Integration of multiple channels
• Integration with VoIP legacy systems
• Recording/voicemail and speech to text
@WebRTCventures
20. Orchestration and Containers in WebRTC Applications to Achieve Horizontal Scalability
Challenges
• Decouple media server from application logic
• Stateful system complexities
• Autoscaling / Downscaling
• Overprovisioning
@WebRTCventures
21. WebRTC Scalability Autoscaling Rules
Planning your autoscaling rules
• Connections threshold for autoscaling
○ More accurate than CPU/bandwidth
• Maximum number of sessions/rooms per server
• Maximum users per room
○ So that per-server capacity stays predictable
• Desired resources buffer for quick spikes:
○ 1, 2 or even 10 servers ready?
@WebRTCventures
Chart: example of users joining media servers at different paces
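The connection-threshold rule above reduces to simple arithmetic. A sketch of the scale-out decision, with illustrative numbers rather than recommendations:

```javascript
// Connection-based autoscaling rule: how many media servers should be
// running for the current load, plus a standing buffer for quick spikes.
// capacityPerServer and bufferServers are illustrative knobs.
function desiredServerCount({ connections, capacityPerServer, bufferServers }) {
  const needed = Math.ceil(connections / capacityPerServer);
  return Math.max(1, needed + bufferServers); // always keep at least one server
}
```

With 950 active connections, 400-connection servers and a 2-server buffer this asks for 5 servers; the buffer absorbs spikes while new instances boot, which matters when server start time exceeds a minute.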
22. WebRTC scalability, stickiness and persistence
Sticky Sessions
• We need all users in a call to use the same media server
• Generally needs additional app logic built to distribute traffic accordingly
• Approaches:
○ Cookie-based load-balanced sticky sessions
○ Direct routing through initial auth
Data Persistence
• All servers need to be aware of the current state of the connections
• DB- or cache-based storage systems can be used to store session information and distribute traffic
• Pub/sub mechanisms can be a good addition to decouple components and scale them independently
@WebRTCventures
Basic WebRTC Scalability and High
Availability Architecture
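Stickiness and persistence meet in a single lookup: the room-to-server assignment lives in a shared store, so every app server routes a given room to the same media server. In this sketch a Map stands in for an external store such as Redis (an assumption, not something the slides prescribe), and least-loaded selection is one simple distribution policy:

```javascript
// Sticky room-to-server routing backed by a shared store. The Map stands
// in for an external store (e.g. Redis) so all app servers see the same
// assignments and a session survives any one app server crashing.
class StickyRouter {
  constructor(servers, store = new Map()) {
    this.servers = servers;                          // e.g. ['sfu-1', 'sfu-2']
    this.store = store;                              // roomId -> serverId (shared)
    this.load = new Map(servers.map(s => [s, 0]));   // connections per server
  }
  route(roomId) {
    let server = this.store.get(roomId);
    if (!server) {
      // New room: pick the least-loaded server and persist the choice.
      server = [...this.load.entries()].sort((a, b) => a[1] - b[1])[0][0];
      this.store.set(roomId, server);
    }
    this.load.set(server, this.load.get(server) + 1);
    return server;
  }
}
```

Keeping the assignment out of process memory is the whole point: any app server can answer for any room, and a replacement server can recover the mapping after a failure.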
23. WebRTC load testing: testing your scalable application
@WebRTCventures
Approaches
• Build your own
• Open Source
• Third party platforms
What do we want to validate?
• Connections and media received/sent
• Jitter/Round Trip Time (RTT)/Packet Loss
• Acceptable Mean Opinion Score (MOS)
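MOS is usually estimated from the network metrics listed above rather than measured directly. One widely circulated simplification of the ITU-T G.107 E-model maps RTT, jitter and packet loss to a 1-5 score; treat the constants as heuristics, good for spotting degradation in a load test, not for formal quality claims:

```javascript
// Rough MOS estimate from RTT, jitter and packet loss, using a common
// simplification of the ITU-T G.107 E-model (R-factor -> MOS mapping).
function estimateMos({ rttMs, jitterMs, lossPercent }) {
  // One-way latency plus a jitter penalty and a fixed codec delay.
  const effectiveLatency = rttMs / 2 + jitterMs * 2 + 10;
  let r = effectiveLatency < 160
    ? 93.2 - effectiveLatency / 40
    : 93.2 - (effectiveLatency - 120) / 10;
  r -= lossPercent * 2.5; // each 1% loss costs ~2.5 R points
  r = Math.max(0, r);
  return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r);
}
```

A healthy call (40 ms RTT, 5 ms jitter, no loss) scores above 4, while a degraded one (600 ms RTT, 60 ms jitter, 5% loss) drops below 3, which is roughly where users start to complain.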
24. Application and Media Optimizations today
What can you do?
• Simulcast or SVC
• Audio detection
• Adaptive bitrate based on resolution
• Opus RED and DTX (Discontinuous Transmission)
@WebRTCventures
(Diagrams: a WebRTC SFU forwarding SVC layers to subscribers, and a receiver using Opus RED redundant audio data to recover a missing packet)
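With simulcast the sender uploads several encodings and the SFU picks, per subscriber, the highest layer that fits that subscriber's estimated downlink. The selection logic itself is small (layer bitrates here are illustrative, not standard values):

```javascript
// Per-subscriber simulcast layer selection: pick the highest-bitrate
// encoding that fits the subscriber's estimated downlink. Real SFUs
// also weigh CPU load, requested resolution, and active speakers.
const LAYERS = [
  { rid: 'h', bitrateKbps: 1500 }, // high
  { rid: 'm', bitrateKbps: 500 },  // medium
  { rid: 'l', bitrateKbps: 150 },  // low
];

function pickLayer(availableKbps, layers = LAYERS) {
  for (const layer of layers) {
    if (layer.bitrateKbps <= availableKbps) return layer.rid;
  }
  return null; // no layer fits: fall back to audio-only
}
```

SVC works the same way from the SFU's point of view, except the layers are nested inside one encoded stream and are dropped rather than selected.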
25. Application and Media Optimizations tomorrow
What will be recommended soon?
• AV1 video codec*
• Lyra V2 audio codec*
• Other ML optimizations (e.g: Noise Reduction or
packet loss concealment)
@WebRTCventures
*It is possible to use these today, but encoding performance is not great because average hardware is not ready and some browsers and devices don’t support them yet
Lyra v2 Google open source results: https://opensource.googleblog.com/2022/09/lyra-v2-a-better-faster-and-
more-versatile-speech-codec.html
26. Thank you!
Learn more about us:
https://webrtc.ventures
Follow us on Twitter:
@WebRTCventures
Experts in live video app development for:
Telehealth, Broadcasting, Contact Centers, and More!
@lbertogon
@arinsime
Contact us at team@webrtc.ventures
Editor’s notes
Janus built in C, Jitsi built using Java, MediaSoup built with C++ and Pion uses Go
Arin’s last slide - “and now Alberto will talk more about media servers and architectural use cases”
As Arin mentioned, there are different alternatives when building your own WebRTC application. In most cases direct peer-to-peer communication is really not an option.
In this diagram, at the right, you can see a representation of how an 8-participant peer-to-peer network would look. Without an intermediate media server helping us, it is a bit messy, and especially resource-intensive!
To reduce the amount of resources used at the edge, there are 2 popular architectures used to scale this…
MCU arch and SFU arch
Now, how do we scale beyond a single MCU or SFU server? IT DEPENDS ON THE USE CASE
Media processing operations are very CPU intensive! An XLarge AWS instance, for example, can’t handle more than a couple hundred SD-quality video streams. What happens if we have 1000 viewers?
This is how the architecture in a very high level would look like
And what about other use cases?
So, in the case of larger video chat rooms, with 50+ participants in the same room you might reach the limits of your infrastructure. It is a good idea to use multiple MCU/SFU servers, with each user able to connect to them simultaneously!
There are smarter ways to do this but this is an approach.
And what about many concurrent connections, or integrations with telephony systems like an IP-PBX and so on? Well… that’s complicated, but a simplified flow could look like this: web and mobile users connect to an SFU/MCU, while other users dial in from the PSTN at the right.
But there are potential challenges caused by differences in: the SIP implementation, supported resolutions, whether RTCP muxing is supported, different codecs (or, in the case of H264, the different profile-level-ID that defines it), and ICE (trickle) support…
So what are the main considerations when deciding which of those architectures to use or implement?
One is really evaluating the number of participants (small groups will make things easier). Also, do we need advanced functionality like recording or PSTN dial-in?
Now how can we orchestrate this for horizontal scalability and handle 1000s of users calling concurrently? One proposed architecture could look like this diagram here where we have…
The challenges we are trying to solve are:...
Stateful architectures: crashes are harder to handle and scaling is difficult. We need to stick to the same server we were using before and reconnect. WebRTC doesn’t benefit from caching like stateless protocols such as HLS.
When scaling, if we want to be efficient we want to use autoscaling rules. These rules will…
For example, in this chart we assume a connection threshold (the point at which a new server is started) of 100 connections and a server capacity of 400 connections. If our server start time is more than 60 seconds, we wouldn’t have time to handle more than 5 new connections per second (or 300-connection spikes in a minute).
Another two important topics when building scalable and highly available WebRTC solutions are sticky sessions and data persistence. By sticky sessions I mean connections that will be maintained so a user always reaches the same server. This is important because…
Data persistence is what will allow us to distribute traffic between multiple concurrent services and to recover the session in case of a server failure.
After that, once we have our scalable infrastructure ready, we will need to load test what we built. To do that it won’t be enough to simulate HTTP requests; we will need to simulate real video and audio traffic.
For that, instances simulating participants are typically used to connect to the servers and send/receive media.
In addition to verifying we are effectively connected and receiving the right number of connections, we will need to evaluate performance by capturing jitter/RTT/packet loss so we can obtain an acceptable quality score, or MOS.
MOS (Mean Opinion Score) is usually used to measure A/V quality.
Finally, there are also application and media optimizations you can use to have better quality and have a better experience when scaling. One way is using simulcast or SVC..
SVC is a technique that allows encoding a video stream once into multiple layers. Layers can be subtracted while maintaining the video, reducing its quality with the removal of each layer (fps, resolution or SNR layers).
Another is just limiting the amount of video. Do you really need video? Some great tools just use audio (Twitter Spaces, Clubhouse, Slack Huddle). In that case we can optimize performance a lot and reach a much larger audience! Or at least send only the video of the active participant, using audio detection.
Adapting the bitrate based on video resolution can be done on the fly without renegotiation.
And for audio we have Opus RED and DTX; DTX is shown in the diagram in the bottom-right corner.
DTX simply stops transmitting audio packets during silence…
With that would almost conclude our presentation, but there are many more configuration options and optimizations that will be available in the future.
Great teams of engineers are working on different fronts to surpass our current limitations.
For example, AV1 and Lyra V2. AV1 has been around for a while already, but most devices don’t support its hardware acceleration yet, so it causes very high CPU usage as of today.
Regarding Lyra V2: iOS and other embedded platforms are not supported at this time, but this may change in the future. You can see how Lyra achieves the same quality at much lower bitrates!
And, since earlier we talked about Opus DTX and how we stop sending audio when it is not necessary… some improvements can be made for better quality.
The process of dealing with missing packets is called packet loss concealment (PLC). The receiver’s PLC module is responsible for creating audio (or video) to fill the gaps created by packet loss, excessive jitter or temporary network glitches, all of which result in an absence of data. The goal is to realistically continue short speech segments, fully synthesizing the raw waveform of the missing speech.
And with that, our presentation about architecting your WebRTC application for scalability concludes. I really hope you enjoyed it and learned from it!
And if you are interested in these topics, visit us or follow us on Twitter! Thank you so much for watching!