Presentation from AWS re:Invent 2013. See session video here: http://www.youtube.com/watch?v=MjZdiDotRU8
Presentation is in two parts: (1) an introduction to moving workloads to the cloud; (2) a deep dive on how the BBC moved its playout to the cloud.
4. Where AWS Fits Into Media Processing
[Diagram: a media processing stack built on Amazon Web Services. Workflow operations (Ingest, Index, Process, QC, Package, Protect, Auth., Track, Playback) run on AWS infrastructure, with Media Asset Management alongside and Analytics and Monetization applications on top.]
6. Cloud Media Processing Approaches
Phase 1: Lift processing from the premises and shift to the cloud
7. Lift and Shift
[Diagram: the on-premises stack (media processing operation, OS, storage) is reproduced unchanged in the cloud: the same media processing operation, OS, and storage running on EC2, duplicated across multiple instances.]
8. The Problem with Lift and Shift
[Diagram: the monolithic media processing operation running on EC2 (OS plus storage) is opened up, revealing the discrete steps inside: ingest, the media processing operation itself (with its parameters), post-processing, export, and workflow.]
9. Cloud Media Processing Approaches: Phase 2
Phase 1: Lift processing from the premises and shift to the cloud
Phase 2: Refactor and optimize to leverage cloud resources
10. Refactor and Optimization Opportunities
“Deconstruct monolithic media processing operations”
– Ingest
– Atomic media processing operation
– Post-processing
– Export
– Workflow
– Parameters
12. Cloud Media Processing Approaches
Phase 1: Lift processing from the premises and shift to the cloud
Phase 2: Refactor and optimize to leverage cloud resources
Phase 3: Decomposed, modular cloud-native architecture
13. Decomposition and Modularization Ideas
for Media Processing
• Decouple *everything* that is not part of the atomic media processing operation
• Use managed services where possible for
workflow, queues, databases, etc.
• Manage
– Capacity (see the autoscaling sketch below)
– Redundancy
– Latency
– Security
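Slide 36 later mentions autoscaling on queue depth, which is one concrete way to "manage capacity" with managed services: a CloudWatch alarm on SQS queue depth drives an Auto Scaling policy. A minimal sketch with the AWS SDK for Java; the queue name, threshold, and policy ARN are hypothetical.

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class QueueDepthScaling {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        // Alarm when the transcode request queue backs up; the alarm action
        // triggers an Auto Scaling scale-out policy (ARN is a placeholder).
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("transcode-queue-depth-high")
                .withNamespace("AWS/SQS")
                .withMetricName("ApproximateNumberOfMessagesVisible")
                .withDimensions(new Dimension()
                        .withName("QueueName")
                        .withValue("transcode-requests"))   // hypothetical queue
                .withStatistic(Statistic.Average)
                .withPeriod(60)                             // 1-minute samples
                .withEvaluationPeriods(2)                   // sustained for 2 minutes
                .withThreshold(100.0)                       // >100 visible messages
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:autoscaling:REGION:ACCOUNT:scalingPolicy/scale-out"));
    }
}
```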
14. BBC iPlayer in the Cloud
AKA “Video Factory”
Phil Cluff
Principal Software Engineer & Team Lead
BBC Media Services
15. Sources:
BBC iPlayer Performance Pack August 2013
http://www.bbc.co.uk/blogs/internet/posts/Video-Factory
16. What Is BBC iPlayer?
• The UK’s biggest video & audio on-demand service
– And it’s free!
• Over 7 million requests every day
– ~2% of overall consumption of BBC output
• Over 500 unique hours of content every week
– Available immediately after broadcast, for at least 7 days
• Available on over 1000 devices including
– PC, iOS, Android, Windows Phone, Smart TVs, Cable Boxes…
• Both streaming and download (iOS, Android, PC)
• 20 million app downloads to date
17. What Is Video Factory?
• Complete in-house rebuild of
ingest, transcode, and delivery workflows for
BBC iPlayer
• Scalable, message-driven cloud-based
architecture
• The result of 1 year of development by ~18
engineers
19. Why Did We Build Video Factory?
• Old system
– Monolithic
– Slow
– Couldn’t cope with spikes
– Mixed ownership with third party
• Video Factory
– Highly scalable, reliable
– Completely elastic transcode resource
– Complete ownership
20. Why Use the Cloud?
• Background of 6 channels, spikes up to 24 channels, 6 days a week
• A perfect pattern for an elastic architecture
[Chart: Off-Air Transcode Requests for 1 week.]
21. Video Factory – Architecture
• Entirely message driven
– Amazon Simple Queue Service (SQS)
• Some Amazon Simple Notification Service (SNS)
– We use lots of classic message patterns
• ~20 small components
– Singular responsibility – “Do one thing, and do it well”
• Share libraries if components do things that are alike
• Control bloat
– Components have contracts of behavior
• Easy to test
22. Video Factory – Workflow
[Architecture diagram: 24 SDI broadcast video feeds (playout video with SMPTE timecode) pass through broadcast encoders and an RTP chunker into Amazon S3 as mezzanine video (the mezzanine video capture path), backed by a Time Addressable Media Store. A playout data feed drives the live ingest logic, which sends work to the transcode abstraction layer; that layer routes jobs to Elemental Cloud and Amazon Elastic Transcoder. Transcoded video and metadata then pass through DRM, QC, editorial clipping, and the MAM, with distribution renditions stored in Amazon S3.]
25. Mezzanine Capture
[Diagram: 24 SDI broadcast video feeds (3 GB HD / 1 GB SD, with SMPTE timecode) enter broadcast-grade encoders, which emit an MPEG2 transport stream (H.264) on RTP multicast (30 MB HD / 10 MB SD). An RTP chunker splits the stream into MPEG2 transport stream chunks, a chunk uploader puts them into an Amazon S3 mezzanine chunks bucket, and control messages drive a chunk concatenator that assembles the final mezzanine file in Amazon S3.]
26. Concatenating Chunks
• Build file using Amazon S3 multipart requests
– 10 GB Mezzanine file constructed in under 10 seconds
• Amazon S3 multipart APIs are very helpful
– Component only makes REST API calls
• Small instances; still gives very high performance
• Be careful – Amazon S3 isn’t immediately consistent when dealing with multipart-built files
– Mitigated with rollback logic in message-based applications
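The deck shows no code for this, but the standard way to build a file from already-uploaded chunks is S3's server-side multipart copy, which fits the "component only makes REST API calls" point. A minimal sketch with the AWS SDK for Java, assuming the chunk objects already live in the bucket; note that every part except the last must be at least 5 MB.

```java
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CopyPartRequest;
import com.amazonaws.services.s3.model.CopyPartResult;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;

public class ChunkConcatenator {

    /**
     * Stitches already-uploaded chunk objects into one mezzanine file using
     * server-side multipart copy: no video bytes ever pass through this
     * component, which is why a small instance can build a 10 GB file fast.
     */
    public static void concatenate(AmazonS3 s3, String bucket,
                                   List<String> chunkKeys, String destKey) {
        InitiateMultipartUploadResult init = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, destKey));

        List<PartETag> partETags = new ArrayList<PartETag>();
        int partNumber = 1;
        for (String chunkKey : chunkKeys) {
            // Each chunk becomes one part; all parts except the last must be >= 5 MB.
            CopyPartResult part = s3.copyPart(new CopyPartRequest()
                    .withSourceBucketName(bucket).withSourceKey(chunkKey)
                    .withDestinationBucketName(bucket).withDestinationKey(destKey)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(partNumber++));
            partETags.add(part.getPartETag());
        }

        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, destKey, init.getUploadId(), partETags));
    }
}
```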
27. By Numbers – Mezzanine Capture
• 24 channels
– 6 HD, 18 SD
– 16 TB of Mezzanine data every day per capture
• 200,000 chunks every day
– And Amazon S3 has never lost one
– That’s ~2 (UK) billion RTP packets every day… per capture
• Broadcast grade resiliency
– Several data centers / 2 copies each
29. Transcode Abstraction
• Abstract away from a single supplier
– Avoid vendor lock-in
– Choose suppliers based on performance and quality and broadcaster-friendly feature sets
– BBC: Elemental Cloud (GPU), Amazon Elastic Transcoder, in-house for subtitles
• Smart routing & smart bundling (illustrated in the sketch below)
– Save money on non–time critical transcode
– Save time & money by bundling together “like” outputs
• Hybrid cloud friendly
– Route a baseline of transcode to local encoders, and spike to cloud
• Who has the next game changer?
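The deck stays at the architecture level here. Purely as an illustration of the abstraction idea (none of these types are from the BBC's codebase; all names are hypothetical), the layer might reduce to a common interface plus a router that picks a backend per job:

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical contract that every transcode supplier is wrapped behind. */
interface TranscodeBackend {
    boolean supports(TranscodeJob job);            // e.g., GPU-only feature sets
    double estimatedCostDollars(TranscodeJob job); // enables cost-based routing
    String submit(TranscodeJob job);               // returns a backend job id
}

class TranscodeJob {
    final String mezzanineKey;
    final boolean timeCritical;
    TranscodeJob(String mezzanineKey, boolean timeCritical) {
        this.mezzanineKey = mezzanineKey;
        this.timeCritical = timeCritical;
    }
}

/** Routes each job to the cheapest capable backend; a real router would
    also weigh speed for time-critical jobs and bundle "like" outputs. */
class TranscodeRouter {
    private final List<TranscodeBackend> backends;

    TranscodeRouter(List<TranscodeBackend> backends) {
        this.backends = backends;
    }

    String route(TranscodeJob job) {
        return backends.stream()
                .filter(b -> b.supports(job))
                .min(Comparator.comparingDouble(
                        (TranscodeBackend b) -> b.estimatedCostDollars(job)))
                .orElseThrow(() -> new IllegalStateException("no capable backend"))
                .submit(job);
    }
}
```

Swapping suppliers, or adding "the next game changer", then means writing one more implementation of the interface rather than touching the workflow.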
32. Example – A Simple Elastic Transcoder Backend
[Flow diagram: get message from queue → unmarshal and validate message → initialize transcode (POST an XML transcode request to Amazon Elastic Transcoder) → wait for SNS callback over HTTP (an XML transcode status message POSTed via SNS). The entire flow runs inside a single SQS message transaction.]
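As a sketch of the "initialize transcode" step with the AWS SDK for Java: the pipeline ID is a placeholder (the pipeline's notification configuration is what produces the SNS status callback), and the preset shown is one of Elastic Transcoder's stock H.264 presets; treat both IDs as illustrative.

```java
import com.amazonaws.services.elastictranscoder.AmazonElasticTranscoder;
import com.amazonaws.services.elastictranscoder.AmazonElasticTranscoderClientBuilder;
import com.amazonaws.services.elastictranscoder.model.CreateJobOutput;
import com.amazonaws.services.elastictranscoder.model.CreateJobRequest;
import com.amazonaws.services.elastictranscoder.model.CreateJobResult;
import com.amazonaws.services.elastictranscoder.model.JobInput;

public class TranscodeInitializer {

    /** Kicks off a transcode; job-status SNS notifications come from the
        pipeline's configuration, not from this call. */
    public static String startJob(String mezzanineKey) {
        AmazonElasticTranscoder ets =
                AmazonElasticTranscoderClientBuilder.defaultClient();

        CreateJobResult result = ets.createJob(new CreateJobRequest()
                .withPipelineId("1111111111111-abcde1")   // placeholder pipeline id
                .withInput(new JobInput().withKey(mezzanineKey))
                .withOutput(new CreateJobOutput()
                        .withKey("renditions/" + mezzanineKey + ".mp4")
                        .withPresetId("1351620000001-000010"))); // stock 720p H.264 preset

        // The job id is what lets us correlate the SNS status callback
        // with the SQS message transaction still held open.
        return result.getJob().getId();
    }
}
```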
33. Example – Add Error Handling
[Flow diagram: the same backend with error handling added. Unmarshal/validation failures route to a bad message queue, repeated processing failures route to a dead letter queue, and known failures around the transcode and its SNS status callback route to a fail queue.]
34. Example – Add Monitoring Eventing
[Flow diagram: the same backend again, now emitting monitoring events at each stage: getting the message from the queue, unmarshalling and validating it, initializing the transcode, and waiting for the SNS callback.]
35. BBC eventing framework
• Key-value pairs pushed into Splunk
– Business-level events, e.g.:
• Message consumed
• Transcode started
– System-level events, e.g.:
• HTTP call returned status 404
• Application’s heap size
• Unhandled exception
• Fixed model for “context” data
– Identifiable workflows, grouping of events; transactions
– Saves us a LOT of time diagnosing failures
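The deck doesn't show the framework's internals. As an illustrative sketch only (field names are hypothetical), an event emitter that writes key=value lines, which Splunk indexes natively, might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Events {

    private static final Logger LOG = LoggerFactory.getLogger("events");

    /** Emits one business- or system-level event as key=value pairs,
        prefixed with fixed "context" fields so Splunk can group all
        events belonging to one workflow. */
    public static void emit(String event, String workflowId,
                            Map<String, ?> fields) {
        Map<String, Object> kv = new LinkedHashMap<String, Object>();
        kv.put("event", event);              // e.g. transcode_started
        kv.put("workflow_id", workflowId);   // fixed context: ties events together
        kv.putAll(fields);

        StringBuilder line = new StringBuilder();
        for (Map.Entry<String, Object> e : kv.entrySet()) {
            line.append(e.getKey()).append('=').append(e.getValue()).append(' ');
        }
        LOG.info(line.toString().trim());
    }
}
```

A caller would emit, e.g., `Events.emit("transcode_started", workflowId, Collections.singletonMap("backend", "elemental"))`, so every event in one workflow shares the same workflow_id and a failure can be traced end to end.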
36. Component Development – General Development & Architecture
• Java applications
– Run inside Apache Tomcat on m1.small EC2 instances
– Run at least 3 of everything
– Autoscale on queue depth
• Built on top of the Apache Camel framework
– A platform for building message-driven applications
– Reliable, well-tested SQS backend
– Camel route builders Java DSL
• Full of messaging patterns
• Developed with Behavior-Driven Development (BDD) & Test-Driven Development (TDD)
– Cucumber
• Deployed continuously
– Many times a day, 5 days a week
37. Error Handling Messaging Patterns
• We use several message patterns
– Bad message queue
– Dead letter queue
– Fail queue
• Key concept
– Never lose a message
– Message is either in-flight, done, or in an error queue somewhere
• All require human intervention for the workflow to
continue
– Not necessarily a bad thing
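Slides 38 and 40 describe the bad message and fail queues as "implemented as an exception handler on the route builder". A hedged sketch in Camel's Java DSL of how that wiring might look; the queue names, exception types, and MessageParser bean are hypothetical (TranscodeInitializer is the class sketched earlier):

```java
import org.apache.camel.builder.RouteBuilder;

public class TranscodeRoute extends RouteBuilder {

    @Override
    public void configure() {
        // Bad message queue: the payload won't unmarshal, or fails validation.
        // Handled (not retried) and parked for a developer to inspect.
        onException(InvalidMessageException.class)
                .handled(true)
                .to("aws-sqs://transcode-bad-messages");

        // Fail queue: something we *knew* could go wrong did go wrong.
        // Replaying needs human knowledge of the system (runbook).
        onException(KnownFailureException.class)
                .handled(true)
                .to("aws-sqs://transcode-failures");

        from("aws-sqs://transcode-requests")
                .bean(MessageParser.class, "unmarshalAndValidate") // may throw InvalidMessageException
                .bean(TranscodeInitializer.class, "startJob");     // may throw KnownFailureException

        // Anything unexpected is left unhandled: SQS redelivers the original
        // message, and the receive-count check (next slides) parks it on the DLQ.
    }
}
```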
38. Message Patterns – Bad Message Queue
The message doesn’t unmarshal to the object it should
OR
We could unmarshal the object, but it doesn’t meet our
validation rules
• Wrapped in a message wrapper which contains context
• Never retried
• Very rare in production systems
• Implemented as an exception handler on the route builder
39. Message Patterns – Dead Letter Queue
We tried processing the message a number of times, and
something we weren’t expecting went wrong each time
• Message is an exact copy of the input message
• Retried several times before being put on the DLQ
• Can be common, even in production systems
• Implemented as a bean in the route builder for SQS
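Camel's redelivery is in-process while SQS redelivery is broker-side, which is why the DLQ is "a bean in the route builder for SQS" that inspects the approximate delivery count (the speaker notes spell this out). A sketch under that assumption; the header exposing SQS's ApproximateReceiveCount attribute, and the queue name, are placeholders that vary by Camel version:

```java
import org.apache.camel.Exchange;
import org.apache.camel.Header;
import org.apache.camel.Produce;
import org.apache.camel.ProducerTemplate;

public class DeadLetterCheck {

    private static final int MAX_DELIVERIES = 3;  // 3-5 is common, per the notes

    @Produce(uri = "aws-sqs://transcode-dlq")     // placeholder DLQ name
    private ProducerTemplate deadLetterQueue;

    /**
     * Called first in the route. If SQS has already redelivered this message
     * too many times, park an exact copy on the DLQ (so it can be replayed
     * straight back onto the input queue) and stop routing it.
     */
    public void check(@Header("ApproximateReceiveCount") String receiveCount,
                      Exchange exchange) {
        if (receiveCount != null
                && Integer.parseInt(receiveCount) > MAX_DELIVERIES) {
            deadLetterQueue.sendBody(exchange.getIn().getBody());
            exchange.setProperty(Exchange.ROUTE_STOP, Boolean.TRUE);
        }
    }
}
```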
40. Message Patterns – Fail Queue
Something I knew could go wrong went wrong
• Wrapped in a message wrapper that contains context
• Requires some level of knowledge of the system to be retried
• Often evolve from understanding the causes of DLQ’d messages
• Implemented as an exception handler on the route builder
44. Please give us your feedback on this presentation (MED302)
Speaker notes
Media here refers to video and audio content. Maybe you’re a media and entertainment company or build apps and websites that work with user generated content.
Want to get a feel for the audience. Raise your hand if you do media processing in the cloud today. Raise your hand if you're a developer. OK, for those of you who are developers, have a nap and Phil will wake you up with a video in a few minutes.
Start by talking about media workflows. Main point is there are many workflows. Use media workflows to go from what's on the left to what's on the right. Steps themselves are generally pretty straightforward. Industry trends that are making workflows more complex: More content: at the pro end, look at all the content on the left. On the consumer end, everyone is carrying around a 1080p camcorder. And the more content there is, the greater the opportunity to monetize it. Bigger content: the industry is moving to some combination of more pixels, faster pixels, and better pixels. More pixels: 4K and beyond (4x pixels compared to 1080p). Faster pixels: higher frame rates; 48 fps is 2x the current cinema frame rate. Better pixels: higher dynamic range and brighter pixels, increased bit depth. More processing: the amount of processing is going up, not down. At the high end, whether it is a commercial, a TV show, or a movie, most shows contain visual effects. Even in corporate video, color correction is becoming a standard part of the workflow. And at the consumer level, all those Instagram-like filters require processing. More output formats: not just renditions based on devices but also versions. One senior industry figure recently told me that a piece of finished content will have been converted 1000 times! So all of these trends have an impact on workflows, especially when you factor in constrained budgets and timeframes.
To give you context for what follows in Phil's session, I thought I'd cover where AWS fits and then some approaches we've seen for doing media processing at scale in the cloud. As you know, AWS provides infrastructure services: compute, networking, database, storage and delivery, and so on. We also provide application services and deployment and management services. Using these services as your "software-defined datacenter", you can build media processing workflows. Typical operations in a media workflow would run on top of the AWS services. These operations could be provided by software that you've developed, or they might be from another vendor, like Aspera for ingest or Tektronix for video QC. On top of all that you'd have media applications: perhaps an online video platform, a production management application, a digital dailies system, or visual effects. So that's where AWS fits. Now let's look at some approaches for doing media processing on AWS.
A useful way to think about any kind of processing in the cloud is that there are 3 phases or approaches.
The first phase is simply taking what you do today and deploying it on AWS. This is the way a lot of people get started.
You take your on-premises deployment on the left and run it on EC2. Your media processing operation runs on an operating system and storage, both of which are provided by EC2. You can spin up multiple instances of these, and that's a way to give you scale and/or redundancy. But let's look closer at this "lift and shift" approach.
Let's break open that media processing operation black box and see what's inside. What we find are discrete operations, only one of which is the actual media processing operation – for example transcoding or scaling or feature extraction. So is there perhaps an opportunity to break apart the black box and derive some benefit?
That brings us to phase 2, which is about refactoring – or breaking things apart and putting them back together again in a different way – and optimizing your media processing operation. By doing this you might find ways to better use some of the features of AWS because we give you a lot of fantastic services for doing things like automatically scaling or distributing jobs or storing objects.
The cornerstone of phase 2 is to break apart monolithic operations. In our black box, we had these operations. Do they all need to happen inside one logical unit? Probably not. Are there benefits to breaking them apart? Absolutely. Why have each EC2 instance do its own ingest? Why have workflow that is an island?
So here's a refactored example. What hasn't changed is that we have our media processing operation – but only the operation itself – taking place on EC2 instances. But now we're using S3 to store the input content and the output content. Maybe we've used Aspera or some other ingest technology to get the content there. Then we're using Simple Workflow to manage the workflow operations across the various EC2 instances, and we're using APIs to have each element talk to the others. This lets us use the scale of S3 and SWF so that you don't need to worry about it. Also, instead of having a handful of EC2 instances running the monolithic application, we can have a fleet of instances running the essential media processing operation – decoupled from the rest of the workflow – and the external workflow engine will send the media processing job to the appropriate instance. So if an instance has a problem, the job won't go there, giving you better resiliency. If an instance dies, another one can spin up automatically, giving you redundancy.
The third phase builds on the second phase and decomposes your architecture still further. You're now at the point where you are primarily writing or wrapping very atomic pieces of code that perform specific operations and leverage the AWS infrastructure for everything else.
Some ways to do this are to decouple everything: you want to understand which parts of the architecture need to know about the implementation details of another part. Chances are that they do not. You also want to make sure that if an operation fails somewhere, the job itself does not get lost, and this is where workflow management and queues come in. You also want to design your components so that when you instantiate them, they figure out what they are supposed to do. For example, you might have a media processing worker that starts up and queries what kind of instance type it is running on, so that it knows how much work it can do or if there are additional capabilities that it can advertise to the rest of the system. This is a good time to think about how you are managing the attributes that you really care about in your system. For capacity: where are the bottlenecks, and what can you do when you need to overcome them? For redundancy: how do you make sure that each of your components is redundant? Is latency a concern? For many media processing operations, it probably is. So how can you manage it, reduce it, and make it predictable? Are you architecting security into every component and layer of your system? So that concludes my brief overview of approaches to running media processing workloads on AWS. Now I'd like to welcome Phil Cluff, the team lead for taking the BBC iPlayer video service into the cloud. He's going to show you how they moved their broadcast playout-to-VOD system into AWS to give them scalability, reliability, and elasticity.
Introduction: Phil Cluff, Principal Software Engineer & Team Lead @ BBC Media Services. Been with the BBC for 3½ years, focused on transcode architectures, message-oriented middleware, and reliable, distributed systems in the cloud! I'm going to talk to you about BBC iPlayer and our journey into the cloud.
Hopefully you've all heard of the BBC, but you may not have all heard of iPlayer. So what is BBC iPlayer? UK online population is about 40m, which is the size of the state of California.
Now we'll watch a short video produced by the BBC Director General, Tony Hall, which shows you where iPlayer has come from, and where we see it going in the future.
As I said, I'm here to talk to you about Video Factory. So what is Video Factory? Read slide, plus: "We actually started building Video Factory 1 year ago this week – I was putting together the final designs for our transcode architecture before I flew out to re:Invent this time, last year."
Old: Designed with a very ambitious throughput in mind, 5 years ago, but the industry has moved on: new devices, delivery methods, throughput increases. New: Full control to deploy & manage our applications, and change quickly in a changing marketplace.
Regional opts: 18 channels, all on at once, 6 days a week. Want to transcode them all at the same time, but not to have those encoders hanging around idle at other times. Previously it has taken 9-12 hours for the queue to move through our system. It's news content: people want it while it's still relevant. New system designed to cope with this (and more) throughput spikes.
Be really clear on the mezzanine definition since the next 4 slides depend on it. Mention mezzanine video capture is classic broadcast technology. Make note of the "Time Addressable Media Store".
We’re going to look at two areas in detail – Mez capture & transcode abstraction.
On-premises encoders produce MPEG2 transport streams from SDI onto RTP multicast. Capture RTP and split into chunks. Upload chunks to an S3 bucket. Re-construct chunks only when required for transcode.
Vendor lock-in is particularly important in SaaS models. I suggest you always have several options.
So let’s take a look inside our transcode abstraction layer
So let's take a look inside an example transcode backend and think about how we might build one.
Mention that the transaction runs as long as the transcode – Camel renews
Give a one-sentence summary of Camel. Give an overview of BDD, TDD & Cucumber. Why is continuous deployment important? What happens if a deployment goes pear-shaped?
We use several; the key concept in all of these is that you never lose a message.
The message doesn't unmarshal to the JAXB object it should (e.g. not XML, or a different type of message), or we could unmarshal the object but it doesn't meet our validation rules (e.g. source must not be null). Wrapped in a message wrapper which contains the original message (escaped) and the exception message. Never retried. Always requires developer-level intervention; suggests a component version mismatch. Very rare in production systems; sometimes caused by humans manually crafting messages. Implemented as an exception handler on the route builder.
We tried processing the message a number of times, and something went wrong each time that we weren't expecting, e.g. a dependent system is down, network connectivity issues, or (frequently) a "completely unexpected code path". The message is an exact copy of the input message, so it can be replayed directly onto the input queue; more detail about what caused it can be found in the eventing framework (Splunk). Retried several times before being put on the DLQ; 3-5 is common. 24/7 operations-level intervention, usually to fix the dependent system and then replay messages. Can be common, even in production systems, but suggests you may need to improve dependent systems or increase your retry count. Implemented as a bean in the route builder for SQS: check "Approximate Delivery Count" before attempting to do any processing on a message, and redirect the message to the DLQ if necessary. Or broker-side (e.g. ActiveMQ).
Something I was expecting to go wrong went wrong, e.g. the state of a dependent system wasn't what is required, or a command line tool I use returned non-zero but I think the tool is likely dependable (i.e. a retry won't help). Wrapped in a message wrapper which contains the original message (escaped) and the exception message. Requires some level of knowledge of the system to be retried: 24/7 operations-level intervention with a runbook, or second-line support. We have a console which unwraps the message and replays it. Often evolve from understanding the causes of DLQ'd messages. Implemented as an exception handler on the route builder.