"Each year, the technical complexity of making the next great Walt Disney Animation Studios film increases. Animation and Visual FX studios continue to push the bounds of what is possible in computer graphics. This complexity drives rapid technological growth in both computational resources and storage to the point that it exceeds what we can physically provide with our on-premise compute cluster. As a result, we have started to adopt a hybrid approach with the cloud.
This session addresses the hurdles that animation and VFX studios face and focuses on automation of 'disposable' components (specifically infrastructure, licensing, fleet management, data and dependency management in a large-scale batch workload). We apply these general cloud techniques and utilities to an animation/VFX workload and push the limits with a very large scale cloud renderfarm deployment.
The team from Walt Disney Animation Studios walks through how they use cloud technologies to maximize render capacity. Learn how to leverage high-performance storage (like Amazon EFS), Amazon EC2 networking and the latest EC2 Spot features to provide a fully functional renderfarm at production-quality scale."
2. Who is using AWS for rendering?
1. Visual Effects and Animation
2. Marketing
3. Theme Parks
4. Manufacturing
5. Gaming
6. Life Sciences
7. Engineering and Architecture
7. The challenge of making a film
On-premises capacity
Rendering in the cloud
8. The challenge of making a film
On-premises capacity
Rendering in the cloud
Cloud provides you the capability to scale fast and get the outputs faster
Initial project on-boarding
artwork
9. A tale of two customers

                            | A boutique studio                 | Walt Disney Animation Studios
On-Premises Hardware        | No or very little investment      | A significant investment
Licenses                    | Limited                           | Unlimited
Project Structure           | Project-based, from other studios | Internal customers/projects
Budget Constraints          | Time and resources                | Time and resources
Compute Needs               | Large scale                       | Very large scale
Infrastructure Efficiencies | No or very little                 | On-premises infrastructure optimized for the rendering workload
Cloud Model                 | Mostly all-in                     | Mostly hybrid
Security                    | Mandated by customers             | Required due to high-value assets
10. They both ask us the same thing…
The ability to spin up thousands of cores on-demand
…without any upfront investment
…and leveraging the most up-to-date configurations
A project-based “disposable” infrastructure
…with flexible, by-the-hour utility licensing
11. They both tell us the same thing…
≤ $0.01 per core-hour
Access to thousands of cores whenever needed
No upfront investments in infrastructure
Easier collaboration
Ecosystem of software providers
Access to large-memory configs to do 6K/10K renders
Project-based “disposable” infrastructure
12. …when the rubber meets the road!
Shared FS everywhere / Latency / Large datasets / Lots of instances
{Data/Content}
14. Rendering in the Cloud - State of the Union
Scale at a very low price
EC2 Spot
15. Leveraging Spot successfully today requires some effort
Build stateless, distributed, scalable applications
Choose which instance types fit your workload best
Ingest price-feed data for AZs and regions
Make runtime decisions on which Spot pools to launch in, based on price and volatility
Manage interruptions
Monitor and manage market prices across AZs and instance types
Manage the capacity footprint in the fleet
…and all of this while you don’t know where the capacity is
Serve your customers
16. Spot Fleet
Instead of writing all that code to manage Spot Instances, simply specify:
• Target Capacity – The number of EC2 instances that you want in your fleet.
• Maximum Bid Price – The maximum bid price that you are willing to pay.
• Launch Specifications – The number and types of instances, AMI ID, VPC, subnets or AZs, etc.
• IAM Fleet Role – The name of an IAM role that allows Amazon EC2 to terminate instances on your behalf.
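As a sketch of what such a request looks like, here is a minimal Spot Fleet configuration built with plain Python dicts. The AMI ID, role ARN, and subnet are hypothetical placeholders; in practice you would pass the result to boto3 as `boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=config)`.

```python
def build_fleet_config(target_capacity, max_bid, ami_id, fleet_role_arn, subnet_id):
    """Assemble a Spot Fleet request: target capacity, maximum bid price,
    launch specifications, and the IAM fleet role."""
    instance_types = ["r3.2xlarge", "r3.4xlarge", "r3.8xlarge"]
    return {
        "TargetCapacity": target_capacity,    # capacity units you want in the fleet
        "SpotPrice": str(max_bid),            # maximum bid price you will pay
        "IamFleetRole": fleet_role_arn,       # lets EC2 terminate instances on your behalf
        "AllocationStrategy": "diversified",  # spread capacity across Spot pools
        "LaunchSpecifications": [
            {"ImageId": ami_id, "InstanceType": t, "SubnetId": subnet_id}
            for t in instance_types
        ],
    }

# Hypothetical identifiers, for illustration only.
config = build_fleet_config(
    target_capacity=20,
    max_bid=0.50,
    ami_id="ami-12345678",
    fleet_role_arn="arn:aws:iam::123456789012:role/my-fleet-role",
    subnet_id="subnet-abcdef12",
)
```

Listing several instance types in the launch specifications is what lets the fleet diversify across Spot pools, which the next slide's weighting example builds on.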
17. Spot Fleet Example – Instance Weighting
Say your workload needs at least 60 GB of memory, and you want capacity to complete 20 units of work
Choices:
• r3.2xlarge (61.0 GB, 8 vCPUs) = 1 unit of 20
• r3.4xlarge (122.0 GB, 16 vCPUs) = 2 units of 20
• r3.8xlarge (244.0 GB, 32 vCPUs) = 4 units of 20
One option is to bid on all of these instance types.
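The weighting arithmetic above can be sketched in a few lines; the fleet mix below is one hypothetical fill that reaches the 20-unit target, not a prediction of what Spot would actually launch.

```python
# Weights from the slide: units of work each instance type supplies,
# based on how many 60 GB workloads its memory can hold.
WEIGHTS = {"r3.2xlarge": 1, "r3.4xlarge": 2, "r3.8xlarge": 4}

def fulfilled_capacity(fleet):
    """Sum the weighted capacity of the instances a fleet has launched."""
    return sum(WEIGHTS[itype] * count for itype, count in fleet.items())

# One of many mixes that satisfies a target capacity of 20 units:
mix = {"r3.4xlarge": 6, "r3.8xlarge": 2}   # 6*2 + 2*4 = 20 units
```

Because any mix that sums to the target is acceptable, Spot Fleet can chase the cheapest pools at any moment instead of being locked to one instance type.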
18. AWS cloud scale is “large”
• 10s/100s/1,000s/10,000s of cores on demand in the cloud
• A “large” (Walt Disney Animation Studios) renderfarm: 55,000 cores
• In this demo: ~40,000 vCPUs on the EC2 Spot market
Rendering in the Cloud - State of the Union
Scale at a very low price
19. Rendering in the Cloud - State of the Union
Licensing at Cloud Scale
• BYOL
• SaaS
• AWS Marketplace
• Elastic licensing models
Thinkbox Deadline Usage-Based Licensing
• Render nodes pull metered licenses from a cloud-based license server
• Usage is tracked per minute
• Bulk minutes will be available via Thinkbox’s online store
• The store will eventually host third-party licensing (Nuke, V-Ray, etc.)
Autodesk Maya
20. Rendering in the Cloud - State of the Union
Hydrating the Cloud Renderfarm
Amazon S3 as the source of truth for your content/data
• Via AWS Marketplace/SaaS tools (Aspera, Signiant, FileCatalyst, ExpeDat)
• Amazon S3 multipart upload
Direct to shared file systems
• Amazon EFS throughput scales linearly with storage
• Lustre can hydrate from an S3 bucket
• Avere can front Amazon S3 or an on-premises NAS
+ AWS Direct Connect
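Multipart upload splits a large object into fixed-size parts that upload in parallel, which is what makes it effective for hydrating big scene data into S3. A minimal sketch of the part arithmetic, assuming a 64 MB part size (boto3 automates this for you via `TransferConfig` in `boto3.s3.transfer`, where the threshold and chunk size are configurable):

```python
def multipart_ranges(total_size, part_size):
    """Split an object into (start, end_exclusive) byte ranges, one per
    upload part; every part except possibly the last is part_size bytes."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size)
        ranges.append((start, end))
        start = end
    return ranges

MB = 1024 * 1024
# A 150 MB asset with 64 MB parts -> 3 parts: 64 MB, 64 MB, 22 MB
parts = multipart_ranges(150 * MB, 64 * MB)
```

Each range can be uploaded by an independent worker, so aggregate throughput scales with concurrency rather than with a single TCP stream.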
21. Rendering in the Cloud - State of the Union
Shared File System Everywhere (some ideas)
[Diagram: on-prem storage and an Avere FXT connect over AWS Direct Connect to Amazon S3; a storage cache (Lustre on EC2, Avere on EC2, or EFS) provides the shared storage; hydrate workers run on EC2 Spot]
22. Rendering in the Cloud - State of the Union
NFS/CIFS (Content/Data Share) Everywhere (some ideas)
Elastic File System
• Designed to support petabyte-scale file systems
• Throughput scales linearly with storage
• Same latency spec across each AZ
• Thousands of concurrent NFS connections
• Works great for large I/O sizes
• Pay only for what you use, not what you provision
• Managed, with multi-copy durability
23. Rendering in the Cloud - State of the Union
Move the Graphic Artist to the Cloud …
• NVIDIA GPU-based EC2 instances
• Teradici PCoIP
• Frame, Otoy
• Windows and Linux (VNC+VirtualGL)
24. Rendering in the Cloud - State of the Union
Managing your “disposable” infrastructure
Launch a CloudFormation stack with all the infrastructure resources for a specific project
Automatically scale the stack as appropriate
[Diagram: AMI + CloudFormation template → launch stack → terminate]
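A minimal sketch of that per-project stack lifecycle, assuming a hypothetical template URL and a hypothetical `FarmSize` template parameter. The commented-out calls are the standard boto3 CloudFormation client methods (`create_stack`, `delete_stack`), kept out of the runnable path since they need AWS credentials:

```python
def stack_request(project, template_url, farm_size):
    """Build the create_stack arguments for one project's renderfarm.
    One stack per project keeps the infrastructure disposable: delete
    the stack and the project's resources go with it."""
    return {
        "StackName": f"renderfarm-{project}",
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": "FarmSize", "ParameterValue": str(farm_size)},
        ],
    }

# Hypothetical project and template location.
req = stack_request("demo-short", "https://example.com/renderfarm.yaml", 500)

# With credentials configured, the lifecycle is two calls:
#   cfn = boto3.client("cloudformation")
#   cfn.create_stack(**req)                       # project starts
#   cfn.delete_stack(StackName=req["StackName"])  # project wraps
```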
25. Rendering in the Cloud - State of the Union
The Crown Jewels
• AWS alignment with the latest MPAA cloud-based application guidelines for content security – August 2015
• VPC private endpoint for Amazon S3 – enables a true private workflow capability
• Encryption and key-management capabilities
• Amazon Glacier vault for high-value media/originals
26. Rendering in the Cloud - A Sample Architecture (All-in-Cloud Pipeline)
[Diagram: on-prem storage connects over AWS Direct Connect to Amazon S3; a storage cache (Avere on EC2) and EFS provide the shared storage; a scalable renderfarm and hydrate workers run on EC2 Spot; a pipeline and license manager runs on EC2; the 3D modeler works from a modeling dumb client through remote app visualization (AppStream or Teradici on a G2 instance)]
27. Rendering in the Cloud - A Sample Architecture (A Hybrid Pipeline)
Cloud renderfarm as an extension of the on-prem renderfarm
[Diagram: on-prem renderfarm, storage, and Avere FXT, with a pipeline and license manager that also manages the cloud renderfarm; AWS Direct Connect links to Amazon S3; a storage cache (Avere on EC2) and EFS feed a scalable renderfarm and hydrate workers on EC2 Spot]
29. Disney Animation Renderfarm
[Diagram: artists in San Francisco, Los Angeles, and Burbank; the WDAS data center and two remote data centers each run a renderfarm with an Avere FXT cluster, one site also holding the storage; the sites are linked by redundant 10 Gb connections]
30. Disney Animation’s Environment
• 90% Red Hat Enterprise Linux 6, 8% Mac OS X
• 1Gb/s Ethernet to clients, 10Gb/s to most servers
• Clients are bursty, not generally bandwidth constrained
• Major Applications:
• Hyperion (GI Renderer)
• Maya
• Houdini
• Nuke
• Coda (Scheduler)
31. Disney Animation’s Environment
• NFS v3 Everywhere
• 5-7 petabytes
• 500 TB working-set
• 100 TB/week of data churn
• Global namespace
• Lots of metadata operations
• Serve everything out of RAM/SSD
• Renderfarm Footprint
• 55,000 core renderfarm
• 1.1 million render hours per day
• 200,000-400,000 tasks per day
• Typical render
• 8-16 threads, 64 GB
• 3-5 hours per task
32. Disney Animation Renderfarm
[Diagram: the on-prem environment from slide 29 (renderfarms with Avere FXT clusters across the WDAS and remote data centers, redundant 10 Gb links, artists in San Francisco, Los Angeles, and Burbank) extended to a virtual private cloud in Oregon running an Avere vFXT, EFS, and Spot Instances, over a 10 Gb primary / 1 Gb backup link]
33. Mostly Automated Deployment
• Pre-built EBS-backed AMI
• Heavily customized RHEL
• Python/Boto3
• Pass in how many resources you need and the minimum instance size
• Calculates resource weights
• Needs to calculate pricing
• User data
• RAIDs ephemeral disks, if available, for scratch space
• Integrates with the on-premises environment (DNS, asset inventory, Puppet)
• Creates EC2 tags
• Runs Puppet to pick up changes since AMI build time
• Joins the render queue and asks for work
• Scale-up/down is still a manual process
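The "RAIDs ephemeral disks" step can be sketched as below: the function emits the shell commands a user-data script would run, so the logic is testable without an instance. The device names, RAID level, and mount point are illustrative assumptions, not Disney's actual user-data.

```python
def scratch_raid_commands(disks):
    """Return the shell commands to turn an instance's ephemeral disks
    into RAID-0 scratch space mounted at /scratch (device names vary by
    instance type; on an instance you might discover them with a glob
    such as /dev/xvd[b-e])."""
    if not disks:
        return []          # no ephemeral disks: skip scratch setup
    if len(disks) == 1:
        dev = disks[0]     # single disk: no RAID needed
        cmds = []
    else:
        dev = "/dev/md0"
        cmds = [
            f"mdadm --create --run {dev} --level=0 "
            f"--raid-devices={len(disks)} " + " ".join(disks)
        ]
    cmds += [f"mkfs.ext4 -q {dev}", "mkdir -p /scratch", f"mount {dev} /scratch"]
    return cmds

cmds = scratch_raid_commands(["/dev/xvdb", "/dev/xvdc"])
```

RAID-0 trades redundancy for bandwidth, which suits render scratch space: the data is regenerable, and a lost Spot instance loses its scratch anyway.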
42. Rendering in the Cloud vs. On-Premises
[Chart: render time in seconds (0-30,000; lower is better) per frame (frames 1-90), comparing EC2/EFS against on-prem]
43. Lessons Learned
• Use as many different instance types as you can, especially older generations
• Think about ways to modify your workload
• Use every Availability Zone
• Check your limits, especially your Amazon EBS limit and VPC setup (address space)
• Resource-oriented bidding
• Diversified allocation
• Benchmark your workload and set pricing accordingly
• Set only realistic prices that you are willing to pay
• Don’t be afraid to ask AWS for help or for pre-planning your run
44. Conclusion
• Cloud rendering on AWS - State of the Union: it is getting stronger…
• Rendering forecast: partly cloudy with a chance of all-in the cloud…
• Future research
• Storage hydration: distribute across many clients to saturate EFS throughput
• Storage for processing: read freely and batch the writes (for shared-FS performance)
• Latency is a killer: keep workflows atomic within a single AZ/region; use caching appliances