At this year’s annual Design Automation Conference (DAC 2020), Rob Lalonde and Bill Bryce of Univa partnered with representatives from Google and Synopsys to discuss EDA in the Cloud and share best practices related to cloud migration.
1. 1
Tutorial & Best Practices:
Running EDA Workloads in the Cloud
Rob Lalonde, VP & GM Cloud
Bill Bryce, VP Products
2. 2
About Univa
• Leader in HPC workload management
• 250 global customers
• Hybrid, dedicated, private clouds
• 3.3M+ cores under management
• EDA, Manufacturing, Life Sciences, Oil & Gas,
Government, Research & Edu, Transportation
• Trusted by leading manufacturers
3. 3
Key focus area: Optimize cloud workloads
Advanced workload management and resource sharing
• Accelerate regression testing with high-throughput workload scheduling
• Share resources optimally between diverse workloads and different design efforts
• Maximize EDA license utilization with license orchestration software
Cloud migration, automation, and spend management
• Easily extend on-prem environments to the cloud to meet peak demand
• Deploy cloud resources optimally for each simulation and place workloads correctly
• Maximize the efficiency of cloud resource usage with automation and spend management
4. 4
2019 Univa InsideHPC cloud survey results
• 92% are using or open to HPC cloud, up 50% from 2017
• 64% say cloud has proven value or high potential
• Respondents see value in cloud spend association, BUT 76% have no automated solution (27% need help, 27% manual, 22% other)
• Dedicated or hybrid cloud? 34% dedicated, 20% hybrid, 47% both
• SLURM and Grid Engine represent the majority of HPC cloud workloads
[Infographic: remaining panels cover monthly cloud spend (< $10K, $10K to $100K, > $100K), workloads in production, SLURM or Grid Engine usage, and power users. The values 84%, 27%, 50%, 31%, 54%, 77%, 8%, and 75% appear in these panels, but their labels did not survive extraction.]
Univa sponsored survey – 2019 InsideHPC: Cloud Adoption for HPC: Trends and Opportunities
https://insidehpc.com/white-paper/cloud-adoption-for-hpc-trends-and-opportunities/
5. 5
What customers tell us
• Increasing design complexity, higher gate counts
• Need for higher quality & reliability driving coverage requirements
(IoT, SoC embedded, medical devices, etc.)
• Shorter product cycles, time-to-market
• Many simulation types: analog, digital, functional, system-level,
multi-physics, ML
• Need to maximize EDA tool utilization
• Limited data center capacity and IT budgets
More than users in any other industry, EDA users are
continuously challenged to do more with less
6. 6
A typical design environment
[Diagram] Components shown:
• Interactive users working on Project A, Project B, and Project C
• License Server(s): FlexNet Publisher, EDA software licenses, license sharing policies
• Compute servers: general-purpose simulation, high-throughput, place and route
• Workload Management, Univa License Orchestrator
• On-premise Infrastructure: managed network, uniform DNS name-space
• Cloud Instances: managed network, uniform DNS name-space
• Cloud APIs, instance provisioning
Workloads shown:
• Gate Level Simulations (GLS)
• Register Transfer Level Simulations
• Transistor Level Modeling (TLM)
• Physical Verification
• Dynamic IR analysis
• Placement and clock optimization
• Static Timing Analysis (STA)
• Circuit Simulation
• Routing
7. 7
Use case #1: Cloud automation
Boost license utilization, reduce Capex
• EDA environments frequently have “bursty
workloads” – overlapping projects, different
resource requirements at different phases
• For cloud to be practical, cloud provisioning
needs to be automated and transparent to users
• “Bring-your-own-image” functionality (BYOI) for
straightforward cloud migration
• Automate runtime decisions to avoid
administrator effort and potential human error
• Maximize EDA license utilization to improve
overall productivity
CHALLENGE:
• Bursty simulation & verification workloads
• Need to defer/reduce CapEx
• On-premise cluster right-sized for day-to-day workloads
• On-premise EDA licenses underutilized
SOLUTION:
• Hybrid Cloud – Navops Launch, Univa Grid Engine
• Auto-scale cloud capacity based on workloads
• Automated data migration to and from the cloud
• Analytics and license management
BUSINESS VALUE:
• Avoid bottlenecks during critical tapeout periods
• Reduce costs - pay for cloud when needed
• Maximize on-premise license usage by shifting non-licensed
work to the cloud, improving overall productivity
Details at: https://blogs.univa.com/2020/01/mission-is-possible-tips-on-building-a-million-core-cluster/
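Sketch: the auto-scaling idea in this use case (keep licensed work on premise, burst non-licensed work to the cloud) can be illustrated with a small sizing rule. This is a minimal sketch under stated assumptions, not how Navops Launch or Univa Grid Engine implement it. The names QueueState and plan_burst, and the policy itself, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of the scheduler queue."""
    pending_licensed: int    # waiting jobs that need an EDA license feature
    pending_unlicensed: int  # waiting jobs that need no license
    free_onprem_slots: int   # idle job slots in the on-premise cluster


def plan_burst(state: QueueState, slots_per_instance: int, max_instances: int) -> int:
    """Return how many cloud instances to request (illustrative policy only)."""
    # Licensed jobs get first claim on free on-premise slots.
    slots_left = max(0, state.free_onprem_slots - state.pending_licensed)
    # Non-licensed jobs that do not fit on premise overflow to the cloud.
    overflow = max(0, state.pending_unlicensed - slots_left)
    wanted = math.ceil(overflow / slots_per_instance)
    # Cap the request so a burst cannot grow without bound.
    return min(wanted, max_instances)


if __name__ == "__main__":
    state = QueueState(pending_licensed=100, pending_unlicensed=500, free_onprem_slots=150)
    print(plan_burst(state, slots_per_instance=16, max_instances=100))
```

The cap (max_instances) is one simple way to keep automated decisions bounded; a real deployment would also scale down when the queue drains.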
8. 8
Use case #2: Cloud simulation at extreme scale
Deploying a 1M+ vCPU cluster
• EDA verification and regression tests can run for
days, accounting for approximately 80% of workloads
• Cloud capacity can dramatically reduce runtime
• Benefits: Reduced cycle time, more thorough
verification, higher quality, reduced schedule risk
• Many technical challenges solved: checkpointing,
reclaim rates, container registries, API calls etc.
CHALLENGE:
• Engineering design for next-gen hard disk drives
• Requires complex multi-physics simulations
• 2.5 million tasks require days on premise
• Need capacity for more complex designs
SOLUTION:
• Navops Launch – deployed 1M+ vCPU cluster in 90 mins
• 40,000 cloud instances, instances come and go
• Leveraged containerized workloads
• Lower costs with preemptible VMs, spot fleets
BUSINESS VALUE:
• ~60x reduction in runtime – 20 days to 8 hours
• Estimated 50% cost reduction vs on-prem resources
• Increased capacity for new product development
Details at: https://blogs.univa.com/2020/01/mission-is-possible-tips-on-building-a-million-core-cluster/
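Worked arithmetic: the figures on this slide can be cross-checked with a few lines of Python. The inputs are taken from the slide, and "1M+" is treated as exactly 1,000,000 vCPUs, which is an assumption. Instances come and go, so the per-instance figure is only a rough average.

```python
HOURS_PER_DAY = 24

# Runtime reduction: 20 days on premise versus 8 hours in the cloud.
onprem_hours = 20 * HOURS_PER_DAY
cloud_hours = 8
speedup = onprem_hours / cloud_hours

# Cluster shape: 1M+ vCPUs (lower bound assumed) across 40,000 instances.
total_vcpus = 1_000_000
instances = 40_000
vcpus_per_instance = total_vcpus / instances

# Work per vCPU: 2.5 million tasks spread over the cluster.
tasks = 2_500_000
tasks_per_vcpu = tasks / total_vcpus

print(f"speedup: {speedup:.0f}x")
print(f"average vCPUs per instance: {vcpus_per_instance:.0f}")
print(f"tasks per vCPU: {tasks_per_vcpu:.1f}")
```

The result agrees with the "~60x" claim: 480 hours divided by 8 hours is 60.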
9. 9
Use case #3: Optimize cloud instance selection
• Different tools have different requirements
• For licensed tools, it can be more economical to
underutilize machine resources!
• Optimizing selection is a function of license and
instance costs, and tool performance
[Chart: time per simulation (s) versus simultaneous simulations per cloud instance (1 to 8), plotted for Instance A and Instance B]
Where should we operate?
Running 2 sims on instance A provides 37% better
throughput than running 8 sims on instance B, but
requires 4x the number of machine instances
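Sketch: the trade-off between license cost, instance cost, and tool performance can be expressed as a cost per simulation. This is a minimal sketch, assuming each running simulation holds one license for its full runtime. The timings and prices below are hypothetical and are not read from the chart. They are chosen only so that A at 2 sims is about 37% faster per simulation than B at 8 sims.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical operating point: an instance type at a given packing level."""
    name: str
    sims_per_instance: int
    seconds_per_sim: float
    instance_cost_per_hour: float


def cost_per_sim(p: OperatingPoint, license_cost_per_hour: float) -> float:
    """License plus instance cost for one simulation at this operating point."""
    hours = p.seconds_per_sim / 3600
    # Each simulation pays for one license and its share of the instance.
    hourly = license_cost_per_hour + p.instance_cost_per_hour / p.sims_per_instance
    return hourly * hours


def cheapest(points: Iterable[OperatingPoint], license_cost_per_hour: float) -> OperatingPoint:
    """Pick the operating point with the lowest cost per simulation."""
    return min(points, key=lambda p: cost_per_sim(p, license_cost_per_hour))


POINTS = [
    OperatingPoint("A", sims_per_instance=2, seconds_per_sim=73.0, instance_cost_per_hour=2.0),
    OperatingPoint("B", sims_per_instance=8, seconds_per_sim=100.0, instance_cost_per_hour=2.0),
]

if __name__ == "__main__":
    for license_cost in (0.5, 10.0):
        print(license_cost, cheapest(POINTS, license_cost).name)
```

With these illustrative numbers, the densely packed point B wins when licenses are cheap, while the underutilized point A wins once the license cost dominates. That is the sense in which underutilizing machine resources can be more economical.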
• Topology-aware placement yields further gains
(reducing simulation time, improving efficiency)
• Place workloads for socket/core affinity,
maximize cache per sim, NUMA considerations,
distribute load across memory & I/O channels
[Diagram: processor topology layout]
Example: AMD ROME EPYC 7Fx2 processor – Google Cloud N2D VMs
Closely controlling placement on VM drives greater efficiency
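Sketch: core-affinity placement can be illustrated by splitting a VM's cores into disjoint groups, one per simulation, so that simulations do not share cores. This is a minimal sketch. The function plan_affinity is hypothetical, and the numbering rule (logical CPU id = core * threads_per_core + thread) is an assumption. Real systems often number sibling threads differently, so the actual topology should be queried before pinning.

```python
from typing import List


def plan_affinity(n_cores: int, threads_per_core: int, n_sims: int) -> List[List[int]]:
    """Assign each simulation a disjoint, contiguous group of cores.

    Returns one list of logical CPU ids per simulation, under the assumed
    numbering: logical id = core * threads_per_core + thread.
    """
    if not 1 <= n_sims <= n_cores:
        raise ValueError("need between 1 and n_cores simulations")
    base, extra = divmod(n_cores, n_sims)
    plan: List[List[int]] = []
    first_core = 0
    for i in range(n_sims):
        # The first `extra` simulations get one additional core each.
        count = base + (1 if i < extra else 0)
        cpus = [
            core * threads_per_core + thread
            for core in range(first_core, first_core + count)
            for thread in range(threads_per_core)
        ]
        plan.append(cpus)
        first_core += count
    return plan


if __name__ == "__main__":
    for sim_id, cpus in enumerate(plan_affinity(n_cores=8, threads_per_core=2, n_sims=4)):
        print(sim_id, cpus)
```

On Linux, a plan like this could be applied per process with os.sched_setaffinity(pid, cpus). NUMA and cache-sharing boundaries are not modeled here.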
COMMON CHALLENGES FOR EDA SITES:
• Need reporting and license analytics to optimize selection
• Need smart policy-based instance selection at runtime
• Need granular resource scheduling / job placement
Instance selection Workload placement
10. 10
Use case #4: Share resources, manage spending
Share infrastructure and licenses
• Multiple project teams, multiple clusters
• Limited EDA feature licenses
• Need to allocate on-prem/cloud resources and
license features based on configurable policies
Manage cloud spend
• Need to track actual cloud spending and license
consumption by cost center/project
• Automated mechanisms to throttle cloud
spending when budgets are exceeded
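Sketch: tracking spend per project and throttling when a budget is reached can be illustrated as follows. This is a minimal sketch. The class SpendTracker, the three actions, and the 80% warning threshold are all hypothetical and do not describe any Univa product behavior.

```python
from typing import Dict


class SpendTracker:
    """Hypothetical per-project cloud spend tracker with a simple throttle rule."""

    def __init__(self, budgets: Dict[str, float]) -> None:
        self.budgets = dict(budgets)
        self.spent = {project: 0.0 for project in budgets}

    def record(self, project: str, instance_hours: float, hourly_rate: float) -> None:
        """Attribute the cost of consumed instance hours to a project."""
        self.spent[project] += instance_hours * hourly_rate

    def action(self, project: str, warn_at: float = 0.8) -> str:
        """Return "allow", "throttle", or "block" for new cloud requests."""
        used = self.spent[project] / self.budgets[project]
        if used >= 1.0:
            return "block"      # budget reached or exceeded
        if used >= warn_at:
            return "throttle"   # close to the limit: slow new provisioning
        return "allow"


if __name__ == "__main__":
    tracker = SpendTracker({"projA": 1000.0})
    tracker.record("projA", instance_hours=100, hourly_rate=2.0)
    print(tracker.action("projA"))
```

Keying the tracker by project (or cost center) is what makes spend association possible. The same keys could be reused for license consumption reports.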
[Diagram: server-managed licenses (FLEXERA Publisher #1, FLEXERA Publisher #1, and additional tools) shared by three groups of users and clusters, each attached through an LO CONNECTOR]
11. 11
Summary
• Cloud can provide significant additional capacity to speed regression
tests and other EDA workloads
• The key to making cloud cost-efficient is automation, efficient
provisioning, and minimizing impact on existing applications
• Operating at scale requires specific software features for provisioning
and scheduling – it’s challenging to keep cloud-scale clusters busy!
• Placing workloads optimally is key to maximizing the use of EDA
licenses and improving overall throughput and efficiency
• Cloud spend association & management is critical – many
organizations lack automated mechanisms to track and control
spending