3. Development of models begins at small scale.
Working on your laptop is convenient and simple, but the actual analysis is slow.
“Scaling up” typically means a small server or a fast multi-core desktop.
That gives some speedup, but for very large models the gain is not significant.
Single machines don't scale up forever.
5. High-Performance Computing involves many distinct computer processors working together on the same calculation.
Large problems are divided into smaller parts and distributed among the many computers.
Usually these are clusters of quasi-independent computers coordinated by a central scheduler.
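As a rough illustration of the split-and-distribute idea (a minimal sketch, not from the deck; the workload and chunk counts are hypothetical), here is Python code that divides one large calculation into smaller parts and hands them to a pool of worker processes, much as a cluster scheduler hands chunks to compute nodes:

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Hypothetical stand-in for the work done on one part of a large problem.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        n_chunks = 8
        # Divide the large problem (here, the numbers 0..999,999) into smaller parts...
        chunks = [range(i, 1_000_000, n_chunks) for i in range(n_chunks)]
        # ...and distribute the parts among several worker processes.
        with Pool(processes=n_chunks) as pool:
            partial_results = pool.map(process_chunk, chunks)
        print(sum(partial_results))

On a real cluster the same pattern applies, but the central scheduler distributes the chunks across many machines rather than across the cores of one machine.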
7. Performance gains
[Chart: model run duration (s) versus number of cores, with a high-end workstation shown for comparison]
Performance test: stochastic finance model on an R Systems cluster.
High-end workstation: 8 cores. Maximum speedup of 20x: 4.5 hrs → 14 minutes.
Scale-up is heavily model-dependent: 5x–100x in our tests; it can be faster.
No more performance gain after ~500 cores. Why? Some operations can't be parallelized.
Additional cores? Run multiple models simultaneously.
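The plateau is consistent with Amdahl's law (the slide does not name it, but the reasoning is the same): once the non-parallelizable fraction of the work dominates, extra cores stop helping. A small Python sketch, with the 95% parallel fraction chosen purely as an assumed illustration:

    def amdahl_speedup(parallel_fraction, cores):
        # Amdahl's law: overall speedup is limited by the serial part of the job.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

    # Assuming 95% of the work parallelizes (an illustrative figure, not a measurement):
    for n in (8, 64, 500, 5000):
        print(n, "cores ->", round(amdahl_speedup(0.95, n), 1), "x")
    # The speedup approaches 1 / 0.05 = 20x no matter how many cores are added.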
8. Performance comes at a price: complexity.
New paradigm: real-time analysis vs batch jobs.
Applications must be written specifically to take advantage of distributed computing.
Performance characteristics of applications change.
Debugging becomes more of a challenge.
9. New paradigm: real-time analysis vs batch jobs.
Most small analyses are done in real time:
  “At-your-desk” analysis
  Small models only
  Fast iterations
  No waiting for resources
Large jobs are typically done in a batch model:
  Submit job to a queue
  Much larger models
  Slow iterations
  May need to wait
10. Applications must be written specifically to take advantage of distributed computing.
Explicitly split your problem into smaller “chunks”
“Message passing” between processes
Entire computation can be slowed by one or two slow chunks
Exception: “embarrassingly parallel” problems
  Easy-to-split, independent chunks of computation
  “Embarrassingly parallel” = no inter-process communication
Thankfully, many useful models fall under this heading (e.g. stochastic models).
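As a hedged sketch of an embarrassingly parallel stochastic model (the payoff calculation and chunk sizes are hypothetical, not from the deck): each chunk gets its own random seed, runs with no communication with the other chunks, and the results are combined only at the end.

    from concurrent.futures import ProcessPoolExecutor
    import random

    def simulate_chunk(args):
        seed, n_paths = args
        # Each chunk is fully independent: its own RNG, no inter-process communication.
        rng = random.Random(seed)
        return sum(max(rng.gauss(0.0, 1.0) - 0.5, 0.0) for _ in range(n_paths))

    if __name__ == "__main__":
        chunks = [(seed, 50_000) for seed in range(100)]   # 100 independent chunks
        with ProcessPoolExecutor(max_workers=8) as pool:
            partial_sums = list(pool.map(simulate_chunk, chunks))
        print(sum(partial_sums) / (100 * 50_000))          # combine results only at the end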
11. Performance characteristics of applications change.
On a single machine:
  CPU speed (compute)
  Cache
  Memory
  Disk
On a cluster:
  Single-machine metrics
  Network
  File server
  Scheduler contention
  Results from other nodes
12. Debugging becomes more of a challenge.
More complexity = more pieces that can fail
Race conditions: sequence of events no longer deterministic
Single nodes can “stall” and slow the entire computation
Scheduler, file server, login server all have their own challenges
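A small hypothetical Python sketch of the non-determinism point (not from the deck): with asynchronous workers, results arrive in whatever order the chunks happen to finish, so any code that silently assumes a fixed ordering is a latent bug.

    import random
    import time
    from multiprocessing import Pool

    def work(task_id):
        # Simulate chunks that take unpredictable amounts of time.
        time.sleep(random.uniform(0.0, 0.1))
        return task_id

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            # imap_unordered yields results as they finish, not in submission order,
            # so the printed list can differ on every run.
            print(list(pool.imap_unordered(work, range(8))))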
13. External resources
One solution to handling complexity: outsource it!
Historical HPC facilities: universities, national labs
Often have the most absolute compute capacity, and will sell excess capacity
Access competes with academic projects and typically does not include an SLA or high-level support
Dedicated commercial HPC facilities provide “on-demand” compute power.
14. External HPC vs. Internal HPC
External HPC:
  Outsource HPC sysadmin
  No hardware investment
  Pay-as-you-go
  Easy to migrate to new tech
Internal HPC:
  Requires in-house expertise
  Major investment in hardware
  Possible idle time
  Upgrades require new hardware
15. Internal HPC vs. External HPC
Internal HPC:
  No external contention
  All internal: easy security
  Full control over configuration
  Simpler licensing control
  Requires in-house expertise
  Major investment in hardware
  Possible idle time
  Upgrades require new hardware
External HPC:
  No guaranteed access
  Security arrangements complex
  Limited control of configuration
  Some licensing complex
  Outsource HPC sysadmin
  No hardware investment
  Pay-as-you-go
  Easy to migrate to new tech
16. “The Cloud”
“Cloud computing”: virtual machines and dynamic allocation of resources at an external provider
Lower performance (virtualization), higher flexibility
Usually no contracts necessary: pay with your credit card, get 16 nodes
Often have to do all your own sysadmin
Low support, high control
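As a hypothetical sketch of that pay-as-you-go flexibility (not from the deck; it assumes an AWS account and uses boto3, the AWS SDK for Python, with a placeholder machine image ID), a few lines are enough to request 16 virtual machines:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch 16 identical virtual machines; the AMI ID below is a placeholder.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="c5.xlarge",
        MinCount=16,
        MaxCount=16,
    )
    print([i["InstanceId"] for i in response["Instances"]])

The trade-off from the slide still applies: provisioning is easy, but configuring and administering those machines is then your problem.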
18. Global insurance company
Needed 500-1000 cores on a temporary basis
Preferred a utility, “pay-as-you-go” model
Experimenting with external resources for “burst” capacity during high-activity periods
Commercially licensed and supported application
Requested a proof of concept
19. Cluster configuration
Application is embarrassingly parallel, with small-to-medium data files; computationally and memory-intensive
Prioritize computation (processors) and access to the fileserver over inter-node communication and large storage
Upgraded memory in compute nodes to 2 GB/core
128-node cluster: 3.0 GHz Intel Xeon processors, 8 cores per node, for 1024 cores total
Windows HPC Server 2008 R2 operating system
Application and fileserver on the login node
20. Stumbling blocks
Application optimization
The customer had a wide variety of models which generated different usage patterns (IO-, compute-, and memory-intensive jobs). This required dynamic reconfiguration for different conditions.
Technical issues
Iterative testing process. The application turned out to be generating massive fileserver contention; changes had to be made to both software and hardware.
Human processes
Users were accustomed to the internal access model. This required changes both for providers (increase ease-of-use) and users (change workflow).
Security
The customer had never worked with an external provider before. A complex internal security policy had to be reconciled with remote access.
21. Lessons learned:
Security was the biggest delaying factor. The initial security setup took over 3 months from the first expression of interest, even though cluster setup was done in less than a week.
This only mattered the first time, though: subsequent runs started much more smoothly.
A low-cost proof-of-concept run was important to demonstrate feasibility and to work the bugs out.
A good relationship with the application vendor was extremely important for solving problems and properly optimizing the model for performance.
23. Graphics processing units
CPU: complex, general-purpose processor
GPU: highly specialized parallel processor, optimized for the operations used in common graphics routines
Highly specialized → many more “cores” for same cost and space
Intel Core i7: 4 cores @ 3.4 GHz: $300 = $75/core
NVIDIA Tesla M2070: 448 cores @ 575 MHz: $4500 = $10/core
Also higher bandwidth: 100+ GB/s for GPU vs 10-30 GB/s for CPU
Same operations can be adapted for non-graphics applications: “GPGPU”
Image from http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/
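As a rough sketch of the GPGPU idea (not from the deck; it assumes an NVIDIA GPU with CUDA and the CuPy library, and the toy payoff calculation is hypothetical), the same array arithmetic NumPy would run on the CPU can be pushed to the GPU:

    import cupy as cp  # NumPy-like array library that executes on the GPU

    n = 10_000_000
    # Generate random numbers and do element-wise math on the GPU.
    x = cp.random.standard_normal(n)
    payoff = cp.maximum(x - 0.5, 0.0)   # toy payoff-style calculation on all paths at once
    estimate = float(payoff.mean())     # copies the single result back to the host
    print(estimate)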
24. HPC/Actuarial using GPUs
Random-number generation
Finite-difference modeling
Image processing
Numerical Algorithms Group: GPU random-number generator
MATLAB: operations on large arrays/matrices
Wolfram Mathematica: symbolic math analysis
Data from http://www.nvidia.com/object/computational_finance.html