Debugging Numerical Simulations on Accelerated Architectures - TotalView for OpenPOWER, CUDA and OpenMP

TotalView for OpenPOWER, CUDA and OpenMP
Chris Gottbrath
Dean Stewart
May 18, 2015
ScicomP 2015

Agenda
• Introduction
• TotalView Overview
• OpenMP, OMPT and OMPD
• Current work and future plans
• Questions and wrap-up
© 2015 Rogue Wave Software, Inc. All Rights Reserved

About Rogue Wave Software
Rogue Wave’s proven technical solutions simplify the growing complexity of
building and testing quality software code. Rogue Wave customers improve
software quality and ensure code integrity, while shortening development cycle
times.
Founded: 1989
Headquarters: Boulder, CO
Employees: 250
Offices Worldwide: 9

Company timeline (1989 to 2015)
Technology Timeline
• 1989: Cross-platform commercial math & statistics libraries for C++
• 1994: First commercial database library for C++
• 2008: First graphical reverse debugger for C, C++ and Fortran on Linux
• 2010: First GPU enabled commercial analytics in FORTRAN
• 2011: First Infiniband cluster-capable reverse debugger; first cache optimization product to market
Corporate Timeline
• 1989: Rogue Wave established; Tools.h++
• 1996: Rogue Wave publicly listed on NASDAQ
• 2003: Rogue Wave acquired by Quovadx
• 2007: Rogue Wave spun out of Quovadx by Battery Ventures
• 2009: Rogue Wave acquires TotalView and Visual Numerics
• 2010: Rogue Wave acquires Acumem
• 2012: Rogue Wave acquires Visualizations for C++; Audax Group acquires Rogue Wave
• 2013: Rogue Wave acquires OpenLogic & Klocwork

Rogue Wave Solution Portfolio

HPC Trends
• What do we see
– NVIDIA Tesla GP-GPU computational accelerators
– Intel Xeon Phi Coprocessors
– Complex memory hierarchies (NUMA, device vs host, etc)
– Custom languages such as CUDA and OpenCL
– Directive based programming such as OpenACC and OpenMP
– Core and thread counts going up
• A lot of complexity to deal with if you want performance
– C or Fortran with MPI starts to look “simple”
– Everything is Multiple Languages / Parallel Paradigms
– Up to 4 “kinds” of parallelism (cluster, thread, heterogeneous, vector)
– Data movement and load balancing

How does Rogue Wave help?
• Troubleshooting and analysis tool
– Visibility into applications
– Control over applications
• Scalability
• Usability
• Support for HPC platforms and languages
TotalView debugger

What is TotalView®?
Application Analysis and Debugging Tool: Code Confidently
• Debug and Analyse C/C++ and Fortran on Linux™, Unix or Mac OS X
• Laptops to supercomputers
• Makes developing, maintaining, and supporting critical apps easier and less risky
Major Features
• Easy to learn graphical user interface with data visualization
• Parallel Debugging
– MPI, Pthreads, OpenMP™, Fortran Coarrays
– CUDA™, OpenACC®, and Intel® Xeon Phi™ coprocessor
• Low tool overhead and resource usage
• Includes a Remote Display Client which frees you to work from anywhere
• Memory Debugging with MemoryScape™
• Deterministic Replay Capability Included on Linux/x86-64
• Non-interactive Batch Debugging with TVScript and the CLI
• TTF & C++View to transform user defined objects

What Is MemoryScape®?
• Runtime Memory Analysis : Eliminate Memory Errors
– Detects memory leaks before they are a problem
– Explore heap memory usage with powerful analytical tools
– Use for validation as part of a quality software development process
• Major Features
– Included in TotalView, or Standalone
– Detects
• Malloc API misuse
• Memory leaks
• Buffer overflows
– Supports
• C, C++, Fortran
• Linux, Unix, and Mac OS X
• Intel® Xeon Phi™
• MPI, pthreads, OMP, and remote apps
– Low runtime overhead
– Easy to use
• Works with vendor libraries
• No recompilation or instrumentation

Deterministic Replay Debugging
• Reverse Debugging: Radically simplify your debugging
– Captures and Deterministically Replays Execution
• Not just “checkpoint and restart”
– Eliminate the Restart Cycle and Hard-to-Reproduce Bugs
– Step Back and Forward by Function, Line, or Instruction
• Specifications
– A feature included in TotalView on Linux x86 and x86-64
• No recompilation or instrumentation
• Explore data and state in the past just like in a
live process, including C++View transformations
– Replay on Demand: enable it when you want it
– Supports MPI on Ethernet, Infiniband, Cray XE Gemini
– Supports Pthreads, and OpenMP
– New: Save / Load Replay Information

Supported Platforms
Platforms | C/C++ Compilers | Fortran Compilers
Linux x86 | Gcc, Intel, PGI | Absoft, GNU, Intel, PGI
Linux x86-64 | Gcc, Intel, PGI | Absoft, GNU, Intel, PGI
Power Linux | Gcc, XL C++ | GNU, XL Fortran
RS6000 Power AIX | Gcc, XL C++ | XL Fortran
BlueGene | Gcc, XL C++ | XL Fortran

TotalView for the NVIDIA® GPU Accelerator
• NVIDIA CUDA 6.0, 6.5 and 7.0
• Features and capabilities include
– Support for dynamic parallelism
– Support for MPI based clusters and multi-card configurations
– Flexible Display and Navigation on the CUDA device
• Physical (device, SM, Warp, Lane)
• Logical (Grid, Block) tuples
– CUDA device window reveals what is running where
– Support for types and separate memory address spaces
– Leverages CUDA memcheck

• The following 6 slides are from an SC14 tutorial by:
• Damian Alvarez
– d.alvarez.mallon@fz-juelich.de
• Dr. Mike Ashworth
– mike.ashworth@stfc.ac.uk
• Vincent Betro, Ph. D.
– vbetro@utk.edu
• Chris Gottbrath
– Chris.Gottbrath@roguewave.com
• Nikolay Piskun, Ph.D.
– Nikolay.Piskun@roguewave.com
• Sandra Wienke
– Wienke@itc.rwth-aachen.de
11.17.2014 SC ‘14

• Setting breakpoints in CUDA kernels
– Start debugging (e.g. “Go”)
– A message box appears when the kernel is loaded
– Set kernel breakpoints as in host code

• Debugger thread IDs in Linux CUDA process
– Host thread: positive number
– CUDA thread: negative number
• GPU thread navigation
– Logical coordinates: blocks (3 dimensions), threads (3 dimensions)
– Physical coordinates: device, SM, warp, lane
– Only valid selections are permitted
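The relationship between a thread's logical coordinates and its warp and lane can be sketched as follows. This is a small Python model, not TotalView or CUDA code: the function names are made up for illustration, and it assumes the usual CUDA convention that threads in a block are flattened with x varying fastest and grouped into warps of 32 consecutive threads. Which device, SM and hardware warp slot a block lands on is decided at run time, so that part of the physical coordinate can only be read from the debugger.

```python
WARP_SIZE = 32  # warp width, as stated on the single-stepping slide


def linear_thread_id(thread_idx, block_dim):
    """Flatten an (x, y, z) thread index within a block, x varying fastest."""
    x, y, z = thread_idx
    dx, dy, dz = block_dim
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("thread index outside block dimensions")
    return x + y * dx + z * dx * dy


def warp_and_lane(thread_idx, block_dim):
    """Return (warp index within the block, lane within that warp)."""
    return divmod(linear_thread_id(thread_idx, block_dim), WARP_SIZE)
```

For example, in a 16 x 16 x 1 block, thread (5, 2, 0) has linear index 37, so `warp_and_lane((5, 2, 0), (16, 16, 1))` gives warp 1, lane 5.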

• Single Stepping
– Advances all GPU hardware threads within same warp
– Stepping over a __syncthreads() call advances all threads within the block
• Advancing more than just one warp
– “Run To” a selected line number in the source pane
– Set a breakpoint and “Continue” the process
• Halt
– Stops all the host and device threads
[Figure: threads t0 … t31 form one warp and t32 … t63 the next; a warp is a group of 32 threads with the same program counter (PC)]
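The stepping rules on this slide can be modelled in a few lines. The sketch below is a simplified Python illustration with a hypothetical function name; it works on linear thread indices within one block and ignores divergence and inactive lanes, which a real debugger has to account for.

```python
WARP_SIZE = 32  # threads per warp, as on the slide


def threads_advanced(focus, block_size, over_syncthreads=False):
    """Linear thread indices in the block that move when `focus` is stepped."""
    if not 0 <= focus < block_size:
        raise ValueError("focus thread is not in the block")
    if over_syncthreads:
        # Stepping over __syncthreads() advances every thread in the block.
        return range(block_size)
    # An ordinary single step advances only the focus thread's warp.
    start = (focus // WARP_SIZE) * WARP_SIZE
    return range(start, min(start + WARP_SIZE, block_size))
```

With a 64-thread block and focus thread 40, a single step moves threads 32 to 63, while stepping over a barrier moves all 64.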

• Displaying CUDA device properties
– “Tools” - “CUDA Devices”
– Helps map between logical & physical coordinates
• PCs across SMs, warps, lanes
– valid, active, divergent
[Figure: program counter (PC) within each warp]

• Displaying GPU data
– “Dive” into variable or watch “Type” in “Expression List”
– Device memory spaces: “@” notation
Storage Qualifier | Meaning of address
@global | Offset within global storage
@shared | Offset within shared storage
@local | Offset within local storage
@register | PTX register name
@generic | Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant | Offset within constant storage
@parameter | Offset within parameter storage (TV built-in type)

• Checking GPU memory
– Enable “CUDA Memory checking” during startup or in the “Debug” menu
– Detects global memory addressing violations and misaligned memory accesses
• Further features
– Multi-device support
– Host-pinned memory support
– MPI-CUDA applications
Note: Recent cuda-memcheck versions are also able to detect race conditions:
cuda-memcheck --tool racecheck <prog>

The Importance of OpenMP
• Programming models are changing to accommodate changes in system architectures
• Higher degree of on-node parallelism: many-core CPUs and/or GPUs
• Hybrid programming models: MPI+X, where OpenMP is an important X
– MPI across the nodes
– OpenMP shared memory parallelism across the cores in a node
• Why use OpenMP?
– The most widely used standard for SMP systems, implemented by many vendors
– Supports the Fortran, C, and C++ languages
– Relatively small and simple specification, and supports incremental parallelism
– OpenMP research keeps it up to date with the latest hardware developments
– OpenMP 4 allows targeting GPUs
• We see momentum building around OpenMP

OpenMP Debugging Challenges
• Programmers will attempt to exploit MPI+OpenMP hybrid parallelism
• Porting existing large applications from MPI to a hybrid model is nontrivial
and arduous, and having GPUs in the mix makes it even more challenging
• Programming errors such as memory corruption, logic errors and
concurrency bugs are inevitable
• Bottom line
– MPI+OpenMP+GPUs will present programmers with unprecedented
debugging challenges
– They need good debugging tools for MPI+OpenMP+GPUs

Debugging OpenMP Applications
The following are some features that TotalView supports:
• Source-level debugging of the original OpenMP code.
• The ability to plant breakpoints throughout the OpenMP code, including lines that are executed in parallel.
• Visibility of OpenMP worker threads.
• Access to SHARED and PRIVATE variables in OpenMP PARALLEL code.
• Access to OMP THREADPRIVATE data in code compiled by supported compilers.

Sample OpenMP Debugging Session
[Screenshot: debugging session with callouts for the OpenMP worker threads and local variables]

OpenMP code high and low level
• Intention is expressed in the OpenMP code
– Serial-correct code with OMP directives expressing parallelism
– Higher level expression of the ideas
• Compiler can create either serial or parallel executable programs from this
source
• A parallel executable includes both the program logic and a runtime
– Teams of threads on the device, the host or both
– Outlined routines
– Runtime calls to dispatch work to worker threads
– Work created on thread A may be executed on threads M-N
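The gap between the high-level source and the low-level executable can be illustrated with a rough Python analogy. Nothing here is OpenMP itself: the function names are invented, and a thread pool stands in for the OpenMP runtime. The serial function plays the role of the serial-correct source; the second version shows the loop body lifted into an outlined routine that a team of worker threads runs on separate index ranges.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_sum_of_squares(n):
    """The 'source-level' view: one loop, correct when run serially."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def outlined_region(lo, hi):
    """The loop body lifted into its own routine; one worker runs [lo, hi)."""
    partial = 0
    for i in range(lo, hi):
        partial += i * i
    return partial


def parallel_sum_of_squares(n, num_threads=4):
    """The 'executable' view: split the range and dispatch it to a team."""
    chunk = max(1, (n + num_threads - 1) // num_threads)
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as team:
        partials = team.map(lambda b: outlined_region(*b), bounds)
        return sum(partials)  # reduction of the per-thread partial results
```

A breakpoint on the loop body now stops inside `outlined_region` on a worker thread, whose call stack no longer passes through the function the programmer wrote. That is the mismatch a debugger has to hide.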

What’s Needed?
• Debugging and performance analysis support from OpenMP implementations
• “OMPT and OMPD: OpenMP Tools Application Programming Interfaces for
Performance Analysis and Debugging” technical report (TR)
– First TR combined OMPT and OMPD in one document
– OMPD was redacted to allow OMPT to progress
• “TR2: OMPT: An OpenMP Tools Application Programming Interface for
Performance Analysis”
– Accepted by the OpenMP ARB (March 2014)
– OMPT is an API for first-party performance tools
• It is now time to circle back and finish OMPD!
– OMPD is similar to OMPT in its functionality, but…
– OMPD is an API for third-party debugging tools

What Does OMPT/OMPD Do?
• OMPT
– Enable performance tools to gather execution program/runtime costs
– Allow construction of low-overhead performance tools
– Allow logical stack unwinding (to handle outlined parallel regions)
– Provide the state of a thread at any point in time (e.g., idle, work, wait)
– Asynchronous signal safe
• OMPD
– Enable debugging tools to inspect the state of a live process or core file
– Third-party versions of the OpenMP runtime inquiry functions
– Third-party versions of the OMPT inquiry functions
– Intercept the beginning/end of parallel/task regions (e.g., stepping in/out)
– Enable the debugger to construct a “global view” of the process

How is OMPD Structured?
• Based on a commonly used idiom
– pthread thread_db, MPI MQD, MPI Handles, and others
– The OMPD DLL is “paired” with the OpenMP runtime library
• The debugger
– Attaches to the target OpenMP application
– Loads the OMPD DLL (e.g., via dlopen())
– Registers callbacks in the OMPD DLL
– Makes “requests” into the OMPD DLL to query runtime state
• The OMPD DLL
– Makes callbacks into the debugger (lookup symbols, read/write
memory, etc.)
– Returns the result to the debugger
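The division of labour above can be modelled in a few lines. This is a toy Python model of the idiom, not the OMPD API: the class names, the `rt_num_threads` and `rt_thread_states` symbols, and the addresses are all invented for illustration. The point it shows is that the plugin knows the runtime's data layout but reaches the target only through callbacks the debugger registered.

```python
class FakeTarget:
    """Stands in for the stopped application: a symbol table plus memory."""

    def __init__(self):
        self.symbols = {"rt_num_threads": 0x1000, "rt_thread_states": 0x2000}
        self.memory = {0x1000: 3, 0x2000: ["work", "wait", "idle", "unused"]}


class Debugger:
    """Owns all access to the target and exposes it as callbacks."""

    def __init__(self, target):
        self._target = target

    def lookup_symbol(self, name):
        return self._target.symbols[name]

    def read_memory(self, address):
        return self._target.memory[address]


class RuntimePlugin:
    """Plays the role of the DLL paired with the runtime library."""

    def register_callbacks(self, lookup_symbol, read_memory):
        self._lookup = lookup_symbol
        self._read = read_memory

    def thread_states(self):
        """Answer a debugger request using only the registered callbacks."""
        count = self._read(self._lookup("rt_num_threads"))
        states = self._read(self._lookup("rt_thread_states"))
        return states[:count]


debugger = Debugger(FakeTarget())           # debugger attaches to the target
plugin = RuntimePlugin()                    # debugger loads the plugin
plugin.register_callbacks(debugger.lookup_symbol, debugger.read_memory)
states = plugin.thread_states()             # request, callbacks, result
```

Keeping the layout knowledge in the plugin means the debugger does not need to change when the runtime's internal data structures do, provided the paired plugin is updated with it.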

OMPD “In Action”
[Diagram: the debugger, in its own address space, attaches to the application process, whose address space contains the OpenMP Runtime Library (RTL). The OMPD DLL is loaded into the debugger and callbacks are registered.]
1. Request OpenMP state (request types)
• Handles for threads, parallel regions, tasks
• Parent / child relationships
• State of handles (wait, work, idle)
2. Request symbols and address information (callback functions, callback ops)
• Lookup symbols in the target process
• Read/write target address spaces
• Support for GPUs
3. Result

OMPD Status
• Collaboration between LLNL and Rogue Wave Software (TotalView)
– LLNL: Dong Ahn, Ignacio Laguna, Joachim Protze, Martin Schulz
– RWS: Ariel Burton, John DelSignore
• Resurrect OMPD with the ultimate goal of having it accepted by the ARB
– Fix the current OMPD spec
– Implement OMPD DLL in the Intel OpenMP runtime
– Implement OMPD-based features in TotalView
• IWOMP Paper (International Workshop on OpenMP)

TotalView 8.15
New Features
• Scalable Infrastructure
– Faster start up on Linux
– Scales to O(100,000) processes & O(1,000,000) threads
• Updated CUDA support
– CUDA 7.0
• Support updates including:
– Clang 3.5
– Intel 15.0
– MPT 2.12
– SLES 12, Fedora 21

TotalView’s Scalability Strategy
• TotalView uses an “MRNet tree” of servers for multicast and reduction
[Diagram: TV Client at the top, two MRNet CPs below it, four TV Servers at the bottom, with “Smarts” labels in the tree]
• Remain lightweight in the backend!
• Push debugger “smarts” to the backend, not the whole debugger!
• Use classic optimization techniques too: caching, hoisting invariants, etc.
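A tree of this shape can be sketched as nested dictionaries. The Python below is only an illustration of multicast and reduction over a tree; it does not use MRNet, and the node layout, function names and process states are assumptions made for the example. Commands fan out from the root to every server, and per-process states are merged on the way back up so the client sees one summary instead of one message per process.

```python
from collections import Counter


def multicast(node, command):
    """Send one command down the tree; return how many servers received it."""
    if "children" not in node:  # leaf: a server
        node.setdefault("inbox", []).append(command)
        return 1
    return sum(multicast(child, command) for child in node["children"])


def reduce_states(node):
    """Merge per-process state counts from the servers up to the root."""
    if "children" not in node:
        return Counter(node["states"])
    total = Counter()
    for child in node["children"]:
        total.update(reduce_states(child))  # aggregation at interior nodes
    return total


# Hypothetical tree matching the diagram: client, two interior nodes, four servers.
example_tree = {"children": [
    {"children": [{"states": ["stopped", "running"]},
                  {"states": ["stopped", "stopped"]}]},
    {"children": [{"states": ["running", "running"]},
                  {"states": ["stopped", "running"]}]},
]}

servers_reached = multicast(example_tree, "halt")
summary = reduce_states(example_tree)
```

Because each interior node forwards a single merged result, the amount of data arriving at the client depends on the number of distinct states, not on the number of processes.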

Linux Start up Performance in TV 8.15.4
5x faster (600s / 120s) at 16k between 8.14.1 and 8.15.4.
Note that we switched to MRNet by default in 8.15.0

BG Start up Performance in TV 8.15.4
6.4x faster (180s / 28s) at 16k between 8.14.1 and 8.15.4.
Note that we switched to MRNet by default in 8.15.0

TotalView debugs 786,432 cores.
Climb with Rogue Wave towards exascale.

Some more details on the 786,432 core test
• The test was performed on 48 racks of Sequoia
• The test code
– Implements a Jacobi Linear Equation Solver
– The test code is a hybrid MPI + OpenMP code
– 16 threads per process, one process per node
• The test operations
– Start up
– Setting breakpoints / removing breakpoints
– Single stepping all threads
• Tests performed at a variety of scales to understand scalability
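The test code itself is not shown in the slides. For readers unfamiliar with the method it implements, here is a minimal serial Jacobi iteration in Python; the function name, the fixed iteration count and the dense list-of-lists matrix are choices made for this sketch only. Every component of the new iterate depends only on the previous iterate, so the inner loop over rows can be split across threads and processes, which makes the method a natural fit for a hybrid MPI + OpenMP test.

```python
def jacobi(a, b, iterations=50):
    """Approximate the solution of a x = b by Jacobi iteration.

    Convergence is guaranteed when `a` is strictly diagonally dominant.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = []
        for i in range(n):  # rows are independent within one sweep
            off_diag = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x_new.append((b[i] - off_diag) / a[i][i])
        x = x_new  # swap only after the whole sweep is done
    return x
```

For the system 4x + y = 6, x + 3y = 7, `jacobi([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])` converges to x = 1, y = 2.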

TotalView’s Memory Efficiency
• TotalView is lightweight in the back-end (server)
• Servers don’t “steal” memory from the application
• Each server is a multi-process debugger agent
– One server can debug thousands of processes
– Not a conglomeration of single process debuggers
– TotalView’s architecture provides flexibility (e.g., P/SVR)
– No artificial limits to accommodate the debugger (e.g., BG/Q 1 P/CN)
• Symbols are read, stored, and shared in the front-end (client)
• Example: LLNL APP ADB, 920 shlibs, Linux, 64 P, 4 CN, 16 P/CN, 1 SVR/CN
Process | VSZ (largest, MB) | RSS (largest, MB) | Where
TV Client | 4,469 | 3,998 | Front End ONLY
MRNet CP | 497 | 4 | Compute Nodes
TV Server | 304 | 53 | Compute Nodes

Future plans
• Contact sales@roguewave.com with any inquiries about our future plans
for the TotalView product.

Thanks!
• Visit the website
– http://www.roguewave.com/products/totalview.aspx
– Documentation
– Sign up for an evaluation
– Contact customer support & post on the user forum
