Join us and learn more about how the Dell PowerEdge C4140 rack server, powered by four NVIDIA V100s, the world’s most powerful GPU, addresses training and inference for the most demanding HPC, data visualization and AI workloads. This enables organizations to take advantage of the convergence of HPC and data analytics and realize advancements in areas including fraud detection, image processing, financial investment analysis and personalized medicine.
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Dell and NVIDIA for Your AI workloads in the Data Center
1. Helge Gose, NVIDIA Solution Architect, June 7, 2018
DELL AND NVIDIA FOR YOUR AI WORKLOADS IN THE DATA CENTER
AGENDA
What is Deep Learning?
Volta and NVLINK
Inference to Training – Dell solutions
THE TIME HAS COME FOR GPU COMPUTING
[Chart, 1980–2020, log scale 10^3–10^7: single-threaded CPU performance grew ~1.5X per year, slowing to ~1.1X per year, while GPU-accelerated computing continues to scale]
DEEP LEARNING
IS SWEEPING ACROSS INDUSTRIES
INTERNET SERVICES: image/video classification, speech recognition, natural language processing
MEDICINE: cancer cell detection, diabetic grading, drug discovery
MEDIA & ENTERTAINMENT: video captioning, content-based search, real-time translation
SECURITY & DEFENSE: face recognition, video surveillance, cyber security
AUTONOMOUS MACHINES: pedestrian detection, lane tracking, traffic sign recognition
A NEW COMPUTING MODEL
Algorithms that learn from examples
TRADITIONAL APPROACH (MACHINE LEARNING)
Requires domain experts
Time-consuming experimentation
Custom algorithms
Not scalable to new problems
DEEP NEURAL NETWORKS (DEEP LEARNING)
Learn from data
Easy to extend
Accelerated with GPUs
[Diagram: both approaches map an input image to the labels Car / Vehicle / Coupe]
WHAT PROBLEM ARE YOU SOLVING?
Defining the AI/DL Task
BUSINESS QUESTION → AI/DL TASK, with example outputs (HEALTHCARE / RETAIL / FINANCE):
Is “it” present or not? → Detection (Cancer Detection / Targeted Ads / Cybersecurity)
What type of thing is “it”? → Classification (Image Classification / Basket Analysis / Credit Scoring)
To what extent is “it” present? → Segmentation (Tumor Size/Shape Analysis / Build 360º Customer View / Credit Risk Analysis)
What is the likely outcome? → Prediction (Survivability Prediction / Sentiment & Behavior Recognition / Fraud Detection)
What will likely satisfy the objective? → Recommendations (Therapy Recommendation / Recommendation Engine / Algorithmic Trading)
INPUTS: text, data, images, audio, video
TESLA V100
WORLD’S MOST ADVANCED DATA CENTER GPU
5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
16GB/ 32GB HBM2 @ 900GB/s | 300GB/s NVLink
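The 125 Tensor TFLOPS figure follows from the core count and clock: each Tensor core performs a 4x4x4 matrix multiply-accumulate (64 FMAs, i.e. 128 floating-point operations) per clock. A quick back-of-the-envelope check, assuming the SXM2 V100's ~1530 MHz boost clock:

```python
# Back-of-the-envelope check of the Tesla V100 Tensor TFLOPS figure.
# Assumption: ~1530 MHz boost clock; each Tensor core performs a 4x4x4
# matrix FMA per clock = 64 multiply-accumulates = 128 floating-point ops.
tensor_cores = 640
flops_per_core_per_clock = 128
boost_clock_hz = 1.53e9

tensor_tflops = tensor_cores * flops_per_core_per_clock * boost_clock_hz / 1e12
print(f"Per-GPU Tensor TFLOPS: {tensor_tflops:.0f}")    # ~125

# Four V100s in a 1U C4140 give the "up to 500 TFLOPS/U" system figure.
print(f"4-GPU system: {4 * tensor_tflops:.0f} TFLOPS")  # ~501
```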
REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance
3X Reduction in Time to Train Over P100
[Chart: relative time to train, LSTM-based Neural Machine Translation, 13 epochs, German→English, WMT15 subset; CPU = 2x Xeon E5-2699 v4: 15 days, P100: 18 hours, V100: 6 hours]
Over 80X DL Training Performance in 3 Years
[Chart: GoogleNet training speedup vs 1x K80 + cuDNN2: 1x K80/cuDNN2 (Q1 2015) = 1x, 4x M40/cuDNN3 (Q3 2015), 8x P100/cuDNN6 (Q2 2016), 8x V100/cuDNN7 (Q2 2017) = over 80x]
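The speedups behind those training times are simple ratios; a quick sanity check using the chart's figures:

```python
# Sanity-check the NMT training-time speedups quoted in the chart.
cpu_hours = 15 * 24      # 2x Xeon E5-2699 v4: 15 days
p100_hours = 18
v100_hours = 6

print(f"P100 vs CPU:  {cpu_hours / p100_hours:.0f}x")   # 20x
print(f"V100 vs P100: {p100_hours / v100_hours:.0f}x")  # 3x, the claimed reduction
print(f"V100 vs CPU:  {cpu_hours / v100_hours:.0f}x")   # 60x
```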
END-TO-END PRODUCT FAMILY
TRAINING
Desktop: TITAN V
Data center: Tesla V100, Dell PowerEdge C4140, DGX Station
INFERENCE
Data center: Tesla V100, Tesla P4
Embedded: Jetson (JetPack SDK)
Automotive: Drive PX (DriveWorks SDK)
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK Accelerates Every Major Framework
COMPUTER VISION
OBJECT DETECTION IMAGE CLASSIFICATION
SPEECH & AUDIO
VOICE RECOGNITION LANGUAGE TRANSLATION
NATURAL LANGUAGE PROCESSING
RECOMMENDATION ENGINES SENTIMENT ANALYSIS
DEEP LEARNING FRAMEWORKS
Mocha.jl
NVIDIA DEEP LEARNING SDK
developer.nvidia.com/deep-learning-software
PowerEdge C4140 Server
Faster time to insights with ultra-dense accelerator optimized server platform
* Based on Dell internal analyses and Principled Technologies Report - Jan 2015.
TARGETED WORKLOADS
THE BEDROCK OF THE MODERN DATACENTER
• Machine Learning and Deep Learning
• Technical Computing (Research / Life Sciences)
• Low-latency, high-performance applications (FSI)
Key Capabilities
• Unthrottled performance and superior thermal efficiency with patent-pending interleaved GPU system design*
• No-compromise (CPU + GPU) acceleration technology up to 500 TFLOPS/U+ using the NVIDIA® Tesla™ V100 with NVLink™
• 2.4KW PSUs help future-proof for next-generation GPUs
• Simplified deployment with pre-configured Ready Bundles
Xeon Scalable Processors + Tesla GPUs
+Based on V100 NVLink Tensor Core performance
C4140 – Now with NVIDIA® Volta™ and NVLink™
Faster time to insights with ultra-dense accelerator-optimized server platform
NVIDIA® Volta GPU has over 21 billion transistors and 640 Tensor cores to deliver 100+ TFLOPS
NVIDIA® NVLink™ is a high-bandwidth interconnect enabling ultra-fast communication between CPU and GPU and between GPUs
Volta V100 delivers a 2.6X average speedup on DL workloads over Pascal P100
Delivers 44X more throughput than CPU nodes, with lower latency
NVLink is 5X – 10X faster than the traditional PCIe Gen3 interconnect
Volta-optimized software for important HPC applications
*Source: NVIDIA® Volta benchmarks for multiple applications 2017
C4140 and NVLink™
NVLink Topology
NVLink runs at 25Gbps per lane versus PCIe Gen3 at 8Gbps
Increase in performance due to higher clock speed: ~7%
Increase in performance from peer-to-peer GPU communication: 7%+
PCIe Topology
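Those per-lane rates translate into the aggregate bandwidth figures quoted elsewhere in this deck. A rough calculation, assuming NVLink 2.0 on V100 (6 links, 8 lanes per link per direction) and PCIe Gen3 x16 with 128b/130b encoding:

```python
# Rough aggregate-bandwidth comparison: NVLink (V100) vs PCIe Gen3 x16.
# Assumptions: NVLink 2.0 = 25 Gbps/lane, 8 lanes per link per direction,
# 6 links per V100; PCIe Gen3 = 8 GT/s per lane, x16, 128b/130b encoding.
nvlink_link_gbs = 25e9 * 8 / 8e9             # GB/s per link, per direction = 25
nvlink_total_gbs = nvlink_link_gbs * 6 * 2   # 6 links, both directions = 300
pcie_gbs = 8e9 * (128 / 130) * 16 / 8e9 * 2  # x16, both directions ~= 31.5

print(f"NVLink aggregate: {nvlink_total_gbs:.0f} GB/s")  # matches 300GB/s spec
print(f"PCIe Gen3 x16:    {pcie_gbs:.1f} GB/s")
print(f"Ratio: {nvlink_total_gbs / pcie_gbs:.1f}x")      # within the 5X-10X claim
```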
Towers, Racks, Modular
Extreme Scale
Infrastructure
INDUSTRY'S #1
Server Portfolio
PowerEdge
*Based on units sold (tie). IDC Worldwide Quarterly Server Tracker, Q1-Q3, 2016.
OpenManage Enterprise – Intelligent Automation Systems Management
Now Introducing C4140
ACCELERATE YOUR BUSINESS ON
PowerEdge
ADAPT AND SCALE to meet your dynamic business needs by leveraging Scalable Business Architecture
FREE UP SKILLED RESOURCES and focus on core business with Intelligent Automation
PROTECT YOUR CUSTOMERS and your business robustly with Integrated Security
First, let’s start with some definitions…
AI is a broad field of study focused on using computers to do things that require human-level intelligence. It’s been around since the 1950s, playing games like tic-tac-toe and checkers, and inspiring scary sci-fi movies. But it was limited in practical applications…
ML is an approach to AI that uses statistical techniques to construct a model from observed data. It generally relies on human-defined classifiers or “feature extractors,” which can be as simple as a linear regression or the slightly more complicated “Bag of Words” analysis technique that made email spam filters possible.
This was really handy in the late 1980s, when lots of email started showing up in your inbox.
But then we invented smartphones, webcams, social media services, and all kinds of sensors that generate huge mountains of data, creating the new challenge of understanding and extracting insights from all this “big data”.
DL is a ML technique that automates the creation of feature extractors, using large amounts of data to train complex “deep neural networks”.
DNNs are capable of achieving human-level accuracy for many tasks, but require tremendous computational power to train
Several years ago, researchers started applying DNNs in a variety of areas and reporting amazing results…
==============
Ref. https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering
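The Bag-of-Words spam filtering referenced above can be sketched as a tiny Naive Bayes classifier. This is an illustrative toy with made-up training messages, not a production filter:

```python
# Toy Naive Bayes spam filter over bag-of-words features (stdlib only).
from collections import Counter
import math

def train(messages):
    """messages: list of (text, label) pairs, label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(totals[label] / sum(totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("lunch plans this week", "ham"),
]
counts, totals = train(training)
print(classify("claim your free money", counts, totals))    # spam
print(classify("agenda for lunch meeting", counts, totals)) # ham
```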
Now that we’ve seen a few examples of applications that benefit from Deep Learning, and have a basic understanding of why we are seeing such rapid adoption across a wide range of use cases…
Let’s explore how deep learning works by comparing it with earlier approaches to machine learning.
Consider the traditional approach to performing computer vision tasks such as image classification or object detection.
A domain expert trained in computer vision comes up with a set of rules to extract features from the image – such as edges, corners, color information, … maybe even counting the number of wheels, headlights, etc. The expert must figure out which features in the data are important, implement these rules by hand-writing custom software routines, and figure out how all the rules should be connected in relation to each other to perform the task. As you can imagine, this can be tedious and require lots of trial and error. And if the data changes to incorporate different types of objects or environments, then it’s back to the drawing board. All of this results in tons of source code to write, debug and maintain.
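To make that contrast concrete, here is what one such hand-written rule might look like: a made-up feature extractor that counts strong horizontal-gradient pixels as an "edge" feature. The rule and its threshold are arbitrary expert choices, which is exactly the hand-tuning described above:

```python
# A hand-crafted "feature extractor" in the traditional computer-vision style:
# count pixels whose horizontal gradient exceeds a hand-picked threshold.
# Both the rule and the threshold are choices an expert would have to tune.
def edge_feature(image, threshold=50):
    """image: 2D list of grayscale values (0-255). Returns edge-pixel count."""
    edges = 0
    for row in image:
        for x in range(len(row) - 1):
            if abs(row[x + 1] - row[x]) > threshold:
                edges += 1
    return edges

# A tiny 3x4 "image" with one sharp vertical boundary in each row.
image = [
    [10, 10, 200, 200],
    [12, 11, 198, 201],
    [ 9, 10, 205, 199],
]
print(edge_feature(image))  # 3: one strong horizontal gradient per row
```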
[next]
In contrast, when you use the deep learning approach, the neural network model learns the rules for performing the task directly from the data. No hand-written custom feature extractors are required. You simply feed the deep neural network thousands of examples, which serve as the “experience” from which it learns how to perform the task automatically.
The advantages of using deep learning include the ability to extend and adapt to new data simply by retraining the network, the immense performance improvements from using NVIDIA GPU accelerators, and the opportunity for more people to develop AI applications.
As a result, the deep learning approach can be more accurate, with significantly less human effort.
It’s worth noting that some people who are comfortable using previous approaches to machine learning can find it challenging to apply deep learning, since many of the instincts and assumptions from their own hard-won experience need to change in order to effectively develop deep learning applications.
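The "learn the rules directly from the data" idea can be reduced to a minimal toy: a single-neuron classifier (logistic regression) that learns a decision rule from labeled 2-D points by gradient descent, with no hand-written rules. The dataset and hyperparameters here are invented for illustration:

```python
# Minimal "learning from examples": a single neuron (logistic regression)
# trained by gradient descent on a toy, linearly separable 2-D dataset.
import math

# Toy labeled examples: (x1, x2) -> class 0 or 1 (invented for illustration).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
        ((2, 2), 1), ((2, 3), 1), ((3, 2), 1)]

w1 = w2 = b = 0.0
lr = 0.5
for _ in range(1000):                                      # training loop
    for (x1, x2), y in data:
        p = 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))   # predicted prob
        err = p - y                                        # log-loss gradient
        w1 -= lr * err * x1
        w2 -= lr * err * x2
        b -= lr * err

predict = lambda x1, x2: int(w1 * x1 + w2 * x2 + b > 0)
print([predict(x1, x2) for (x1, x2), _ in data])  # [0, 0, 0, 1, 1, 1]
```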
And deep learning works not only for computer vision tasks such as image classification, object detection, and image segmentation… But also for non-visual tasks such as fraud detection, speech recognition, behavior prediction, and product recommendations.
It can be helpful to think about deep learning as a way to map samples from an input domain to an output domain.
The input domain can be text data such as log files or financial data, images of pretty much anything, audio or other signals, or video streams (which are really just a sequence of images with synchronized audio). It can even be three-dimensional images or datasets collected from medical imaging devices, geophysical analyses, cosmological models, or molecular dynamics simulations.
The output domain is determined by the question you want to ask of the input data, and the question itself indicates the type of deep learning task you need to perform in order to map the input domain to the output domain.
This table shows a sampling of use cases where deep learning can be applied in healthcare.
For example:
If the question requires a Yes/No answer telling you whether something is present, the task is Detection.
If the question requires an answer describing what types of things are in each input, the task is Classification.
If the question requires a shape or volume as an answer, the task is Segmentation.
And so on…
Depending on the application, you may need to use a combination of tasks to achieve more sophisticated outputs.
For example, to automatically label all the faces in your family photos you’d need to first detect whether and where there are faces in the picture, and then apply facial recognition (which is a form of classification) to determine the name of the person associated with each face.
And for automatic language translation you could use speech-to-text (classification) followed by translation of text in one language to text in another language (prediction) and then speech synthesis (prediction).
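A pipeline like that is just function composition over task outputs. The sketch below wires up stand-in stage functions; the names and canned outputs are invented, and a real system would invoke a trained network at each stage:

```python
# Sketch of composing DL tasks into a pipeline, using stand-in stubs.
# Function names and canned outputs are purely illustrative.

def speech_to_text(audio):            # classification stage (stub)
    return "hello world"

def translate(text, target="de"):     # prediction stage (stub)
    return {"hello world": "hallo welt"}.get(text, text)

def synthesize(text):                 # prediction stage (stub)
    return f"<audio:{text}>"

def translation_pipeline(audio, target="de"):
    """speech-to-text -> translation -> speech synthesis."""
    return synthesize(translate(speech_to_text(audio), target))

print(translation_pipeline(b"raw-audio-bytes"))  # <audio:hallo welt>
```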
Check out the next module in this series to learn how these tasks are actually built using input data and deep learning frameworks to train deep neural networks.
======== vet this later ========
Another example you may have experienced yourself is the feature in Google Maps where the images captured in Street View are used to describe the type of business (classification), find the little sign near the door (detection), and then automatically read the sign (classification+prediction) so the days and times the business is open can be published online.
===========================
There are a wide range of GPU-accelerated platforms you can use to accelerate deep learning training and inference application workloads.
If you want a fully integrated solution, we recommend the DGX-1, a supercomputer in a box that delivers the performance equivalent of hundreds of CPU-only servers using 8 world-class Tesla GPUs, or its little brother the DGX Station, which is powered by 4 Tesla GPUs and runs whisper-quiet next to your desk.
If you just want to get started on a prototype using your existing workstation, the TITAN V includes new Tensor Cores designed specifically for deep learning that deliver up to 12x higher peak TFLOPS for training.
In the datacenter, the Tesla P100 and V100 with NVLink Technology deliver strong scaling support for mixed workloads across both HPC applications and Deep Learning training & inference applications.
And for scale-out inference workloads the Tesla P4 (GP104) supports high efficiency (perf/watt) low-latency performance.
And, of course, if you need to deploy deep learning applications in automotive or embedded applications, NVIDIA offers the DrivePX and Jetson platforms.
======================
NVIDIA also provides a wide range of GPU-accelerated platforms you can use to accelerate deep learning training and inference application workloads.
If you want a fully integrated solution, we recommend the DGX-1, a supercomputer in a box that delivers the performance equivalent of 250 CPU-only servers, or its little brother the DGX Station, which runs whisper-quiet next to your desk.
If you just want to get started on a prototype using your existing workstation, the Titan X Pascal supports fast 32-bit floating point (FP32) and 8-bit integer (INT8) performance for deep learning applications.
In the datacenter, the Tesla P100 and V100 with NVLink Technology deliver strong scaling support for mixed workloads across both HPC applications and Deep Learning training & inference workloads (using FP64, FP32, and FP16).
And for scale-out inference workloads the Tesla P4 (GP104) supports high efficiency (perf/watt) low-latency performance with fast FP32 and INT8.
And, of course, if you need to deploy deep learning applications in automotive or embedded applications, NVIDIA offers the DrivePX and Jetson platforms.
======================
What happens after development? NGC is tuned for all of these platforms. Once you have created your AI solution and want to productize it, there is a seamless deployment path: to cloud-based microservices in the form of the NGC TensorRT container, or out to embedded devices. Take these models into JetPack and DriveWorks for robotics, drones, autonomous vehicles, and more.
Key Points –
The C4140 is made possible by superior system engineering from Dell EMC and best-of-breed technologies from our strategic partners: Intel providing the latest Xeon Scalable Processors and NVIDIA providing the latest Tesla GPUs.
While the C4140 is designed for complex technical and cognitive computing workloads, it is targeted at the following markets:
AI / DL / ML / HPC
Life Sciences
Financial Services
Oil and Gas Exploration
Some of the capabilities include:
A no-compromise system design that provides superior speed of up to 500 TFLOPS/U
A patent-pending interleaved GPU design that enables ultra-density with unthrottled performance
Future-proofing the server platform for next-gen GPUs by supporting 2.4 KW PSUs
Will be a critical component of Ready Bundles for ML/DL
Systems Management – iDRAC9, connection view, System lockdown, OpenManage Power Center
Other 14G Features & Benefits – Systems management, Security and Intel performance boost
Key Points –
Highlight how C4140 is better with latest technology from NVIDIA
New Volta V100 is better than Pascal P100 for all HPC applications, ranging from 1.5X – 5X
Importance of ecosystem to drive application adoption and support
Key Points –
C4140 supports 2 key topologies – PCIe & NVLink
More at NVLink - http://www.nvidia.com/object/nvlink.html
Patent-pending interleaved design applies only to the PCIe topology
NVLink is a proprietary technology from NVIDIA that allows direct GPU-to-GPU, peer-to-peer communication.
With PCIe, GPU-to-GPU communication happens only through the PCIe switch
NVLink allows direct GPU-to-GPU communication, resulting in increased performance
NVLink also has a higher clock speed, resulting in higher performance
Imperial College of London
“By choosing Dell PowerEdge C4130 servers, we gained the same amount of processing performance in 4U of rack space as our existing HPC solution, which runs across two full height racks.” - Dr. Peter Vincent, Department of Aeronautics
Texas Advanced Computing Center (TACC): Dell and EMC are the two strategic partners providing the technology that makes up the core of Wrangler. Wrangler uses EMC's DSSD rack-scale flash technology to ensure speed and performance, enabling real-time analytics at scale. Source.
Result: Significantly accelerates data-centered science
Translational Genomics Research Institute (TGen) develops early diagnostics, prognostics, and therapies for cancer, neurological disorders, diabetes, and other complex diseases.
Dell: Servers, storage, networking and infrastructure consulting. Source.
EMC: EMC Isilon scale-out cluster and Ocarina Networks compression and dedupe software, Source.
Result: Researchers can create more targeted treatments at least one week faster
James B. Hunt Jr. Library at North Carolina State University uses an HPC cluster to support large-scale visualization and builds collaborative learning spaces to inspire students. Dell: Servers, storage, networking, workstations. Source.
EMC: Storage technologies, including its Isilon scale-out NAS solutions, to power Hunt Library’s private cloud, virtual desktop, and virtual server infrastructure. Source.
Result: Supports virtualized infrastructure for remote access, easier desktop management, and cost savings
University of Aberdeen centralises its high-performance computing infrastructure, giving scientists the tools they need to drive innovative healthcare research.
Dell: New cluster based on Dell PowerEdge servers, QLogic and Dell Networking switches. Source.
EMC: EMC Celerra Unified Storage, EMC Centera Gen 4x2, EMC File Management Appliance, EMC Celerra Replicator, EMC Data Protection Advisor, EMC PowerPath, EMC RecoverPoint, EMC SnapView, EMC Virtual Provisioning, EMC Unisphere Management Console. Source.
Result: Scientists have resources to drive groundbreaking healthcare research
Max Planck Institute of Molecular Cell Biology and Genetics helps researchers better understand cell division by rendering 3D microtubule data. Dell: Servers, virtualization, storage. Source.
A joint venture between the Chinese Academy of Sciences and Max Planck Gesellschaft, the Partner Institute for Computational Biology (PICB) was established in 2005 and works on the interface between biological theory and modeling. EMC: EMC Isilon NL400, EMC Isilon X200, EMC Isilon SmartPools, EMC Isilon SmartQuotas, EMC Isilon SnapshotIQ. Source.
Results: Researchers gain 3D representations in minutes rather than hours; data processing speeds increased by 10x
Key Takeaway – Dell EMC PowerEdge is the market leader across all types of form factors and has industry leading performance across multiple verticals.