SlideShare une entreprise Scribd logo
1  sur  18
CUDA Programming
continued
ITCS 4145/5145 Nov 24, 2010 © Barry Wilkinson
Revised
2
Timing GPU Execution
Can use CUDA “events” – create two events and compute
the time between them:
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start); // create event objects
cudaEventCreate(&stop);
cudaEventRecord(start, 0); // Record start event
.
.
.
cudaEventRecord(stop, 0); // record end event
cudaEventSynchronize(stop); // wait work preceding to complete
cudaEventRecord(stop,0)
cudaEventElapsedTime(&elapsedTime, start, stop);
//compute elapsed time between events
cudaEventDestroy(start); //destroy start event
cudaEventDestroy(stop);); //destroy stop event
3
4
5
6
7
8
Host Synchronization
Kernels
•
Control returned to CPU immediately (asynchronous,
non-blocking)
•
Kernel starts after all previous CUDA calls completed
cudaMemcpy
•
Returns after copy complete (synchronous)
•
Copy starts after all previous CUDA calls completed
9
CUDA Synchronization Routines
Host
cudaThreadSynchronize()
•
Blocks until all previous CUDA calls complete
GPU
void __syncthreads()
•
Synchronizes all threads in a block
•
Barrier – no thread can pass until all threads in block
reach it.
•
All threads must reach __syncthread in thread block.
10
GPU Atomic Operations
Performs a read-modify-write atomic operation on one
word residing in global or shared memory.
Associative operations on signed/unsigned integers,
add, sub, min, max, and, or, xor, increment,
decrement, exchange, compare and swap.
Requires GPU with compute capability 1.1+
(Shared memory operations and 64-bit words require higher capability)
coit-grid06 Tesla C2050 has compute capability 2.0
See http://www.nvidia.com/object/cuda_gpus.html for GPU compute capabilities
11
int atomicAdd(int* address, int val);
reads old located at address address in global or shared
memory, computes (old + val), and stores result back to
memory at same address.
These three operations (read, compute, and write) are
performed in one atomic transaction.*
Function returns old.
Atomic Operation Example
* Once stated, it continues to completion without being able to be interrupted by other
processors. Other processors cannot read or write to memory location once atomic
operation starts. Mechanism implemented in hardware.
12
Other operations
int atomicSub(int* address, int val);
int atomicExch(int* address, int val);
int atomicMin(int* address, int val);
int atomicMax(int* address, int val);
unsigned int atomicInc(unsigned int* address, unsigned int val);
unsigned int atomicDec(unsigned int* address, unsigned int val);
int atomicCAS(int* address, int compare, int val); //compare and swap
int atomicAnd(int* address, int val);
int atomicOr(int* address, int val);
int atomicXor(int* address, int val);
Source: NVIDIA CUDA C Programming Guide, version 3.2, 11/9/2010
13
int atomicCAS(int* address, int compare, int val);
reads the word old located at address address in global or shared
memory, and compares old with compare. If they are the same, it
set old to val (stores val at address address), i.e.:
if (old == compare) old = val; // else old = old
The three operations (read, compute, and write) are performed in
one atomic transaction.
The function returns the original value of old.
Also unsigned and unsigned long long int versions.
Compare and Swap
(also called compare and exchange)
14
__device__ int lock=0; // unlocked
__global__ void kernel(...) {
...
do {} while(atomicCAS(&lock,0,1)); // if lock = 0 set to1
// and continue
... // critical section
lock = 0; // free lock
}
Coding Critical Sections with
Locks
15
Memory Fences
Threads may see the effects of a series of writes to memory
executed by another thread in different orders. To enforce ordering:
void __threadfence_block();
waits until all global and shared memory accesses made by the
calling thread prior to __threadfence_block() are visible to all
threads in the thread block.
Other routines:
void __threadfence();
void __threadfence_system();
16
Writes to device memory not guaranteed in any order,
so global writes may not have completed by the time the
lock is unlocked
__global__ void kernel(...) {
...
do {} while(atomicCAS(&lock,0,1));
... // criticial section
__threadfence(); // wait for writes to finish
lock = 0;
}
Critical sections with memory
operations
17
Error reporting
All CUDA calls (except kernel launches) return error code of type
cudaError_t
cudaError_t cudaGetLastError(void)
Returns code for the last error
Can be used to get error from kernel execution.
Char* cudaGetErroprString(cudaError_t code)
Returns a null-terminated character string describing error
Example
print(“%sn”,cudaGetErrorString(cudaGetLastError());
Questions

Contenu connexe

Tendances

Systemd evolution revolution_regression
Systemd evolution revolution_regressionSystemd evolution revolution_regression
Systemd evolution revolution_regressionSusant Sahani
 
Systemd mlug-20140614
Systemd mlug-20140614Systemd mlug-20140614
Systemd mlug-20140614Susant Sahani
 
Effective java - concurrency
Effective java - concurrencyEffective java - concurrency
Effective java - concurrencyfeng lee
 
Effective service and resource management with systemd
Effective service and resource management with systemdEffective service and resource management with systemd
Effective service and resource management with systemdDavid Timothy Strauss
 
Actor Concurrency
Actor ConcurrencyActor Concurrency
Actor ConcurrencyAlex Miller
 
Java Concurrency Gotchas
Java Concurrency GotchasJava Concurrency Gotchas
Java Concurrency GotchasAlex Miller
 
FreeRTOS Xilinx Vivado: Hello World!
FreeRTOS Xilinx Vivado: Hello World!FreeRTOS Xilinx Vivado: Hello World!
FreeRTOS Xilinx Vivado: Hello World!Vincent Claes
 
Locks (Concurrency)
Locks (Concurrency)Locks (Concurrency)
Locks (Concurrency)Sri Prasanna
 
Introduction to systemd
Introduction to systemdIntroduction to systemd
Introduction to systemdYusaku OGAWA
 
Locally run a FIWARE Lab Instance In another Hypervisors
Locally run a FIWARE Lab Instance In another HypervisorsLocally run a FIWARE Lab Instance In another Hypervisors
Locally run a FIWARE Lab Instance In another HypervisorsJosé Ignacio Carretero Guarde
 
Scale11x lxc talk
Scale11x lxc talkScale11x lxc talk
Scale11x lxc talkdotCloud
 
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...JAX London
 
Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)
Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)
Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)Puppet
 
Memory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native CollectionsMemory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native CollectionsYoshifumi Kawai
 
Let'swift "Concurrency in swift"
Let'swift "Concurrency in swift"Let'swift "Concurrency in swift"
Let'swift "Concurrency in swift"Hyuk Hur
 
Java util concurrent
Java util concurrentJava util concurrent
Java util concurrentRoger Xia
 

Tendances (20)

Systemd evolution revolution_regression
Systemd evolution revolution_regressionSystemd evolution revolution_regression
Systemd evolution revolution_regression
 
Systemd mlug-20140614
Systemd mlug-20140614Systemd mlug-20140614
Systemd mlug-20140614
 
Disruptor
DisruptorDisruptor
Disruptor
 
Effective java - concurrency
Effective java - concurrencyEffective java - concurrency
Effective java - concurrency
 
Effective service and resource management with systemd
Effective service and resource management with systemdEffective service and resource management with systemd
Effective service and resource management with systemd
 
Actor Concurrency
Actor ConcurrencyActor Concurrency
Actor Concurrency
 
Java Concurrency Gotchas
Java Concurrency GotchasJava Concurrency Gotchas
Java Concurrency Gotchas
 
FreeRTOS Xilinx Vivado: Hello World!
FreeRTOS Xilinx Vivado: Hello World!FreeRTOS Xilinx Vivado: Hello World!
FreeRTOS Xilinx Vivado: Hello World!
 
Locks (Concurrency)
Locks (Concurrency)Locks (Concurrency)
Locks (Concurrency)
 
Introduction to systemd
Introduction to systemdIntroduction to systemd
Introduction to systemd
 
Locally run a FIWARE Lab Instance In another Hypervisors
Locally run a FIWARE Lab Instance In another HypervisorsLocally run a FIWARE Lab Instance In another Hypervisors
Locally run a FIWARE Lab Instance In another Hypervisors
 
Scale11x lxc talk
Scale11x lxc talkScale11x lxc talk
Scale11x lxc talk
 
Basic of Systemd
Basic of SystemdBasic of Systemd
Basic of Systemd
 
Node day 2014
Node day 2014Node day 2014
Node day 2014
 
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
 
Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)
Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)
Puppet Camp Chicago 2014: Docker and Puppet: 1+1=3 (Intermediate)
 
Memory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native CollectionsMemory Management of C# with Unity Native Collections
Memory Management of C# with Unity Native Collections
 
Let'swift "Concurrency in swift"
Let'swift "Concurrency in swift"Let'swift "Concurrency in swift"
Let'swift "Concurrency in swift"
 
systemd
systemdsystemd
systemd
 
Java util concurrent
Java util concurrentJava util concurrent
Java util concurrent
 

Similaire à Cuda 2

OS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switchOS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switchDaniel Ben-Zvi
 
Java Concurrency in Practice
Java Concurrency in PracticeJava Concurrency in Practice
Java Concurrency in PracticeAlina Dolgikh
 
LCA13: Common Clk Framework DVFS Roadmap
LCA13: Common Clk Framework DVFS RoadmapLCA13: Common Clk Framework DVFS Roadmap
LCA13: Common Clk Framework DVFS RoadmapLinaro
 
CUDA Deep Dive
CUDA Deep DiveCUDA Deep Dive
CUDA Deep Divekrasul
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/MultitaskingSasha Kravchuk
 
Concurrency Learning From Jdk Source
Concurrency Learning From Jdk SourceConcurrency Learning From Jdk Source
Concurrency Learning From Jdk SourceKaniska Mandal
 
Let's Talk Locks!
Let's Talk Locks!Let's Talk Locks!
Let's Talk Locks!C4Media
 
Android Loaders : Reloaded
Android Loaders : ReloadedAndroid Loaders : Reloaded
Android Loaders : Reloadedcbeyls
 
CUDA by Example : Streams : Notes
CUDA by Example : Streams : NotesCUDA by Example : Streams : Notes
CUDA by Example : Streams : NotesSubhajit Sahu
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debuggingJungMinSEO5
 
Parallel Programming With Dot Net
Parallel Programming With Dot NetParallel Programming With Dot Net
Parallel Programming With Dot NetNeeraj Kaushik
 
Node.js Event Loop & EventEmitter
Node.js Event Loop & EventEmitterNode.js Event Loop & EventEmitter
Node.js Event Loop & EventEmitterSimen Li
 
NodeJSnodesforfreeinmyworldgipsnndnnd.pdf
NodeJSnodesforfreeinmyworldgipsnndnnd.pdfNodeJSnodesforfreeinmyworldgipsnndnnd.pdf
NodeJSnodesforfreeinmyworldgipsnndnnd.pdfVivekSonawane45
 
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...PROIDEA
 
CoreOS, or How I Learned to Stop Worrying and Love Systemd
CoreOS, or How I Learned to Stop Worrying and Love SystemdCoreOS, or How I Learned to Stop Worrying and Love Systemd
CoreOS, or How I Learned to Stop Worrying and Love SystemdRichard Lister
 

Similaire à Cuda 2 (20)

Node js lecture
Node js lectureNode js lecture
Node js lecture
 
OS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switchOS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switch
 
Microkernel Development
Microkernel DevelopmentMicrokernel Development
Microkernel Development
 
Java Concurrency in Practice
Java Concurrency in PracticeJava Concurrency in Practice
Java Concurrency in Practice
 
LCA13: Common Clk Framework DVFS Roadmap
LCA13: Common Clk Framework DVFS RoadmapLCA13: Common Clk Framework DVFS Roadmap
LCA13: Common Clk Framework DVFS Roadmap
 
Sysprog 14
Sysprog 14Sysprog 14
Sysprog 14
 
CUDA Deep Dive
CUDA Deep DiveCUDA Deep Dive
CUDA Deep Dive
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
 
Concurrency Learning From Jdk Source
Concurrency Learning From Jdk SourceConcurrency Learning From Jdk Source
Concurrency Learning From Jdk Source
 
Let's Talk Locks!
Let's Talk Locks!Let's Talk Locks!
Let's Talk Locks!
 
Android Loaders : Reloaded
Android Loaders : ReloadedAndroid Loaders : Reloaded
Android Loaders : Reloaded
 
Operating System Assignment Help
Operating System Assignment HelpOperating System Assignment Help
Operating System Assignment Help
 
CUDA by Example : Streams : Notes
CUDA by Example : Streams : NotesCUDA by Example : Streams : Notes
CUDA by Example : Streams : Notes
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
Parallel Programming With Dot Net
Parallel Programming With Dot NetParallel Programming With Dot Net
Parallel Programming With Dot Net
 
Java Concurrency
Java ConcurrencyJava Concurrency
Java Concurrency
 
Node.js Event Loop & EventEmitter
Node.js Event Loop & EventEmitterNode.js Event Loop & EventEmitter
Node.js Event Loop & EventEmitter
 
NodeJSnodesforfreeinmyworldgipsnndnnd.pdf
NodeJSnodesforfreeinmyworldgipsnndnnd.pdfNodeJSnodesforfreeinmyworldgipsnndnnd.pdf
NodeJSnodesforfreeinmyworldgipsnndnnd.pdf
 
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
 
CoreOS, or How I Learned to Stop Worrying and Love Systemd
CoreOS, or How I Learned to Stop Worrying and Love SystemdCoreOS, or How I Learned to Stop Worrying and Love Systemd
CoreOS, or How I Learned to Stop Worrying and Love Systemd
 

Plus de Anshul Sharma (12)

Understanding concurrency
Understanding concurrencyUnderstanding concurrency
Understanding concurrency
 
Interm codegen
Interm codegenInterm codegen
Interm codegen
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open Mp
 
Open MPI 2
Open MPI 2Open MPI 2
Open MPI 2
 
Open MPI
Open MPIOpen MPI
Open MPI
 
Paralle programming 2
Paralle programming 2Paralle programming 2
Paralle programming 2
 
Parallel programming
Parallel programmingParallel programming
Parallel programming
 
Cuda 3
Cuda 3Cuda 3
Cuda 3
 
Cuda intro
Cuda introCuda intro
Cuda intro
 
Des
DesDes
Des
 
Intoduction to Linux
Intoduction to LinuxIntoduction to Linux
Intoduction to Linux
 
GCC
GCCGCC
GCC
 

Dernier

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 

Dernier (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 

Cuda 2

  • 1. CUDA Programming continued ITCS 4145/5145 Nov 24, 2010 © Barry Wilkinson Revised
  • 2. 2 Timing GPU Execution Can use CUDA “events” – create two events and compute the time between them: cudaEvent_t start, stop; float elapsedTime; cudaEventCreate(&start); // create event objects cudaEventCreate(&stop); cudaEventRecord(start, 0); // Record start event . . . cudaEventRecord(stop, 0); // record end event cudaEventSynchronize(stop); // wait work preceding to complete cudaEventRecord(stop,0) cudaEventElapsedTime(&elapsedTime, start, stop); //compute elapsed time between events cudaEventDestroy(start); //destroy start event cudaEventDestroy(stop);); //destroy stop event
  • 3. 3
  • 4. 4
  • 5. 5
  • 6. 6
  • 7. 7
  • 8. 8 Host Synchronization Kernels • Control returned to CPU immediately (asynchronous, non-blocking) • Kernel starts after all previous CUDA calls completed cudaMemcpy • Returns after copy complete (synchronous) • Copy starts after all previous CUDA calls completed
  • 9. 9 CUDA Synchronization Routines Host cudaThreadSynchronize() • Blocks until all previous CUDA calls complete GPU void __syncthreads() • Synchronizes all threads in a block • Barrier – no thread can pass until all threads in block reach it. • All threads must reach __syncthread in thread block.
  • 10. 10 GPU Atomic Operations Performs a read-modify-write atomic operation on one word residing in global or shared memory. Associative operations on signed/unsigned integers, add, sub, min, max, and, or, xor, increment, decrement, exchange, compare and swap. Requires GPU with compute capability 1.1+ (Shared memory operations and 64-bit words require higher capability) coit-grid06 Tesla C2050 has compute capability 2.0 See http://www.nvidia.com/object/cuda_gpus.html for GPU compute capabilities
  • 11. 11 int atomicAdd(int* address, int val); reads old located at address address in global or shared memory, computes (old + val), and stores result back to memory at same address. These three operations (read, compute, and write) are performed in one atomic transaction.* Function returns old. Atomic Operation Example * Once stated, it continues to completion without being able to be interrupted by other processors. Other processors cannot read or write to memory location once atomic operation starts. Mechanism implemented in hardware.
  • 12. 12 Other operations int atomicSub(int* address, int val); int atomicExch(int* address, int val); int atomicMin(int* address, int val); int atomicMax(int* address, int val); unsigned int atomicInc(unsigned int* address, unsigned int val); unsigned int atomicDec(unsigned int* address, unsigned int val); int atomicCAS(int* address, int compare, int val); //compare and swap int atomicAnd(int* address, int val); int atomicOr(int* address, int val); int atomicXor(int* address, int val); Source: NVIDIA CUDA C Programming Guide, version 3.2, 11/9/2010
  • 13. 13 int atomicCAS(int* address, int compare, int val); reads the word old located at address address in global or shared memory, and compares old with compare. If they are the same, it set old to val (stores val at address address), i.e.: if (old == compare) old = val; // else old = old The three operations (read, compute, and write) are performed in one atomic transaction. The function returns the original value of old. Also unsigned and unsigned long long int versions. Compare and Swap (also called compare and exchange)
  • 14. 14 __device__ int lock=0; // unlocked __global__ void kernel(...) { ... do {} while(atomicCAS(&lock,0,1)); // if lock = 0 set to1 // and continue ... // critical section lock = 0; // free lock } Coding Critical Sections with Locks
  • 15. 15 Memory Fences Threads may see the effects of a series of writes to memory executed by another thread in different orders. To enforce ordering: void __threadfence_block(); waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block. Other routines: void __threadfence(); void __threadfence_system();
  • 16. 16 Writes to device memory not guaranteed in any order, so global writes may not have completed by the time the lock is unlocked __global__ void kernel(...) { ... do {} while(atomicCAS(&lock,0,1)); ... // criticial section __threadfence(); // wait for writes to finish lock = 0; } Critical sections with memory operations
  • 17. 17 Error reporting All CUDA calls (except kernel launches) return error code of type cudaError_t cudaError_t cudaGetLastError(void) Returns code for the last error Can be used to get error from kernel execution. Char* cudaGetErroprString(cudaError_t code) Returns a null-terminated character string describing error Example print(“%sn”,cudaGetErrorString(cudaGetLastError());