MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko

OPENCV
OPENCL™ ACCELERATED COMPUTER VISION

 OpenCV Introduction
Andrey Pavlenko, Itseez

 Heterogeneous Compute and OpenCV
Dr. Harris Gasparakis, AMD

 OpenCV 3.0
Vadim Pisarevsky, Itseez

OpenCV introduction
Andrey Pavlenko
1. Features
2. History
3. Development Process
4. Performance

Open-source Computer Vision Library
1. 2,500+ algorithms and functions
2. Cross-platform

3. Liberal BSD license
4. High performance
5. Professionally developed
6. 7M+ downloads

Functionality overview
Image Processing

Filters

Transformations

Edges, contours

Robust features

Segmentation

Video, Stereo, 3D

Calibration

Pose estimation

Optical Flow

Detection and recognition

Depth

Industrial applications
• Street View Panorama, etc. (Google)
• Vision system of the PR2 robot (Willow Garage)
• Robots for Mars exploration (NASA)
• Quality control of the production of coins (China)

OpenCV History
[Timeline figure; labels: Popularity, Contributors, Core team]

2000: First public release
2008
2009: v2.0, C++ API
2012: @github
2013: v2.4.3, opencl
present

Contribution/patch workflow:
see OpenCV wiki

OpenCV infrastructure
build.opencv.org: buildbot with 50+ builders

50+ builds nightly!

github.com/itseez/opencv
pullrequest.opencv.org

Every patch to OpenCV
must pass 7 builders!

OpenCV resources
1. Home: opencv.org
2. Docs and tutorials: docs.opencv.org
3. Q&A forum: answers.opencv.org
4. Wiki and issues: code.opencv.org
5. Develop: https://github.com/Itseez/opencv

6. Packages: sourceforge.net/projects/opencvlibrary/

OpenCL™ in OpenCV 2.4
• ‘ocl’ is a separate module (cv::ocl::resize())
• runs on various OpenCL-compliant devices and OSes
• 2.4.7 release on November 6
– official Windows bin pack with OpenCL enabled
– OpenCV pre-commit check includes OpenCL tests
– 200+ pull requests since 2.4.6 (most actively developed OpenCV part)
– dynamic OpenCL runtime loading
– set default OpenCL device via environment variable
– ~800 optimized kernels, ~30% of most commonly used functionality
– 8000+ accuracy and ~500 performance tests
– can be built without OpenCL SDK installed

OpenCL™ performance in OpenCV 2.4
AMD A10-6800k (with HD8670D) + Radeon HD7790

HETEROGENEOUS COMPUTE AND OPENCV
 The OpenCL™ Module in OpenCV
 Heterogeneous compute and Computer Vision
 Compute paths and data representations

 Future roadmap: transparent API

OPENCV-CL | NOVEMBER 12, 2013 | DR. HARRIS GASPARAKIS | CONFIDENTIAL

OPENCV’S OPENCL™ MODULE
 Enables taking advantage of OpenCL™ acceleration, but currently it is an explicit path a developer can
choose to call. All OpenCL memory buffer types are supported, but not automatically optimized.
‒ But stay tuned for OpenCV 3.0’s transparent API.

 Initial release: OpenCV 2.4.3 [11/2012]
 Currently ~800 kernels
‒ Image processing
‒ Pixel-wise operations
‒ Geometric transforms
‒ Pixel transforms: filtering, edges, corners etc

‒ Feature detection and matching
‒ SURF, HOG, Haar, brute-force matching, kNN, template matching

‒ Object recognition
‒ SVM: Support Vector Machine

 Applications, including:
‒ Face Detect
‒ Optical flow
‒ Stereo Matching


COMPILING FROM SOURCE

 OpenCL™ is enabled by default in CMake


COMPILING FROM SOURCE
BROWSE/BUILD CODE IN AN IDE

OpenCL™ module (2.4.x): rebuild it even if you only change a kernel.

OpenCL kernels: these are converted to kernels.cpp by a script (hence you need to rebuild if you change a kernel).

OpenCL samples: after you build them, go to [ROOT]\bin\[CONFIG] and look for ocl-example-*.exe


INCORPORATING OPENCV INTO YOUR OWN CODE
 APP SDK provides 3 examples. Very easy integration!
 With fewer than 15 lines of code you can have a minimal program that reads video frames, passes them to the OpenCL™ device, and runs your own simple kernel! OpenCV-CL:
‒ takes care of all OpenCL plumbing
‒ compiles the kernels, caches them at runtime, and saves the OpenCL binaries on disk (the user can modify the default behavior)
‒ allows specifying an OpenCL device/platform via environment variable
‒ allows plugging your own kernels into OpenCV-CL, using the OpenCV-CL data structures
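
The "compile once, then cache" behavior listed above can be pictured with a small standard-library sketch. This is only an illustration under stated assumptions: `CompiledKernel`, `KernelCache` and the kernel name used in the usage note are hypothetical and are not OpenCV-CL types; the real module also stores binaries on disk, which this toy version does not attempt.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a compiled OpenCL program; not an OpenCV type.
struct CompiledKernel {
    std::string name;
    std::string buildOptions;
};

// Toy in-memory cache: each (kernel name, build options) pair is "compiled" at most once.
class KernelCache {
public:
    const CompiledKernel& get(const std::string& name, const std::string& buildOptions) {
        const auto key = std::make_pair(name, buildOptions);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // A real implementation would build the OpenCL program here.
            ++compileCount_;
            it = cache_.emplace(key, CompiledKernel{name, buildOptions}).first;
        }
        return it->second;
    }

    // Number of cache misses so far, i.e. how many times a build was needed.
    int compileCount() const { return compileCount_; }

private:
    std::map<std::pair<std::string, std::string>, CompiledKernel> cache_;
    int compileCount_ = 0;
};
```

Requesting the same kernel twice with identical build options (for example `cache.get("myKernel", "-D DEPTH=0")`) triggers a single build; changing the options produces a second entry, since different options may yield different binaries.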


INCORPORATING OPENCV INTO YOUR OWN CODE
SOME CODE, FROM APP SDK 2.9, GESTURE SAMPLE, SHOWCASING OPENNI® INTEGRATION

// In one line, populate an image in GPU!
cv::Mat depthImgClamp = cv::Mat(SIZEY, SIZEX, CV_8UC1, openniBuffer);
cv::ocl::oclMat oclDepthImgClamp(depthImgClamp);

// In one command, add your own kernel launch, acting on OpenCV-CL data structures
vector<pair<size_t, const void *> > args;
args.push_back(make_pair(sizeof(cl_mem), (void *)&src.data));
args.push_back(make_pair(sizeof(cl_mem), (void *)&oclDst.data));

openCLExecuteKernelInterop(oclDst.clCxt, &depthConvertSrcStr, "convertDepthToWorldCoordinates",
                           globalThreads, localThreads, args, -1, -1, "",
                           false, false, true);


HETEROGENEOUS COMPUTE AND COMPUTER VISION
Webcams everywhere
Heterogeneous compute everywhere
Real-time computer vision everywhere
 Heterogeneous compute mission: To take optimal advantage of the full capabilities of the underlying
platform.
‒ APU / HSA APU
‒ Discrete GPU
‒ CPU
‒ FPGA, DSP, etc.

Many code paths?
- Possibly interleaving execution between different devices

Many data representations?

DATA REPRESENTATIONS

DISCRETE
 Copy data to/from GPU
 Map/unmap using pinned memory

APUS, OPENCL™ 1.2
 Use “device memory” for data that is used between GPU kernels
‒ True for all generations. Special memory that can be read and written fast by GPU.
‒ On APUs: physically part of main memory, possibly with special paths.
‒ But: device memory cannot be read/written very fast from CPU.
 Zero copy (map/unmap): best path for data written by the CPU and read by the GPU, or vice versa.
 Cannot mix and match (bounce back and forth between) CPU and GPU well.
 Small kernels are typically a bad idea

H1’14: APUS, OPENCL™ 1.2 + HSA extensions OR OPENCL 2.0

 Can still use “device Memory” for data that is used between GPU kernels, and zero copy still
available.
 However: SVM (shared virtual memory) can be written to/read from both CPU and GPU fast
“enough”
‒ Enables ping/pong (producer/consumer) between CPU and GPU
‒ Enables concurrent producer/consumer between CPU/GPU (platform atomics)
‒ Much easier to port a vision pipeline using HSA. You can incrementally pick and choose what part of
the pipeline to accelerate, and what part to allow the CPU to execute.
‒ On HSA APUs, using SVM is reasonable (and better than current defaults), significantly simplifying code.

 User mode enqueueing: much faster kernel dispatching leads to less performance
degradation of small kernels. Can feed the GPU smaller computational tasks fast, and (busy)
wait for results on the CPU.


COMPUTE PATHS
OpenCV 2.4.x: Face detect on CPU

[Removed image demonstrating face detect]

// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
Mat frame, frameGray;
vector<Rect> faces;
for(;;){
    // processing loop
    vcap >> frame;
    cvtColor(frame, frameGray, CV_BGR2GRAY);
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces, ...);
    // draw rectangles …
    // show image …
}

COMPUTE PATHS
OpenCV 2.4.x: Face detect with OpenCL™
[Removed image demonstrating face detect]

// initialization
VideoCapture vcap(...);
ocl::OclCascadeClassifier fd("haar_ff.xml");
ocl::oclMat frame, frameGray;
Mat frameCpu;
vector<Rect> faces;
for(;;){
    // processing loop
    vcap >> frameCpu;
    frame = frameCpu;  // upload the frame to the OpenCL device
    ocl::cvtColor(frame, frameGray, CV_BGR2GRAY);
    ocl::equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces, ...);
    // show image …
}


FUTURE ROADMAP
‒ Incorporate OpenCL™ 1.2 with HSA extensions, and OpenCL 2.0
  ‒ Shared Virtual Memory (SVM) significantly simplifies the programming model in general. Allows reusing existing memory as SVM.
  ‒ In SVM, a “pointer is a pointer”
  ‒ Pass your tree/linked list/graph data structure to the GPU, have threads explore sub-branches, or explore paths on a graph
‒ Transparent API:
  ‒ One code path; OpenCV will choose the best execution path at runtime, given the platform.
  ‒ Changes of data locality should be implemented by the framework.
  ‒ Includes applying heuristics appropriate for the underlying hardware (dGPU, APU, HSA APU).
  ‒ Eventually it should be self-optimizing:
    ‒ reasonably define the optimal memory type “under the hood.”
    ‒ detect data flow dependencies in the pipeline, and automatically represent them as OpenCL events.
‒ Starting with OpenCV 3.0.


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names
are for informational purposes only and may be trademarks of their respective owners.

OpenCV 3.0
Vadim Pisarevsky
1. Transparent API
2. UMat
3. Under the hood

OpenCV 3.0
• OpenCV 3.0 is scheduled for 2014’Q1
• Based on 2.x, but:
– transparent API and more efficient, platform-specific OpenCL™ codepaths (including better zero-copy and SVM support)
– API cleanup
– a lot of new algorithms

Transparent API
• same code can run on CPU or GPU

– no specialized cv::ocl::Canny vs cv::Canny
– no recompilation is needed

• includes the following key components:
– new data structure UMat (Universal Mat)
– simple and robust mechanism for async processing
– convenient API for custom algorithm implementation

• minimal or no changes in the existing code
– CPU-only processing – no changes required

UMat

• Mat=>UMat is the only change needed
• On some platforms (HSA) even that change is not needed!

// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
UMat frame, frameGray;
vector<Rect> faces;
for(;;){
    // processing loop
    vcap >> frame;
    cvtColor(frame, frameGray, COLOR_BGR2GRAY);
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces, ...);
    // show image …
}

Transparent API: under the hood
bool _ocl_cvtColor(InputArray src, OutputArray dst, int code) {
    static ocl::ProgramSource oclsrc("//cvtcolor.cl source code\n …");
    UMat src_ocl = src.getUMat(), dst_ocl = dst.getUMat();
    if (code == COLOR_BGR2GRAY) {
        // get the kernel; kernel is compiled only once and cached
        ocl::Kernel kernel("bgr2gray", oclsrc, <compile_flags>);
        // pass 2 arrays to the kernel and run it
        return kernel.args(src_ocl, dst_ocl).run(0, 0, false);
    } else if (code == COLOR_BGR2YUV) { … }
    return false;
}
void _cpu_cvtColor(const Mat& src, Mat& dst, int code) { … }
// transparent API dispatcher function
void cvtColor(InputArray src, OutputArray dst, int code) {
    dst.create(src.size(), …);
    if (useOpenCL(src, dst) && _ocl_cvtColor(src, dst, code)) return;
    // getMat() uses zero-copy if available; with SVM it is a no-op
    Mat src_cpu = src.getMat();
    Mat dst_cpu = dst.getMat();
    _cpu_cvtColor(src_cpu, dst_cpu, code);
}
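
The dispatcher above follows a "try the accelerated path, otherwise fall back to the CPU" pattern. A minimal standard-library sketch of that pattern is given here so it can be compiled without OpenCV. Everything in it is hypothetical: `Path`, `cpuBgrToGray`, `acceleratedBgrToGray` and `bgrToGray` are illustration names, the `deviceAvailable` flag stands in for the `useOpenCL(...)` check, and the integer luma weights are a common approximation rather than a statement of what OpenCV computes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Path { Accelerated, Cpu };

// Reference CPU conversion for interleaved 8-bit BGR pixels.
// Weights 29/150/77 (sum 256) are a common integer approximation of luma.
void cpuBgrToGray(const std::vector<std::uint8_t>& bgr, std::vector<std::uint8_t>& gray) {
    for (std::size_t i = 0; i < gray.size(); ++i) {
        const unsigned b = bgr[3 * i], g = bgr[3 * i + 1], r = bgr[3 * i + 2];
        gray[i] = static_cast<std::uint8_t>((29 * b + 150 * g + 77 * r) >> 8);
    }
}

// Stand-in for the accelerated path: it reports failure when no device is usable.
// The CPU routine is reused as a placeholder for a kernel launch so the sketch runs anywhere.
bool acceleratedBgrToGray(const std::vector<std::uint8_t>& bgr,
                          std::vector<std::uint8_t>& gray, bool deviceAvailable) {
    if (!deviceAvailable) return false;
    cpuBgrToGray(bgr, gray);
    return true;
}

// Dispatcher: allocate the output, try the accelerated path, fall back to the CPU.
Path bgrToGray(const std::vector<std::uint8_t>& bgr,
               std::vector<std::uint8_t>& gray, bool deviceAvailable) {
    gray.assign(bgr.size() / 3, 0);
    if (acceleratedBgrToGray(bgr, gray, deviceAvailable)) return Path::Accelerated;
    cpuBgrToGray(bgr, gray);
    return Path::Cpu;
}
```

The caller sees a single function and the same result either way; only the returned `Path` reveals which branch ran, which is the property the transparent API aims for.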

OpenCV+OpenCL™ execution model
[Diagram: CPU threads, each with its own cv::ocl::Queue, mapped to cv::ocl::Device instances]

• One OpenCL queue and one OpenCL device per CPU thread
• OpenCL kernels are executed asynchronously
• cv::ocl::finish() puts the barrier in the current CPU thread; .getMat() automatically calls it.
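
The execution model in the bullets above (one queue per CPU thread, asynchronous submission, a barrier on read-back) can be mimicked with a toy queue. This is a sketch under loud assumptions: `ToyQueue`, `currentQueue` and `readBack` are hypothetical names and not the cv::ocl API, and the queued work here runs lazily inside `finish()` on the calling thread, whereas a real device executes it concurrently.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy model of an asynchronous command queue; not the cv::ocl::Queue API.
class ToyQueue {
public:
    // enqueue() only records the work and returns immediately.
    void enqueue(std::function<void()> task) { pending_.push_back(std::move(task)); }

    // finish() is the barrier: it returns only after all recorded work has completed.
    void finish() {
        for (auto& task : pending_) task();
        pending_.clear();
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};

// One queue per CPU thread, created on first use in that thread.
inline ToyQueue& currentQueue() {
    thread_local ToyQueue queue;
    return queue;
}

// Reading a result back on the CPU implies a barrier first.
inline int readBack(const int& deviceValue) {
    currentQueue().finish();
    return deviceValue;
}
```

After `currentQueue().enqueue(...)` the result is not yet visible to the caller; it becomes valid only once `readBack` (or an explicit `finish()`) has returned, which is why the slide notes that the read-back call inserts the barrier automatically.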

Summary & Future directions
• OpenCL™ is a great tool to boost performance of vision algorithms; OpenCV unleashes its potential for the CV community
• OpenCV 3.0 transparent API makes it even easier and … more transparent
• possible directions: pipelines, memory allocation optimization, more algorithms ported to OpenCL
