6. Industrial applications
• Street View Panorama, etc. (Google)
• Vision system of the PR2 robot (Willow Garage)
• Robots for Mars exploration (NASA)
• Quality control of the production of coins (China)
8. Contribution/patch workflow:
see OpenCV wiki
OpenCV infrastructure
build.opencv.org: buildbot with 50+ builders
50+ builds nightly!
github.com/itseez/opencv
pullrequest.opencv.org
Every patch to OpenCV must pass 7 builders!
9. OpenCV resources
1. Home: opencv.org
2. Docs and tutorials: docs.opencv.org
3. Q&A forum: answers.opencv.org
4. Wiki and issues: code.opencv.org
5. Develop: https://github.com/Itseez/opencv
6. Packages: sourceforge.net/projects/opencvlibrary/
10. OpenCL™ in OpenCV 2.4
• ‘ocl’ is a separate module (cv::ocl::resize())
• runs on various OpenCL-compliant devices and OSes
• 2.4.7 release on November 6
– official Windows bin pack with OpenCL enabled
– OpenCV pre-commit check includes OpenCL tests
– 200+ pull requests since 2.4.6 (most actively developed OpenCV part)
– dynamic OpenCL runtime loading
– set default OpenCL device via environment variable
– ~800 optimized kernels, ~30% of most commonly used functionality
– 8000+ accuracy and ~500 performance tests
– can be built without OpenCL SDK installed
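The device-selection bullet can be illustrated with a minimal sketch. It assumes the variable is named OPENCV_OPENCL_DEVICE and holds a colon-separated <platform>:<type>:<device> triple. Both the name and the format are assumptions here, not facts stated on the slide, and DeviceSpec / parseDeviceSpec are hypothetical helpers rather than OpenCV API.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical holder for a parsed device request (not an OpenCV type).
struct DeviceSpec {
    std::string platform;  // e.g. "AMD"; empty means "any"
    std::string type;      // e.g. "GPU" or "CPU"; empty means "any"
    std::string device;    // device name or index; empty means "any"
};

// Split "<platform>:<type>:<device>" on ':'; missing fields stay empty.
DeviceSpec parseDeviceSpec(const std::string& text) {
    DeviceSpec spec;
    std::string* fields[3] = { &spec.platform, &spec.type, &spec.device };
    std::size_t start = 0;
    for (int i = 0; i < 3 && start <= text.size(); ++i) {
        // The last field takes everything that remains.
        std::size_t end = (i < 2) ? text.find(':', start) : std::string::npos;
        if (end == std::string::npos) end = text.size();
        *fields[i] = text.substr(start, end - start);
        start = end + 1;
    }
    return spec;
}

// Read the (assumed) environment variable; an unset variable yields an empty spec.
DeviceSpec deviceSpecFromEnvironment() {
    const char* value = std::getenv("OPENCV_OPENCL_DEVICE");
    return parseDeviceSpec(value ? value : "");
}
```

Under these assumptions, a value such as AMD:GPU:0 would request device 0 of type GPU on the AMD platform, while an unset variable leaves the choice to the library.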
12. HETEROGENEOUS COMPUTE AND OPENCV
The OpenCL™ Module in OpenCV
Heterogeneous compute and Computer Vision
Compute paths and data representations
Future roadmap: transparent API
12 OPENCV-CL | NOVEMBER 12,2013 | DR. HARRIS GASPARAKIS | CONFIDENTIAL
13. OPENCV’S OPENCL™ MODULE
Lets developers take advantage of OpenCL™ acceleration, but currently only as an explicit path that the developer must choose to call. All OpenCL memory buffer types are supported, but not automatically optimized.
‒ But stay tuned for OpenCV 3.0’s transparent API.
Initial release: OpenCV 2.4.3 [11/2012]
Currently ~800 kernels
‒ Image processing
‒ Pixel-wise operations
‒ Geometric transforms
‒ Pixel transforms: filtering, edges, corners etc
‒ Feature detection and matching
‒ SURF, HOG, Haar, brute-force matching, kNN, template matching
‒ Object recognition
‒ SVM: Support Vector Machine
Applications, including:
‒ Face Detect
‒ Optical flow
‒ Stereo Matching
14. COMPILING FROM SOURCE
OpenCL™ is enabled by default in CMake
15. COMPILING FROM SOURCE
BROWSE/BUILD CODE IN AN IDE
‒ OpenCL™ module (2.4.x): rebuild it even if you only change a kernel.
‒ OpenCL kernels: these are converted to kernels.cpp by a script, hence the rebuild whenever a kernel changes.
‒ OpenCL samples: after you build them, go to [ROOT]\bin\[CONFIG]\ and look for ocl-example-*.exe
16. INCORPORATING OPENCV INTO YOUR OWN CODE
APP SDK provides 3 examples. Very easy integration!
With fewer than 15 lines of code you can have a minimal program that reads video frames, passes them to the OpenCL™ device, and runs your own simple kernel! OpenCV-CL:
‒ takes care of all OpenCL plumbing
‒ compiles the kernels, caches them at runtime, and saves the OpenCL binaries on disk (the user can modify this default behavior)
‒ allows specifying an OpenCL device/platform via an environment variable
‒ allows plugging your own kernels into OpenCV-CL, using the OpenCV-CL data structures
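The compile-once-and-cache behavior in the list above can be sketched without any OpenCL dependency. KernelCache, CompiledKernel, and the stand-in compile step are hypothetical names used only for illustration; the real OpenCV-CL cache, including the binaries it saves on disk, is more involved than this in-memory map.

```cpp
#include <map>
#include <string>
#include <utility>

// Stand-in for a compiled OpenCL program (hypothetical, not an OpenCV type).
struct CompiledKernel {
    std::string source;
    std::string buildOptions;
};

// Caches kernels by (source, build options) so each pair is built only once.
class KernelCache {
public:
    typedef std::pair<std::string, std::string> Key;

    const CompiledKernel& get(const std::string& source,
                              const std::string& buildOptions) {
        const Key key(source, buildOptions);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // Cache miss: pay the build cost once and remember the result.
            it = cache_.emplace(key, compile(source, buildOptions)).first;
        }
        return it->second;
    }

    // Number of times the (pretend) compiler actually ran.
    int compileCount() const { return compileCount_; }

private:
    // Placeholder for the expensive OpenCL build step.
    CompiledKernel compile(const std::string& source,
                           const std::string& buildOptions) {
        ++compileCount_;
        CompiledKernel kernel;
        kernel.source = source;
        kernel.buildOptions = buildOptions;
        return kernel;
    }

    std::map<Key, CompiledKernel> cache_;
    int compileCount_ = 0;
};
```

Requesting the same source with the same options twice triggers one build; changing the build options counts as a different kernel and triggers another.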
17. INCORPORATING OPENCV INTO YOUR OWN CODE
SOME CODE, FROM APP SDK 2.9, GESTURE SAMPLE, SHOWCASING OPENNI® INTEGRATION
// In one line, populate an image on the GPU:
cv::Mat depthImgClamp = cv::Mat(SIZEY, SIZEX, CV_8UC1, openniBuffer);
cv::ocl::oclMat oclDepthImgClamp(depthImgClamp);

// In one command, add your own kernel launch, acting on OpenCV-CL data structures:
vector<pair<size_t, const void *> > args;
args.push_back(make_pair(sizeof(cl_mem), (void *)&src.data));
args.push_back(make_pair(sizeof(cl_mem), (void *)&oclDst.data));
openCLExecuteKernelInterop(oclDst.clCxt, &depthConvertSrcStr, "convertDepthToWorldCoordinates",
                           globalThreads, localThreads, args, -1, -1, "",
                           false, false, true);
18. HETEROGENEOUS COMPUTE AND COMPUTER VISION
Webcams everywhere
Heterogeneous compute everywhere
Real-time computer vision everywhere
Heterogeneous compute mission: to take optimal advantage of the full capabilities of the underlying platform.
‒ APU / HSA APU
‒ Discrete GPU
‒ CPU
‒ FPGA, DSP, etc.
Many code paths?
‒ Possibly interleaving execution between different devices
Many data representations?
19. DATA REPRESENTATIONS
DISCRETE
APUS, OPENCL™ 1.2
Copy data to/from GPU
Map/unmap using pinned memory
Use “device memory” for data that is used between GPU kernels
‒ True for all generations. Special memory that can be read and written fast by the GPU.
‒ On APUs: physically part of main memory, possibly with special paths.
‒ But: device memory cannot be read or written very fast from the CPU.
Zero copy (map/unmap): best path for data written by the CPU and read by the GPU, or vice versa.
Cannot mix and match (bounce back and forth between) CPU and GPU well.
Small kernels are typically a bad idea.
20. H1’14: APUS, OPENCL™ 1.2 + HSA extensions OR OPENCL 2.0
Can still use “device memory” for data that is used between GPU kernels, and zero copy is still available.
However: SVM (shared virtual memory) can be written to and read from both CPU and GPU fast “enough”
‒ Enables ping/pong (producer/consumer) between CPU and GPU
‒ Enables concurrent producer/consumer between CPU/GPU (platform atomics)
‒ Much easier to port a vision pipeline using HSA. You can incrementally pick and choose which parts of the pipeline to accelerate, and which parts to leave on the CPU.
‒ On HSA APUs, using SVM is reasonable (and better than current defaults), significantly simplifying code.
User mode enqueueing: much faster kernel dispatch means small kernels degrade performance less. The CPU can feed the GPU smaller computational tasks quickly and (busy) wait for the results.
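Slides 19 and 20 together suggest a different default data path per platform for data that both the CPU and the GPU touch. The sketch below encodes that reading as a lookup. It is a deliberate simplification: the enum names and the function preferredPathForSharedData are hypothetical, and a real framework would likely consider more than the platform kind.

```cpp
// Platform kinds named on the slides (hypothetical enum, not an OpenCV or OpenCL type).
enum class Platform { DiscreteGpu, Apu, HsaApu };

// Ways to make CPU-side data visible to GPU kernels, as discussed on slides 19 and 20.
enum class MemoryPath { ExplicitCopy, ZeroCopyMap, SharedVirtualMemory };

// One possible default for data exchanged between CPU and GPU.
MemoryPath preferredPathForSharedData(Platform platform) {
    switch (platform) {
    case Platform::DiscreteGpu:
        // Slide 19: copy data to/from the GPU.
        return MemoryPath::ExplicitCopy;
    case Platform::Apu:
        // Slide 19: zero copy (map/unmap) is the best path for data
        // written by one side and read by the other.
        return MemoryPath::ZeroCopyMap;
    case Platform::HsaApu:
        // Slide 20: SVM is fast "enough" from both sides and simplifies code.
        return MemoryPath::SharedVirtualMemory;
    }
    return MemoryPath::ExplicitCopy;  // conservative fallback for unknown values
}
```

Data that only GPU kernels touch would still live in device memory on every platform, which is why the function is scoped to shared data.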
21. COMPUTE PATHS
OpenCV 2.4.x: Face detect on CPU
[Removed image demonstrating face detect]
// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
Mat frame, frameGray;
vector<Rect> faces;
for(;;){
    // processing loop
    vcap >> frame;
    cvtColor(frame, frameGray, COLOR_BGR2GRAY);
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces, ...);
    // draw rectangles …
    // show image …
}
22. COMPUTE PATHS
OpenCV 2.4.x: Face detect with OpenCL™
[Removed image demonstrating face detect]
// initialization
VideoCapture vcap(...);
ocl::OclCascadeClassifier fd("haar_ff.xml");
ocl::oclMat frame, frameGray;
Mat frameCpu;
vector<Rect> faces;
for(;;){
    // processing loop
    vcap >> frameCpu;
    frame = frameCpu;  // upload the frame to the OpenCL device
    ocl::cvtColor(frame, frameGray, COLOR_BGR2GRAY);
    ocl::equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces, ...);
    // draw rectangles …
    // show image …
}
23. FUTURE ROADMAP
‒ Incorporate OpenCL™ 1.2 with HSA extensions, and OpenCL 2.0
‒ Shared Virtual Memory (SVM) significantly simplifies programming model in general. Allows reusing existing memory as SVM.
‒ In SVM, a “pointer is a pointer”
‒ Pass your tree/linked list/graph data structure to the GPU, have threads explore sub-branches, or explore paths on a graph
‒ Transparent API:
‒ One code path: OpenCV will choose the best execution path at runtime, given the platform.
‒ Changes of data locality should be implemented by the framework.
‒ Includes applying heuristics appropriate for the underlying hardware (dGPU, APU, HSA APU).
‒ Eventually it should be self-optimizing:
‒ Reasonably define the optimal memory type “under the hood.”
‒ Detect data flow dependencies in the pipeline, and automatically represent them as OpenCL events.
‒ Starting with OpenCV 3.0.
26. OpenCV 3.0
• OpenCV 3.0 is scheduled for Q1 2014
• Based on 2.x, but:
– transparent API and more efficient, platform-specific OpenCL™ codepaths (including better zero-copy and SVM support)
– API cleanup
– a lot of new algorithms
27. Transparent API
• same code can run on CPU or GPU
– no specialized cv::ocl::Canny vs cv::Canny
– no recompilation is needed
• includes the following key components:
– new data structure UMat (Universal Mat)
– simple and robust mechanism for async processing
– convenient API for custom algorithm implementation
• minimal or no changes in the existing code
– CPU-only processing: no changes required
28. UMat
• Mat=>UMat is the only change needed
• On some platforms (HSA), even that change is not needed!
// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
UMat frame, frameGray;
vector<Rect> faces;
for(;;){
// processing loop
vcap >> frame;
cvtColor(frame, frameGray, COLOR_BGR2GRAY);
equalizeHist(frameGray, frameGray);
fd.detectMultiScale(frameGray, faces, ...);
// draw rectangles …
// show image …
}
29. Transparent API: under the hood
bool _ocl_cvtColor(InputArray src, OutputArray dst, int code) {
    static ocl::ProgramSource oclsrc("// cvtcolor.cl source code\n …");
    UMat src_ocl = src.getUMat(), dst_ocl = dst.getUMat();
    if (code == COLOR_BGR2GRAY) {
        // get the kernel; kernel is compiled only once and cached
        ocl::Kernel kernel("bgr2gray", oclsrc, <compile_flags>);
        // pass 2 arrays to the kernel and run it
        return kernel.args(src_ocl, dst_ocl).run(0, 0, false);
    } else if (code == COLOR_BGR2YUV) { … }
    return false;
}
void _cpu_cvtColor(const Mat& src, Mat& dst, int code) { … }
// transparent API dispatcher function
void cvtColor(InputArray src, OutputArray dst, int code) {
    dst.create(src.size(), …);
    // try the OpenCL path first; fall through to the CPU path if it declines
    if (useOpenCL(src, dst) && _ocl_cvtColor(src, dst, code)) return;
    // getMat() uses zero-copy if available; with SVM it is a no-op
    Mat src_cpu = src.getMat();
    Mat dst_cpu = dst.getMat();
    _cpu_cvtColor(src_cpu, dst_cpu, code);
}
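The dispatcher's control flow (try the accelerated path, fall back to the CPU path when it declines) can be exercised in a self-contained sketch that does not depend on OpenCV. Everything here is hypothetical scaffolding: the deviceAvailable flag stands in for useOpenCL(), both paths share one per-pixel helper so that only the dispatch differs, and the 0.299/0.587/0.114 weights are the usual luma coefficients for BGR-to-gray conversion.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Which path handled the call (hypothetical, for illustration only).
enum class Path { Accelerated, Cpu };

// Gray value of one pixel given in B, G, R order.
static std::uint8_t grayOf(std::uint8_t b, std::uint8_t g, std::uint8_t r) {
    return static_cast<std::uint8_t>(std::lround(0.299 * r + 0.587 * g + 0.114 * b));
}

// Shared per-pixel loop; `gray` must already hold one element per pixel.
static void convertAll(const std::vector<std::uint8_t>& bgr,
                       std::vector<std::uint8_t>& gray) {
    for (std::size_t i = 0; i < gray.size(); ++i)
        gray[i] = grayOf(bgr[3 * i], bgr[3 * i + 1], bgr[3 * i + 2]);
}

// Stand-in for the OpenCL path: returns false when it cannot run.
bool tryAcceleratedBgrToGray(const std::vector<std::uint8_t>& bgr,
                             std::vector<std::uint8_t>& gray,
                             bool deviceAvailable) {
    if (!deviceAvailable) return false;
    convertAll(bgr, gray);  // a real version would launch a device kernel here
    return true;
}

// CPU fallback path.
void cpuBgrToGray(const std::vector<std::uint8_t>& bgr,
                  std::vector<std::uint8_t>& gray) {
    convertAll(bgr, gray);
}

// Dispatcher: allocate the output, try the accelerated path, else use the CPU.
Path bgrToGray(const std::vector<std::uint8_t>& bgr,
               std::vector<std::uint8_t>& gray,
               bool deviceAvailable) {
    gray.resize(bgr.size() / 3);
    if (tryAcceleratedBgrToGray(bgr, gray, deviceAvailable)) return Path::Accelerated;
    cpuBgrToGray(bgr, gray);
    return Path::Cpu;
}
```

The caller sees a single function regardless of which path ran, which is the property the transparent API is after.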
30. OpenCV+OpenCL™ execution model
[Diagram: CPU threads, each with its own cv::ocl::Queue attached to a cv::ocl::Device]
• One OpenCL queue and one OpenCL device per CPU thread
• OpenCL kernels are executed asynchronously
• cv::ocl::finish() puts the barrier in the current CPU thread; .getMat() automatically calls it.
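The enqueue-then-barrier behavior in these bullets can be modeled with a toy, single-threaded queue. This is only an analogy under stated assumptions: ToyQueue, currentThreadQueue, and ToyBuffer are hypothetical names, not the cv::ocl classes, and the toy merely defers work until finish(), whereas a real device would run kernels concurrently with the host.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Toy stand-in for a per-thread command queue (not cv::ocl::Queue).
class ToyQueue {
public:
    // Enqueue returns immediately; the task runs later.
    void enqueue(std::function<void()> task) { pending_.push_back(std::move(task)); }

    // Barrier: run everything enqueued so far, in submission order.
    void finish() {
        while (!pending_.empty()) {
            std::function<void()> task = std::move(pending_.front());
            pending_.pop_front();
            task();
        }
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<std::function<void()>> pending_;
};

// One queue per CPU thread, mirroring the execution model on the slide.
ToyQueue& currentThreadQueue() {
    thread_local ToyQueue queue;
    return queue;
}

// Toy buffer whose host view synchronizes first, as .getMat() is described to do.
struct ToyBuffer {
    int value = 0;
    int hostView() {
        currentThreadQueue().finish();  // implicit barrier before the host reads
        return value;
    }
};
```

A task enqueued against a ToyBuffer leaves the buffer untouched until hostView() (or an explicit finish()) is called, which is why reading results on the host is a natural synchronization point.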
31. Summary & Future directions
• OpenCL™ is a great tool to boost the performance of vision algorithms; OpenCV unleashes its potential for the CV community
• OpenCV 3.0 transparent API makes it even easier and … more transparent
• possible directions: pipelines, memory allocation optimization, more algorithms ported to OpenCL