This tutorial is part of a series called SIFT: Theory and Practice:
1. SIFT: Introduction
2. SIFT: The scale space
3. SIFT: LoG approximations
4. SIFT: Finding keypoints
5. SIFT: Getting rid of low contrast keypoints
6. SIFT: Keypoint orientations
7. SIFT: Generating a feature
SIFT: Introduction
Matching features across different images is a common problem in computer vision. When all
images are similar in nature (same scale, orientation, etc) simple corner detectors can work. But
when you have images of different scales and rotations, you need to use the Scale Invariant
Feature Transform.
Why care about SIFT
SIFT isn't just scale invariant. You can change the following, and still get good results:
 Scale (duh)
 Rotation
 Illumination
 Viewpoint
Here's an example. We're looking for these:
And we want to find these objects in this scene:
Here's the result:
Now that's some real robust image matching going on. The big rectangles mark matched
images. The smaller squares are for individual features in those regions. Note how the big
rectangles are skewed. They follow the orientation and perspective of the object in the scene.
The algorithm
SIFT is quite an involved algorithm. It has a lot going on and can become confusing, so I've split
up the entire algorithm into multiple parts. Here's an outline of what happens in SIFT.
1. Constructing a scale space This is the initial preparation. You create internal
representations of the original image to ensure scale invariance. This is done by
generating a "scale space".
2. LoG Approximation The Laplacian of Gaussian is great for finding interesting points (or
key points) in an image. But it's computationally expensive. So we cheat and
approximate it using the representation created earlier.
3. Finding keypoints With the super fast approximation, we now try to find key points.
These are maxima and minima in the Difference of Gaussian images we calculate in step 2.
4. Get rid of bad key points Edges and low contrast regions are bad keypoints.
Eliminating these makes the algorithm efficient and robust. A technique similar to the
Harris Corner Detector is used here.
5. Assigning an orientation to the keypoints An orientation is calculated for each key
point. Any further calculations are done relative to this orientation. This effectively
cancels out the effect of orientation, making it rotation invariant.
6. Generate SIFT features Finally, with scale and rotation invariance in place, one more
representation is generated. This helps uniquely identify features. Let's say you have
50,000 features. With this representation, you can easily identify the feature you're
looking for (say, a particular eye, or a sign board). That was an overview of the entire
algorithm. Over the next few days, I'll go through each step in detail. Finally, I'll show you
how to implement SIFT in OpenCV!
What do I do with SIFT features?
After you run through the algorithm, you'll have SIFT features for your image. Once you have
these, you can do whatever you want.
Track images, detect and identify objects (which can be partly hidden as well), or whatever you
can think of. We'll get into this later as well.
But the catch is, this algorithm is patented.
So, it's good enough for academic purposes. But if you're looking to make something
commercial, look for something else! [Thanks to aLu for pointing out SURF is patented too]
SIFT: The scale space
Real world objects are meaningful only at a certain scale. You might see a sugar cube perfectly
well on a table. But if you're looking at the entire Milky Way, it simply does not exist. This multi-scale
nature of objects is quite common in nature. And a scale space attempts to replicate this
concept on digital images.
Scale spaces
Do you want to look at a leaf or the entire tree? If it's a tree, get rid of some detail from the
image (like the leaves, twigs, etc) intentionally.
While getting rid of these details, you must ensure that you do not introduce new false details.
The only way to do that is with the Gaussian Blur (it was proved mathematically, under several
reasonable assumptions).
So to create a scale space, you take the original image and generate progressively blurred out
images. Here's an example:
Look at how the cat's helmet loses detail. So do its whiskers.
Scale spaces in SIFT
SIFT takes scale spaces to the next level. You take the original image, and generate
progressively blurred out images. Then, you resize the original image to half size. And you
generate blurred out images again. And you keep repeating.
Here's what it would look like in SIFT:
Images of the same size (vertical) form an octave. Above are four octaves. Each octave has 5
images. The individual images are formed because of the increasing "scale" (the amount of
blur).
The technical details
Now that you know things the intuitive way, I'll get into a few technical details.
Octaves and Scales
The number of octaves and scales depends on the size of the original image. While programming
SIFT, you'll have to decide for yourself how many octaves and scales you want. However, the
creator of SIFT suggests that 4 octaves and 5 blur levels are ideal for the algorithm.
The first octave
If the original image is doubled in size and antialiased a bit (by blurring it), then the algorithm
produces about four times more keypoints. The more keypoints, the better!
Blurring
Mathematically, "blurring" is referred to as the convolution of the gaussian operator and the
image. Gaussian blur has a particular expression or "operator" that is applied to each pixel.
What results is the blurred image.
The symbols:
 L is a blurred image
 G is the Gaussian Blur operator
 I is an image
 x,y are the location coordinates
 σ is the "scale" parameter. Think of it as the amount of blur. Greater the value, greater
the blur.
 The * is the convolution operation in x and y. It "applies" gaussian blur G onto the image
I.
This is the actual Gaussian Blur operator.
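The equations themselves were images in the original post; for reference, the standard expressions (as in Lowe's paper) are:

L(x, y, σ) = G(x, y, σ) * I(x, y)

G(x, y, σ) = (1 / 2πσ²) · exp(−(x² + y²) / 2σ²)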
Amount of blurring
The amount of blurring in each image is important. It goes like this. Assume the amount of blur
in a particular image is σ. Then, the amount of blur in the next image will be k*σ. Here k is
whatever constant you choose.
This is a table of σ's for my current example. See how each σ differs by a factor of √2 from the
previous one.
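To make this concrete, here's a minimal sketch of building such a pyramid with OpenCV in Python. The constants (4 octaves, 5 scales, k = √2, a base σ of 1.6) follow the text above and Lowe's suggestions; the function and file names are my own.

```python
import cv2
import numpy as np

def build_scale_space(image, num_octaves=4, scales_per_octave=5, sigma0=1.6, k=np.sqrt(2)):
    """Gaussian scale space: a list of octaves, each a list of progressively blurred images."""
    octaves = []
    base = image.astype(np.float32)
    for _ in range(num_octaves):
        octave = []
        for s in range(scales_per_octave):
            sigma = sigma0 * (k ** s)              # blur grows by a factor of k per scale
            octave.append(cv2.GaussianBlur(base, (0, 0), sigmaX=sigma))
        octaves.append(octave)
        # Halve the image for the next octave
        base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2))
    return octaves

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
scale_space = build_scale_space(img)
```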
Summary
In the first step of SIFT, you generate several octaves of the original image. Each octave's
image size is half the previous one. Within an octave, images are progressively blurred using
the Gaussian Blur operator.
In the next step, we'll use all these octaves to generate Difference of Gaussian images.
SIFT: LoG approximations
In the previous step, we created the scale space of the image. The idea was to blur an image
progressively, shrink it, blur the small image progressively and so on. Now we use those blurred
images to generate another set of images, the Difference of Gaussians (DoG). These DoG
images are great for finding interesting key points in the image.
Laplacian of Gaussian
The Laplacian of Gaussian (LoG) operation goes like this. You take an image, and blur it a little.
And then, you calculate second order derivatives on it (or, the "laplacian"). This locates edges
and corners on the image. These edges and corners are good for finding keypoints.
But the second order derivative is extremely sensitive to noise. The blur smoothes out the
noise and stabilizes the second order derivative.
The problem is, calculating all those second order derivatives is computationally intensive. So
we cheat a bit.
The Con
To generate Laplacian of Gaussian images quickly, we use the scale space. We calculate the
difference between two consecutive scales. Or, the Difference of Gaussians. Here's how:
These Difference of Gaussian images are approximately equivalent to the Laplacian of
Gaussian. And we've replaced a computationally intensive process with a simple subtraction
(fast and efficient). Awesome!
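Continuing the sketch from the scale-space section (the function names are mine, not from any library), the subtraction is one line per pair:

```python
def build_dog(octaves):
    """Difference of Gaussians: subtract each image from the next one in the same octave."""
    dog_octaves = []
    for octave in octaves:
        dogs = [octave[s + 1] - octave[s] for s in range(len(octave) - 1)]
        dog_octaves.append(dogs)
    return dog_octaves

dog_space = build_dog(scale_space)   # 5 blurred images per octave -> 4 DoG images per octave
```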
These DoG images come with another little goodie. These approximations are also "scale
invariant". What does that mean?
The Benefits
The plain Laplacian of Gaussian images aren't great. They are not scale invariant. That is, they
depend on the amount of blur you do. This is because of the Gaussian expression. (Don't panic
;) )
See the σ² in the denominator? That's the scale. If we somehow get rid of it, we'll have true
scale independence. So, if the Laplacian of a Gaussian is represented like this:
Then the scale invariant Laplacian of Gaussian would look like this:
But all these complexities are taken care of by the Difference of Gaussian operation. The
resultant images after the DoG operation are already multiplied by σ². Great, eh!
Oh! And it has also been proved that this scale invariant thingy produces much better trackable
points! Even better!
Side effects
You can't have benefits without side effects >.<
You know the DoG result is multiplied by σ². But it's also multiplied by another number. That
number is (k-1). This is the k we discussed in the previous step.
But we'll just be looking for the locations of the maxima and minima in the images. We'll
never check the actual values at those locations. So, this additional factor won't be a problem to
us. (Even if you multiply throughout by some constant, the maxima and minima stay at the same
location.)
Example
Here's a gigantic image to demonstrate how this difference of Gaussians works.
In the image, I've done the subtraction for just one octave. The same thing is done for all
octaves. This generates DoG images of multiple sizes.
Summary
Two consecutive images in an octave are picked and one is subtracted from the other. Then the
next consecutive pair is taken, and the process repeats. This is done for all octaves. The
resulting images are an approximation of scale invariant laplacian of gaussian (which is good for
detecting keypoints). There are a few "drawbacks" due to the approximation, but they won't
affect the algorithm.
Next, we'll actually find some interesting keypoints. Maxima and Minima. Or, Maximums and
Minimums of the image.
SIFT: Finding keypoints
Up till now, we have generated a scale space and used the scale space to calculate the
Difference of Gaussians. Those are then used to calculate Laplacian of Gaussian
approximations that are scale invariant. I told you that they produce great key points. Here's how
it's done!
Finding key points is a two part process
1. Locate maxima/minima in DoG images
2. Find subpixel maxima/minima
Locate maxima/minima in DoG images
The first step is to coarsely locate the maxima and minima. This is simple. You iterate through
each pixel and check all its neighbours. The check is done within the current image, and also
the one above and below it. Something like this:
X marks the current pixel. The green circles mark the neighbours. This way, a total of 26 checks
are made. X is marked as a "key point" if it is the greatest or least of all 26 neighbours.
Usually, a non-maxima or non-minima position won't have to go through all 26 checks. A few
initial checks will usually be sufficient to discard it.
Note that keypoints are not detected in the lowermost and topmost scales. There simply aren't
enough neighbours to do the comparison. So simply skip them!
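As a rough, unoptimized sketch of that 26-neighbour check (assuming the DoG images of one octave are NumPy arrays of the same size):

```python
import numpy as np

def find_extrema(dog_octave):
    """Return (scale, y, x) of pixels that are larger or smaller than all 26 neighbours."""
    keypoints = []
    for s in range(1, len(dog_octave) - 1):            # skip the lowermost and topmost scales
        below, current, above = dog_octave[s - 1], dog_octave[s], dog_octave[s + 1]
        h, w = current.shape
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                value = current[y, x]
                cube = np.stack([below[y-1:y+2, x-1:x+2],
                                 current[y-1:y+2, x-1:x+2],
                                 above[y-1:y+2, x-1:x+2]]).flatten()
                neighbours = np.delete(cube, 13)       # drop the centre pixel itself
                if value > neighbours.max() or value < neighbours.min():
                    keypoints.append((s, y, x))
    return keypoints
```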
Once this is done, the marked points are the approximate maxima and minima. They are
"approximate" because the maxima/minima almost never lies exactly on a pixel. It lies
somewhere between the pixel. But we simply cannot access data "between" pixels. So, we must
mathematically locate the subpixel location.
Here's what I mean:
The red crosses mark pixels in the image. But the actual extreme point is the green one.
Find subpixel maxima/minima
Using the available pixel data, subpixel values are generated. This is done by the Taylor
expansion of the image around the approximate key point.
Mathematically, it's like this:
We can easily find the extreme points of this equation (differentiate and equate to zero). On
solving, we'll get subpixel key point locations. These subpixel values increase the chances of
matching and the stability of the algorithm.
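A minimal sketch of that refinement, treating the DoG images of an octave as a 3D array indexed [scale, y, x] (names are mine): build the gradient and Hessian with finite differences, then solve for the offset.

```python
import numpy as np

def refine_keypoint(dog, s, y, x):
    """Return the subpixel offset (d_scale, d_y, d_x) of the extremum near (s, y, x)."""
    v = dog[s, y, x]
    # First derivatives (central differences)
    grad = 0.5 * np.array([dog[s+1, y, x] - dog[s-1, y, x],
                           dog[s, y+1, x] - dog[s, y-1, x],
                           dog[s, y, x+1] - dog[s, y, x-1]])
    # Second derivatives (Hessian)
    dss = dog[s+1, y, x] + dog[s-1, y, x] - 2 * v
    dyy = dog[s, y+1, x] + dog[s, y-1, x] - 2 * v
    dxx = dog[s, y, x+1] + dog[s, y, x-1] - 2 * v
    dsy = 0.25 * (dog[s+1, y+1, x] - dog[s+1, y-1, x] - dog[s-1, y+1, x] + dog[s-1, y-1, x])
    dsx = 0.25 * (dog[s+1, y, x+1] - dog[s+1, y, x-1] - dog[s-1, y, x+1] + dog[s-1, y, x-1])
    dyx = 0.25 * (dog[s, y+1, x+1] - dog[s, y+1, x-1] - dog[s, y-1, x+1] + dog[s, y-1, x-1])
    hessian = np.array([[dss, dsy, dsx],
                        [dsy, dyy, dyx],
                        [dsx, dyx, dxx]])
    # Differentiate the quadratic fit and equate to zero
    return -np.linalg.solve(hessian, grad)
```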
Example
Here's a result I got from the example image I've been using till now:
The author of SIFT recommends generating two such extrema images. So, you need exactly 4
DoG images. To generate 4 DoG images, you need 5 Gaussian blurred images. Hence the 5
levels of blur in each octave.
In the image, I've shown just one octave. This is done for all octaves. Also, this image just
shows the first part of keypoint detection. The Taylor series part has been skipped.
Summary
Here, we detected the maxima and minima in the DoG images generated in the previous step.
This is done by comparing neighbouring pixels in the current scale, the scale "above" and the
scale "below".
Next, we'll reject some keypoints detected here. This is because they either don't have enough
contrast or they lie on an edge.
SIFT: Getting rid of low contrast keypoints
The previous step produces a lot of key points. Some of them lie along an
edge, or they don't have enough contrast. In both cases, they are not useful as features. So we
get rid of them. The approach is similar to the one used in the Harris Corner Detector for
removing edge features. For low contrast features, we simply check their intensities.
Removing low contrast features
This is simple. If the magnitude of the intensity (i.e., without sign) at the current pixel in the DoG
image (that is being checked for minima/maxima) is less than a certain value, it is rejected.
Because we have subpixel keypoints (we used the Taylor expansion to refine keypoints), we
again need to use the Taylor expansion to get the intensity value at subpixel locations. If its
magnitude is less than a certain value, we reject the keypoint.
Removing edges
The idea is to calculate two gradients at the keypoint, perpendicular to each other. Based
on the image around the keypoint, three possibilities exist. The image around the keypoint can
be:
 A flat region: If this is the case, both gradients will be small.
 An edge: Here, one gradient will be big (perpendicular to the edge) and the other will be
small (along the edge)
 A "corner": Here, both gradients will be big.
Corners are great keypoints. So we want just corners. If both gradients are big enough, we let it
pass as a key point. Otherwise, it is rejected.
Mathematically, this is achieved by the Hessian Matrix. Using this matrix, you can easily check if
a point is a corner or not.
If you're interested in the math, first check the posts on the Harris corner detector. A lot of the
same math is used in SIFT. In the Harris Corner Detector, two eigenvalues are calculated. In
SIFT, efficiency is increased by just calculating the ratio of these two eigenvalues. You never
need to calculate the actual eigenvalues.
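Here's a hedged sketch of both rejection tests on a DoG image. The thresholds (0.03 for contrast, assuming pixel values scaled to [0, 1], and r = 10 for the edge ratio) are the values suggested in Lowe's paper; the 2x2 Hessian is built from finite differences.

```python
def is_good_keypoint(dog_img, y, x, contrast_thresh=0.03, edge_ratio=10.0):
    """Reject low-contrast keypoints and keypoints that lie on an edge."""
    # Contrast test: the DoG magnitude at the keypoint must be large enough
    if abs(dog_img[y, x]) < contrast_thresh:
        return False
    # Edge test: ratio of principal curvatures via the 2x2 Hessian
    dxx = dog_img[y, x+1] + dog_img[y, x-1] - 2 * dog_img[y, x]
    dyy = dog_img[y+1, x] + dog_img[y-1, x] - 2 * dog_img[y, x]
    dxy = 0.25 * (dog_img[y+1, x+1] - dog_img[y+1, x-1] - dog_img[y-1, x+1] + dog_img[y-1, x-1])
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:                                   # curvatures of opposite sign: reject
        return False
    # Only the ratio of the two eigenvalues matters, never their actual values
    return (trace * trace) / det < ((edge_ratio + 1) ** 2) / edge_ratio
```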
Example
Here's a visual example of what happens in this step:
Both extrema images go through the two tests: the contrast test and the edge test. They reject a
few keypoints (sometimes a lot) and thus, we're left with a lower number of keypoints to deal
with.
Summary
In this step, the number of keypoints was reduced. This helps increase the efficiency and also the
robustness of the algorithm. Keypoints are rejected if they have low contrast or if they lie on an edge.
In the next step we'll assign an orientation to all the keypoints that passed both tests.
SIFT: Keypoint orientations
After step 4, we have legitimate key points. They've been tested to be stable. We already know
the scale at which the keypoint was detected (it's the same as the scale of the blurred image).
So we have scale invariance. The next thing is to assign an orientation to each keypoint. This
orientation provides rotation invariance. The more invariance you have the better it is. :P
The idea
The idea is to collect gradient directions and magnitudes around each keypoint. Then we figure
out the most prominent orientation(s) in that region. And we assign those orientation(s) to the
keypoint.
Any later calculations are done relative to this orientation. This ensures rotation invariance.
The size of the "orientation collection region" around the keypoint depends on it's scale. The
bigger the scale, the bigger the collection region.
The details
Now for the little details about collecting orientations.
Gradient magnitudes and orientations are calculated using these formulae:
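Those formulae were images in the original; the standard expressions, using pixel differences of the blurred image L, are:

m(x, y) = sqrt( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )

θ(x, y) = atan2( L(x, y+1) − L(x, y−1), L(x+1, y) − L(x−1, y) )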
The magnitude and orientation are calculated for all pixels around the keypoint. Then, a
histogram is created from these values.
In this histogram, the 360 degrees of orientation are broken into 36 bins (each 10 degrees). Let's
say the gradient direction at a certain point (in the "orientation collection region") is 18.759
degrees, then it will go into the 10-19 degree bin. And the "amount" that is added to the bin is
proportional to the magnitude of gradient at that point.
Once you've done this for all pixels around the keypoint, the histogram will have a peak at some
point.
Above, you see the histogram peaks at 20-29 degrees. So, the keypoint is assigned orientation
3 (the third bin).
Also, any peaks above 80% of the highest peak are converted into a new keypoint. This new
keypoint has the same location and scale as the original. But its orientation is equal to the other
peak.
So, orientation can split up one keypoint into multiple keypoints.
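A simplified sketch of that 36-bin voting (the window size and the Gaussian weighting described below are simplified away; the names are mine):

```python
import numpy as np

def keypoint_orientations(blurred, y, x, radius=8, num_bins=36, peak_ratio=0.8):
    """Return the dominant gradient orientation(s), in degrees, around (y, x)."""
    hist = np.zeros(num_bins)
    h, w = blurred.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < h - 1 and 0 < xx < w - 1):
                continue
            gx = blurred[yy, xx + 1] - blurred[yy, xx - 1]
            gy = blurred[yy + 1, xx] - blurred[yy - 1, xx]
            angle = np.degrees(np.arctan2(gy, gx)) % 360
            hist[int(angle // (360 / num_bins))] += np.sqrt(gx * gx + gy * gy)  # magnitude-weighted vote
    # Every peak above 80% of the highest peak becomes its own orientation
    bin_width = 360 / num_bins
    return [i * bin_width for i, v in enumerate(hist) if v >= peak_ratio * hist.max()]
```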
The Technical Details
Magnitudes
Saw the gradient magnitude image above? In SIFT, you need to blur it by an amount of
1.5*sigma.
Size of the window
The window size, or the "orientation collection region", is equal to the size of the kernel for
Gaussian Blur of amount 1.5*sigma.
Summary
To assign an orientation, we use a histogram of gradients in a small region around the keypoint. Using the histogram,
the most prominent gradient orientation(s) are identified. If there is only one peak, it is assigned
to the keypoint. If there are multiple peaks above the 80% mark, they are all converted into
new keypoints (with their respective orientations).
Next, we generate a highly distinctive "fingerprint" for each keypoint. Here's a little teaser. This
fingerprint, or "feature vector", has 128 different numbers.
SIFT: Generating a feature
Now for the final step of SIFT. Till now, we had scale and rotation invariance. Now we create a
fingerprint for each keypoint. This is to identify a keypoint. If an eye is a keypoint, then using this
fingerprint, we'll be able to distinguish it from other keypoints, like ears, noses, fingers, etc.
The idea
We want to generate a very unique fingerprint for the keypoint. It should be easy to calculate.
We also want it to be relatively lenient when it is being compared against other keypoints.
Things are never EXACTLY the same when comparing two different images.
To do this, we take a 16x16 window around the keypoint. This 16x16 window is broken into sixteen 4x4
windows.
Within each 4x4 window, gradient magnitudes and orientations are calculated. These
orientations are put into an 8 bin histogram.
Any gradient orientation in the range 0-44 degrees adds to the first bin. 45-89 degrees adds to the next bin.
And so on. And (as always) the amount added to the bin depends on the magnitude of the
gradient.
Unlike before, the amount added also depends on the distance from the keypoint. So
gradients that are far away from the keypoint will add smaller values to the histogram.
This is done using a "gaussian weighting function". This function simply generates a gradient
(it's like a 2D bell curve). You multiple it with the magnitude of orientations, and you get a
weighted thingy. The farther away, the lesser the magnutide.
Doing this for all 16 pixels, you would've "compiled" 16 totally random orientations into 8
predetermined bins. You do this for all sixteen 4x4 regions. So you end up with 4x4x8 = 128
numbers. Once you have all 128 numbers, you normalize them (just like you would normalize a
vector in school, divide by root of sum of squares). These 128 numbers form the "feature
vector". This keypoint is uniquely identified by this feature vector.
You might have seen that in the pictures above, the keypoint lies "in between". It does not lie
exactly on a pixel. That's because it does not. The 16x16 window takes orientations and
magnitudes of the image "in-between" pixels. So you need to interpolate the image to generate
orientation and magnitude data "in between" pixels.
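Putting those pieces together, here's a simplified sketch of the 128-number descriptor. The Gaussian weighting, the in-between-pixel interpolation, and the orientation subtraction described below are omitted to keep it short; it assumes the keypoint is at least 9 pixels from the image border.

```python
import numpy as np

def describe_keypoint(blurred, y, x):
    """Build a simplified 128-element descriptor for the keypoint at (y, x)."""
    descriptor = []
    for cell_y in range(4):                        # sixteen 4x4 cells in a 16x16 window
        for cell_x in range(4):
            hist = np.zeros(8)                     # 8 orientation bins of 45 degrees each
            for dy in range(4):
                for dx in range(4):
                    yy = y - 8 + cell_y * 4 + dy
                    xx = x - 8 + cell_x * 4 + dx
                    gx = blurred[yy, xx + 1] - blurred[yy, xx - 1]
                    gy = blurred[yy + 1, xx] - blurred[yy - 1, xx]
                    angle = np.degrees(np.arctan2(gy, gx)) % 360
                    hist[int(angle // 45)] += np.sqrt(gx * gx + gy * gy)
            descriptor.extend(hist)
    descriptor = np.array(descriptor)              # 4 x 4 x 8 = 128 numbers
    return descriptor / (np.linalg.norm(descriptor) + 1e-7)   # divide by root of sum of squares
```

The two fixes described under "Problems" below (subtracting the keypoint's orientation from each angle, and clipping the normalized values at 0.2 before renormalizing) would be applied on top of this.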
Problems
This feature vector introduces a few complications. We need to get rid of them before finalizing
the fingerprint.
1. Rotation dependence The feature vector uses gradient orientations. Clearly, if you
rotate the image, everything changes. All gradient orientations also change. To achieve
rotation independence, the keypoint's rotation is subtracted from each orientation. Thus
each gradient orientation is relative to the keypoint's orientation.
2. Illumination dependence If we threshold numbers that are big, we can achieve
illumination independence. So, any number (of the 128) greater than 0.2 is changed to
0.2. This resultant feature vector is normalized again. And now you have an illumination
independent feature vector!
Summary
You take a 16x16 window of "in-between" pixels around the keypoint. You split that window into
sixteen 4x4 windows. From each 4x4 window you generate a histogram of 8 bins. Each bin
corresponds to 0-44 degrees, 45-89 degrees, etc. Gradient orientations from the 4x4 window are put
into these bins. This is done for all 4x4 blocks. Finally, you normalize the 128 values you get.
To solve a few problems, you subtract the keypoint's orientation and also threshold the value of
each element of the feature vector to 0.2 (and normalize again).
The End!
Once you have the features, you go play with them! I'll get to that in a later post (or posts :P).
Read up on how the Hough transform works. It will be used a lot.
Harris Corner Detector
The Harris Corner Detector is a mathematical operator that finds features (what are
features?) in an image. It is simple and fast to compute. It is also popular because it is
invariant to rotation and to illumination variation. However, the Shi-Tomasi corner detector,
the one implemented in OpenCV, is an improvement of this corner detector.
The mathematics
To define the Harris corner detector, we have to go into a bit of math. We'll get into a bit
of calculus, some matrix math, but trust me, it won't be tough. I'll make everything easy
to understand!
Our aim is to find little patches of image (or "windows") that generate a large variation
when moved around. Have a look at this image:
Marked areas have a lot of variation
The red square is the window we've chosen. Moving it around doesn't show much
variation. That is, the difference between the window and the original image below it is
very low. So you can't really tell if the window "belongs" to that position.
Of course, if you move the window too much, like onto the reddish region, you're bound
to see a big difference. But we've moved the window too much. Not good.
Now have a look at this:
Regions with extremely high variation
See? Even the little movement of the window produces a noticeable difference. This is
the kind of window we're looking for. Here's how it translates mathematically:
The equation
 E is the difference between the original and the moved window.
 u is the window's displacement in the x direction
 v is the window's displacement in the y direction
 w(x, y) is the window at position (x, y). This acts like a mask, ensuring that only
the desired window is used.
 I is the intensity of the image at a position (x, y)
 I(x+u, y+v) is the intensity of the moved window
 I(x, y) is the intensity of the original
We're looking for windows that produce a large E value. To do that, we need high
values of the term inside the square brackets.
(Note: There's a little error in these equations. Can you figure it out? Answer below!)
So, we maximize this term:
Then, we expand this term using the Taylor series. What's that? It's just a way of
rewriting the term using its derivatives.
See how the I(x+u, y+v) changed into a totally different form ( I(x,y)+uIx + vIy )? That's the
Taylor series in action. And because the Taylor series is infinite, we've ignored all terms
after the first three. It gives a pretty good approximation. But it isn't the actual value.
Next, we expand the square. The I(x,y) cancels out, so it's just two terms we need to
square. It looks like this:
Now this messy equation can be tucked up into a neat little matrix form like this:
See how the entire equation gets converted into a neat little matrix!
(The error: There's no w(x, y) in these equations :P )
Now, we rename the summed matrix and call it M:
So the equation now becomes:
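(The images with these equations are missing here; the standard forms are M = Σ w(x, y) · [[Ix², IxIy], [IxIy, Iy²]], so that E(u, v) ≈ [u v] M [u v]ᵀ.)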
Looks so neat after all the clutter we did above.
Interesting windows
It was figured out that eigenvalues of the matrix can help determine the suitability of a
window. A score, R, is calculated for each window:
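(That formula was also an image; for reference, Harris's score is R = det(M) − k · (trace M)² = λ1·λ2 − k·(λ1 + λ2)², with k typically around 0.04 to 0.06.)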
All windows that have a score R greater than a certain value are corners. They are good
tracking points.
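OpenCV wraps all of this into a single call; a quick sketch (the filename and the 0.01 threshold fraction are arbitrary choices for illustration):

```python
import cv2
import numpy as np

gray = np.float32(cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE))

# blockSize = window size, ksize = Sobel aperture, k = Harris constant
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep every pixel whose score R is above a fraction of the strongest response
corners = response > 0.01 * response.max()
print("corner pixels:", np.count_nonzero(corners))
```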
Summary
The Harris Corner Detector is just a mathematical way of determining which windows
produce large variations when moved in any direction. With each window, a score R is
associated. Based on this score, you can figure out which ones are corners and which
ones are not.
OpenCV implements an improved version of this corner detector. It is called the Shi-
Tomasi corner detector.
Features: What are they?
Several computer vision tasks require finding matching points across several frames or
views. With that info, you could really do a lot of stuff. An example. When doing stereo
imaging, you want to know a few corresponding points between the two views. Once
you do, you can triangulate almost all points on the image (just like the brain does!).
The first approach: Patches
Intuitively, you'd be tempted to match small "patches" between the two images.
Something like this:
You want to find the left image's green patch in the right one. And you can do that quite
easily. The white thingy makes the patch quite unique. Even something as trivial
as template matching would be able to find it.
But, not all patches are so uniquely recognizable. Check the patches below:
There are no unique "features" to identify on the wall. So you'll have problems finding
corresponding points. So the patches approach isn't that great.
Corners
Corners in an image seem to be perfect for such tracking tasks. Here's an example
image:
These corners are perfect! Why?
Uniquely identifiable
These points are uniquely identifiable. What do I mean by that? Here's what. Let's say
you're trying to find the green corner in the right image (in the image below). You know
that it'll be somewhere around the same location. So you can narrow down the "search
region". And within this search region, there would be only one point that resembles the
corner.
Of course, the assumption here is that there isn't a massive difference between the two
images. And this is usually a reasonable assumption.
Stable
These points usually don't keep moving around in the image. This helps tracking. And
any motion of this point, even a little one, produces a large variation. You can clearly
see the point moving around.
A bad feature
I'll try to make the idea of a "corner" more concrete. We'll use some math to do this.
How do you identify a bad feature? Something that doesn't have a lot of variation... like
the example of the wall above.
There are no edges or corners in the feature. So, the first derivative is flat, in both
directions, x and y. Or, the first derivative is flat in all directions (all directions are a
certain combination of the x and y component).
So, the second derivative also does not change in any direction.
An edge
An edge is a bad feature as well. If you move in the direction of the edge, you won't
even know you're moving. For example, if you move along the edge at the top of the
building and the sky, you won't even realize it.
But if you move from the building to the sky (perpendicular to the edge) you'll "see"
motion. This is the only direction where you can accurately tell how fast the object is
moving.
So edges aren't that useful as features.
The first derivative changes in only the one direction (perpendicular to the edge). Some
progress, but not that good. So, the second derivative also changes in only one
direction.
A corner
A corner is an awesome feature! There's variation all around a corner. So, the derivative
changes in all directions. So the second derivative also changes in all directions! Great!
And you can write pretty efficient programs to calculate that at all points.
So there you have it! Identifying good features: If the first derivative keeps changing
around a point, you know you have a corner. And you also know you have a good
feature to track!
So what exactly is a feature?
By now, you hopefully have an idea of what a feature is. At least intuitively. If not, go
read the post again... because there is no formal definition of a feature till now. :P
Convolutions
Convolution is a technique for general signal processing. People studying
electrical/electronics will tell you about the near infinite sleepless nights these convolutions
have given them. Entire books have been written on this topic. And the questions and
theorems that need to be proved seem insurmountable. But for computer vision, we'll just
deal with some simple things.
The Kernel
A convolution lets you do many things, like calculate derivatives, detect edges, apply
blurs, etc. A very wide variety of things. And all of this is done with a "convolution
kernel".
The convolution kernel is a small matrix. This matrix has numbers in each cell and has
an anchor point:
This kernel slides over an image and does its thing. The "anchor" point is used to
determine the position of the kernel with respect to the image.
The transformation
The anchor point starts at the top-left corner of the image and moves over each pixel
sequentially. At each position, the kernel overlaps a few pixels on the image. Each
overlapping pair of numbers is multiplied and added. Finally, the value at the current
position is set to this sum.
Here's an example:
The matrix on the left is the image and the one on the right is the kernel. Suppose the
kernel is at the highlighted position. So the '9' of the kernel overlaps with the '4' of the
image. So you calculate their product: 36. Next, '3' of the kernel overlaps the '3' of the
image. So you multiply: 9. Then you add it to 36. So you get a sum of 36+9=45.
Similarly, you do this for all the remaining 7 overlapping values. You'll get a total sum. This
sum is stored in place of the '2' (in the image).
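Here's a small sketch of that slide-multiply-sum process (ignoring the border problem discussed below), together with OpenCV's one-liner that does the same job:

```python
import cv2
import numpy as np

def convolve(image, kernel):
    """Naive sliding-window convolution: multiply overlapping values and sum them."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1), dtype=np.float32)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

image = np.random.rand(100, 100).astype(np.float32)
kernel = np.ones((3, 3), dtype=np.float32) / 9.0       # a simple averaging kernel
result = convolve(image, kernel)
same_idea = cv2.filter2D(image, -1, kernel)            # OpenCV also handles the borders for you
```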
Speed optimizations
The most direct way to compute a convolution would be to use multiple for loops. But
that causes a lot of repeated calculations. And as the size of the image and kernel
increases, the time to compute the convolution increases too (quite drastically).
Techniques have been developed to calculate convolutions rapidly. One such
technique is using the Discrete Fourier Transform. It converts the entire convolution
operation into a simple multiplication. Fortunately, you don't need to know the math to
do this in OpenCV. It automatically decides whether to do it in frequency domain (after
the DFT) or not.
Problematic corners and edges
The kernel is two dimensional. So you have problems when the kernel is near the edges
or corners. Here's an example: If the kernel (in the above example) is on the top right
position, the '0' of the kernel will be over the '3' in the image. But the '1' will be outside
the image. So we have no idea what to do with it. Two things are possible:
 Ignore those values, or
 Do something about the edges
Usually people choose to do something about it. They create extra pixels near the edges. There are a few ways to create extra pixels:
 Set a constant value for these pixels
 Duplicate edge pixels
 Reflect edges (like a mirror effect)
 Wrap the image around (copy pixels from the other end)
This usually fixes the problems that might arise.
Summary
You learned a powerful technique that can be used for a lot of different purposes. We'll
see a few of those next.
Image convolution examples
A convolution is very useful for signal processing in general. There is a lot of complex
mathematical theory available for convolutions. For digital image processing, you don't
have to understand all of that. You can use a simple matrix as an image convolution
kernel and do some interesting things!
Simple box blur
Here's the first and simplest one. This convolution kernel has an averaging effect. So you end
up with a slight blur. The image convolution kernel is:
Note that the sum of all elements of this matrix is 1.0. This is important. If the sum is not
exactly one, the resultant image will be brighter or darker.
Here's a blur that I got on an image:
A simple blur done with convolutions
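For reference, a 3x3 box blur with cv2.filter2D looks like this (the filename is a placeholder):

```python
import cv2
import numpy as np

kernel = np.ones((3, 3), np.float32) / 9.0     # nine equal weights that sum to exactly 1.0
img = cv2.imread("house.jpg")
blurred = cv2.filter2D(img, -1, kernel)
cv2.imwrite("house_blurred.jpg", blurred)
```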
Gaussian blur
Gaussian blur has certain mathematical properties that make it important for computer
vision. And you can approximate it with an image convolution. The image convolution
kernel for a Gaussian blur is:
Here's a result that I got:
Line detection with image convolutions
With image convolutions, you can easily detect lines. Here are four convolution kernels to
detect horizontal, vertical, and 45-degree lines:
I looked for horizontal lines on the house image.
The result I got for this image convolution was:
Edge detection
The above kernels are, in a way, edge detectors. The only thing is that they have separate
components for horizontal and vertical lines. A way to "combine" the results is to merge
the convolution kernels. The new image convolution kernel looks like this:
Below is the result I got with edge detection:
The Sobel Edge Operator
The above operators are very prone to noise. The Sobel edge operators have a
smoothing effect, so they're less affected by noise. Again, there's a horizontal
component and a vertical component.
On applying this image convolution, the result was:
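In OpenCV you would normally call the built-in Sobel function instead of writing the kernels by hand; a quick sketch:

```python
import cv2
import numpy as np

gray = cv2.imread("house.jpg", cv2.IMREAD_GRAYSCALE)
sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal component
sobel_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # vertical component
magnitude = np.sqrt(sobel_x**2 + sobel_y**2)           # combine the two components
```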
The laplacian operator
The laplacian is the second derivative of the image. It is extremely sensitive to noise, so
it isn't used as much as other operators. Unless, of course, you have specific
requirements.
Here's the result with the convolution kernel without diagonals:
The Laplacian of Gaussian
The laplacian alone has the disadvantage of being extremely sensitive to noise. So,
smoothing the image before a laplacian improves the results we get. This is done with a
5x5 image convolution kernel.
The result on applying this image convolution was:
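In code, "smooth first, then take the Laplacian" is just two calls (a sketch; the 5x5 size matches the kernel mentioned above):

```python
import cv2

gray = cv2.imread("house.jpg", cv2.IMREAD_GRAYSCALE)
smoothed = cv2.GaussianBlur(gray, (5, 5), 0)      # suppress noise first
log = cv2.Laplacian(smoothed, cv2.CV_64F)         # then take the second derivative
```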
Summary
You got to know about some important operations that can be approximated using an
image convolution. You learned the exact convolution kernels used and also saw an
example of how each operator modifies an image. I hope this helped!
Optical flow or optic flow is the pattern of apparent motion of
objects, surfaces, and edges in a visual scene caused by the
relative motion between an observer (an eye or a camera) and
the scene.
Meanshift
The intuition behind the meanshift is simple. Consider you have a set of
points. (It can be a pixel distribution like histogram backprojection). You are
given a small window (maybe a circle) and you have to move that window to
the area of maximum pixel density (or maximum number of points). It is
illustrated in the simple image given below:
The initial window is shown as the blue circle named "C1". Its original
center is marked by the blue rectangle named "C1_o". But if you find the centroid
of the points inside that window, you will get the point "C1_r" (marked by the small
blue circle), which is the real centroid of the window. Surely they don't match. So
move your window such that the center of the new window coincides with the previous
centroid. Again find the new centroid. Most probably, it won't match. So move
it again, and continue the iterations until the center of the window and its
centroid fall on the same location (or within a small desired error). What you finally
obtain is a window with maximum pixel distribution. It is marked with the
green circle, named "C2". As you can see in the image, it has the maximum number
of points. The whole process is demonstrated on a static image below:
So we normally pass the histogram backprojected image and the initial target
location. When the object moves, the movement is obviously reflected in the
histogram backprojected image. As a result, the meanshift algorithm moves our
window to the new location with maximum density.
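OpenCV does the whole iteration in a single call. Here's a hedged sketch of tracking a region this way; the video filename, the initial window, and the bin counts are placeholders:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")
ok, frame = cap.read()

# Initial target window (x, y, w, h) and its hue histogram
track_window = (200, 150, 80, 80)
x, y, w, h = track_window
roi = frame[y:y+h, x:x+w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 iterations or when the centre moves by less than 1 pixel
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    iterations, track_window = cv2.meanShift(back_proj, track_window, term_crit)
```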
The Shi-Tomasi Corner Detector
The Shi-Tomasi corner detector is based entirely on the Harris corner detector.
However, one slight variation in a "selection criteria" made this detector much better
than the original. It works quite well where even the Harris corner detector fails. So
here's the minor change that Shi and Tomasi did to the original Harris corner detector.
The change
The Harris corner detector has a corner selection criterion. A score is calculated for each
pixel, and if the score is above a certain value, the pixel is marked as a corner. The
score is calculated using the two eigenvalues. That is, you give the two eigenvalues to a
function. The function manipulates them, and gives back a score.
You can read more about how interesting windows in the Harris corner detector are
selected.
Shi and Tomasi suggested that the function should be done away with. Only the
eigenvalues should be used to check whether the pixel is a corner or not.
The score for Harris corner detector was calculated like this (R is the score):
For Shi-Tomasi, it's calculated like this:
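(Both formulas were images in the original. For reference: Harris scores a window with R = λ1·λ2 − k·(λ1 + λ2)², while Shi and Tomasi simply use R = min(λ1, λ2).)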
In their paper, Shi and Tomasi demonstrated experimentally that this score criterion was
much better. If R is greater than a certain predefined value, the pixel can be marked as a
corner. Thus, the acceptance region for a point to be a corner looks something like this:
 Green: both λ1 and λ2 are greater than a certain value. Thus, this region is for
pixels "accepted" as corners.
 In the blue and gray regions, either λ1 or λ2 is less than the required minimum.
 In the red region, both λ1 and λ2 are less than the required minimum.
Compare the above with a similar graph for the Harris corner detector... You'll see that the blue and gray areas are equivalent to the "edge" areas, the red region corresponds to the "flat" areas, and the green region is for corners.
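In OpenCV this criterion is exposed as goodFeaturesToTrack; a quick sketch (the parameter values and filename are illustrative):

```python
import cv2

gray = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)

# maxCorners, qualityLevel (fraction of the best score), minimum distance between corners
corners = cv2.goodFeaturesToTrack(gray, maxCorners=100, qualityLevel=0.01, minDistance=10)
print(corners.reshape(-1, 2))      # each row is an (x, y) corner location
```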
Summary
The Shi-Tomasi corner detector is a complete ripoff of the Harris corner detector, except
for a minor change they did :P However, it is much better than the original corner
detector, so people use it a lot more. Also, OpenCV implements the Shi-Tomasi corner
detection algorithm.
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 

While getting rid of these details, you must ensure that you do not introduce new, false details. The only way to do that is with the Gaussian blur (this has been proved mathematically, under several reasonable assumptions). So to create a scale space, you take the original image and generate progressively blurred-out images. Here's an example:
Look at how the cat's helmet loses detail. So do its whiskers.

Scale spaces in SIFT
SIFT takes scale spaces to the next level. You take the original image and generate progressively blurred-out images. Then you resize the original image to half its size and generate blurred-out images again. And you keep repeating. Here's what it would look like in SIFT:
Images of the same size (stacked vertically) form an octave. Above are four octaves, and each octave has 5 images. The individual images are formed by increasing the "scale" (the amount of blur).

The technical details
Now that you know things the intuitive way, I'll get into a few technical details.

Octaves and Scales
The number of octaves and scales depends on the size of the original image. While programming SIFT, you'll have to decide for yourself how many octaves and scales you want. However, the creator of SIFT suggests that 4 octaves and 5 blur levels are ideal for the algorithm.

The first octave
If the original image is doubled in size and antialiased a bit (by blurring it), then the algorithm produces about four times more keypoints. The more keypoints, the better!

Blurring
Mathematically, "blurring" refers to the convolution of the Gaussian operator and the image. Gaussian blur has a particular expression or "operator" that is applied to each pixel; what results is the blurred image:

L(x, y, σ) = G(x, y, σ) * I(x, y)

The symbols:
 L is the blurred image
 G is the Gaussian blur operator
 I is the image
 x, y are the location coordinates
 σ is the "scale" parameter. Think of it as the amount of blur. The greater the value, the greater the blur.
 The * is the convolution operation in x and y. It "applies" the Gaussian blur G onto the image I.

The Gaussian operator itself is the standard 2D Gaussian:

G(x, y, σ) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))

Amount of blurring
The amount of blurring in each image is important. It goes like this: assume the amount of blur in a particular image is σ. Then the amount of blur in the next image will be k·σ, where k is whatever constant you choose.
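To make the octave and scale bookkeeping concrete, here is a rough sketch (not the tutorial's own code) of how such a pyramid could be built with OpenCV in Python. The 4 octaves, 5 scales and k = sqrt(2) follow the suggestions above; the starting sigma and the rest are illustrative only, and real SIFT implementations blur incrementally and downsample a specific scale image rather than re-blurring from scratch:

    import cv2
    import numpy as np

    def build_gaussian_pyramid(img, n_octaves=4, n_scales=5, sigma0=1.6, k=np.sqrt(2)):
        """Build a list of octaves; each octave is a list of progressively blurred images."""
        octaves = []
        base = img
        for _ in range(n_octaves):
            octave = []
            for i in range(n_scales):
                sigma = sigma0 * (k ** i)          # blur grows by a factor of k per scale
                octave.append(cv2.GaussianBlur(base, (0, 0), sigmaX=sigma))
            octaves.append(octave)
            # The next octave starts from a half-sized version of the image
            base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2),
                              interpolation=cv2.INTER_NEAREST)
        return octaves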
This is a table of σ's for my current example. See how each σ differs from the previous one by a factor of sqrt(2).

Summary
In the first step of SIFT, you generate several octaves of the original image. Each octave's image size is half the previous one. Within an octave, images are progressively blurred using the Gaussian blur operator. In the next step, we'll use all these octaves to generate Difference of Gaussian images.

SIFT: LoG approximations
In the previous step, we created the scale space of the image. The idea was to blur an image progressively, shrink it, blur the small image progressively, and so on. Now we use those blurred images to generate another set of images, the Difference of Gaussians (DoG). These DoG images are great for finding interesting key points in the image.

Laplacian of Gaussian
The Laplacian of Gaussian (LoG) operation goes like this: you take an image and blur it a little. Then you calculate second-order derivatives on it (the "Laplacian"). This locates edges and corners in the image, and those edges and corners are good for finding keypoints. The second-order derivative on its own is extremely sensitive to noise; the blur smooths out the noise and stabilizes it. The problem is that calculating all those second-order derivatives is computationally intensive. So we cheat a bit.

The Con
To generate Laplacian of Gaussian images quickly, we use the scale space. We calculate the difference between two consecutive scales. Or, the Difference of Gaussians. Here's how:
These Difference of Gaussian images are approximately equivalent to the Laplacian of Gaussian. And we've replaced a computationally intensive process with a simple subtraction (fast and efficient). Awesome!

These DoG images come with another little goodie: the approximations are also "scale invariant". What does that mean?

The Benefits
The Laplacian of Gaussian images on their own aren't great. They are not scale invariant; that is, they depend on the amount of blur you apply. This is because of the Gaussian expression (don't panic ;) ). See the σ² in the denominator of the Gaussian? That's the scale. If we somehow get rid of it, we'll have true scale independence. So, if the Laplacian of a Gaussian is written as ∇²G, then the scale-invariant Laplacian of Gaussian is the scale-normalized version, σ²∇²G. But all these complexities are taken care of by the Difference of Gaussian operation: the resultant images after the DoG operation are already multiplied by the σ². Great, eh!
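Before moving on, here is a matching sketch of the subtraction step, assuming octaves is the output of the pyramid sketch above (illustrative only):

    import numpy as np

    def build_dog_pyramid(octaves):
        """octaves: list of octaves, each a list of progressively blurred images
        (e.g. from the build_gaussian_pyramid sketch above)."""
        dog_octaves = []
        for octave in octaves:
            # Convert to float so that negative differences (the minima) are preserved
            dogs = [octave[i + 1].astype(np.float32) - octave[i].astype(np.float32)
                    for i in range(len(octave) - 1)]
            dog_octaves.append(dogs)
        return dog_octaves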
Oh! And it has also been proved that this scale-invariant version produces much better trackable points! Even better!

Side effects
You can't have benefits without side effects >.< You know the DoG result is multiplied by σ². But it's also multiplied by another number: (k−1). This is the k we discussed in the previous step. But we'll only be looking for the locations of the maxima and minima in the images; we'll never check the actual values at those locations. So this additional factor won't be a problem for us. (Even if you multiply throughout by some constant, the maxima and minima stay at the same locations.)

Example
Here's a gigantic image to demonstrate how this Difference of Gaussians works.
[Image: the Difference of Gaussians computed for one octave of the example image]
In the image, I've done the subtraction for just one octave. The same thing is done for all octaves. This generates DoG images of multiple sizes.

Summary
Two consecutive images in an octave are picked and one is subtracted from the other. Then the next consecutive pair is taken, and the process repeats. This is done for all octaves. The resulting images are an approximation of the scale-invariant Laplacian of Gaussian (which is good for detecting keypoints). There are a few "drawbacks" due to the approximation, but they won't affect the algorithm. Next, we'll actually find some interesting keypoints: the maxima and minima of the image.

SIFT: Finding keypoints
Up till now, we have generated a scale space and used it to calculate the Difference of Gaussians, which in turn approximates the scale-invariant Laplacian of Gaussian. I told you that they produce great key points. Here's how it's done!

Finding key points is a two-part process:
1. Locate maxima/minima in the DoG images
2. Find subpixel maxima/minima

Locate maxima/minima in DoG images
The first step is to coarsely locate the maxima and minima. This is simple. You iterate through each pixel and check all its neighbours. The check is done within the current image, and also in the one above and below it. Something like this:
X marks the current pixel. The green circles mark the neighbours. This way, a total of 26 checks are made. X is marked as a "key point" if it is the greatest or least of all 26 neighbours. Usually, a non-maximum or non-minimum position won't have to go through all 26 checks; a few initial checks are usually sufficient to discard it.

Note that keypoints are not detected in the lowermost and topmost scales. There simply aren't enough neighbours to do the comparison. So simply skip them!

Once this is done, the marked points are the approximate maxima and minima. They are "approximate" because the maxima/minima almost never lie exactly on a pixel; they lie somewhere between pixels. But we simply cannot access data "between" pixels, so we must mathematically locate the subpixel location. Here's what I mean: the red crosses mark pixels in the image, but the actual extreme point is the green one.

Find subpixel maxima/minima
Using the available pixel data, subpixel values are generated. This is done by taking the Taylor expansion of the DoG image D around the approximate key point x = (x, y, σ):

D(x) = D + (∂D/∂x)ᵀ x + ½ xᵀ (∂²D/∂x²) x

We can easily find the extreme points of this equation (differentiate and equate to zero). On solving, we get subpixel key point locations. These subpixel values increase the chances of matching and the stability of the algorithm.

Example
Here's a result I got from the example image I've been using till now:
[Image: extrema detected between the DoG images for one octave of the example image]
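For reference, here is a rough sketch in code of the 26-neighbour check described above. It assumes dogs is one octave's list of DoG images (for example, from the build_dog_pyramid sketch) and skips the Taylor-series refinement; a real implementation would vectorize this:

    import numpy as np

    def find_extrema(dogs):
        """dogs: list of DoG images (float arrays) for one octave, ordered by scale.
        Returns (scale_index, row, col) tuples for coarse maxima/minima."""
        keypoints = []
        for s in range(1, len(dogs) - 1):              # skip the topmost and lowermost scales
            below, current, above = dogs[s - 1], dogs[s], dogs[s + 1]
            rows, cols = current.shape
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    # 3x3x3 neighbourhood: 26 neighbours plus the pixel itself
                    patch = np.stack([below[r-1:r+2, c-1:c+2],
                                      current[r-1:r+2, c-1:c+2],
                                      above[r-1:r+2, c-1:c+2]])
                    v = current[r, c]
                    if v == patch.max() or v == patch.min():
                        keypoints.append((s, r, c))
        return keypoints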
The author of SIFT recommends generating two such extrema images. So you need exactly 4 DoG images, and to generate 4 DoG images you need 5 Gaussian-blurred images. Hence the 5 levels of blur in each octave.

In the image, I've shown just one octave; this is done for all octaves. Also, this image shows only the first part of keypoint detection; the Taylor series part has been skipped.

Summary
Here, we detected the maxima and minima in the DoG images generated in the previous step. This is done by comparing neighbouring pixels in the current scale, the scale "above" and the scale "below". Next, we'll reject some of the keypoints detected here, because they either don't have enough contrast or they lie on an edge.

SIFT: Getting rid of low contrast keypoints
The previous step produces a lot of key points. Some of them lie along an edge, or don't have enough contrast. In both cases, they are not useful as features, so we get rid of them. The approach is similar to the one used in the Harris Corner Detector for removing edge features. For low-contrast features, we simply check their intensities.

Removing low contrast features
This is simple. If the magnitude of the intensity (i.e., without sign) at the current pixel in the DoG image (the one being checked for minima/maxima) is less than a certain value, it is rejected. Because we have subpixel keypoints (we used the Taylor expansion to refine them), we again need to use the Taylor expansion to get the intensity value at the subpixel location. If its magnitude is less than a certain value, we reject the keypoint.

Removing edges
The idea is to calculate two gradients at the keypoint, perpendicular to each other. Based on the image around the keypoint, three possibilities exist. The region around the keypoint can be:
 A flat region: both gradients will be small.
 An edge: one gradient will be big (perpendicular to the edge) and the other will be small (along the edge).
 A "corner": both gradients will be big.
Corners are great keypoints, so we want just corners. If both gradients are big enough, we let the point pass as a key point; otherwise, it is rejected.

Mathematically, this is achieved with the Hessian matrix. Using this matrix, you can easily check whether a point is a corner or not. If you're interested in the math, first check the posts on the Harris corner detector; a lot of the same math is used in SIFT. In the Harris Corner Detector, two eigenvalues are calculated. In SIFT, efficiency is increased by calculating just the ratio of these two eigenvalues; you never need to calculate the actual eigenvalues.

Example
Here's a visual example of what happens in this step: both extrema images go through the two tests, the contrast test and the edge test. They reject a few keypoints (sometimes a lot) and thus we're left with a lower number of keypoints to deal with.

Summary
In this step, the number of keypoints was reduced. This helps increase the efficiency and also the robustness of the algorithm. Keypoints are rejected if they have low contrast or if they lie on an edge.
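A rough sketch of both rejection tests on a single DoG image, skipping the subpixel refinement. The thresholds used here (0.03 for contrast, with image values scaled to [0, 1], and an edge ratio of 10) are the values suggested in Lowe's paper; treat them as tunable:

    def keep_keypoint(dog, r, c, contrast_thresh=0.03, edge_ratio=10.0):
        """Sketch of the two rejection tests for a candidate at (r, c) in a DoG image."""
        # Contrast test: reject weak responses
        if abs(dog[r, c]) < contrast_thresh:
            return False
        # Edge test: 2x2 Hessian from finite differences
        dxx = dog[r, c + 1] + dog[r, c - 1] - 2.0 * dog[r, c]
        dyy = dog[r + 1, c] + dog[r - 1, c] - 2.0 * dog[r, c]
        dxy = (dog[r + 1, c + 1] - dog[r + 1, c - 1]
               - dog[r - 1, c + 1] + dog[r - 1, c - 1]) / 4.0
        tr = dxx + dyy
        det = dxx * dyy - dxy * dxy
        if det <= 0:                      # curvatures of opposite sign: not a corner
            return False
        # Ratio-of-eigenvalues test, without ever computing the eigenvalues explicitly
        return (tr * tr) / det < ((edge_ratio + 1) ** 2) / edge_ratio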
In the next step we'll assign an orientation to all the keypoints that passed both tests.

SIFT: Keypoint orientations
After step 4, we have legitimate key points. They've been tested to be stable. We already know the scale at which each keypoint was detected (it's the same as the scale of the blurred image), so we have scale invariance. The next thing is to assign an orientation to each keypoint. This orientation provides rotation invariance. The more invariance you have, the better. :P

The idea
The idea is to collect gradient directions and magnitudes around each keypoint, figure out the most prominent orientation(s) in that region, and assign this orientation(s) to the keypoint. Any later calculations are done relative to this orientation; this ensures rotation invariance. The size of the "orientation collection region" around the keypoint depends on its scale: the bigger the scale, the bigger the collection region.

The details
Now for the little details about collecting orientations.
Gradient magnitudes and orientations are calculated using these formulae (L is the Gaussian-blurred image at the keypoint's scale):

m(x, y) = sqrt( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )
θ(x, y) = tan⁻¹( (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) )

The magnitude and orientation are calculated for all pixels around the keypoint. Then a histogram is created from them. In this histogram, the 360 degrees of orientation are broken into 36 bins (10 degrees each). Let's say the gradient direction at a certain point (in the "orientation collection region") is 18.759 degrees; then it will go into the 10-19 degree bin. And the "amount" that is added to the bin is proportional to the magnitude of the gradient at that point. Once you've done this for all pixels around the keypoint, the histogram will have a peak at some point. Above, you see the histogram peaks at 20-29 degrees, so the keypoint is assigned orientation 3 (the third bin).

Also, any peak above 80% of the highest peak is converted into a new keypoint. This new keypoint has the same location and scale as the original, but its orientation is equal to the other peak. So orientation can split one keypoint into multiple keypoints.
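A rough sketch of the 36-bin histogram, assuming L is the blurred image at the keypoint's scale, the window lies fully inside the image, and the collection region is a simple square (the 1.5*sigma weighting described in the next section is left out):

    import numpy as np

    def orientation_histogram(L, r, c, radius=8, n_bins=36):
        """Accumulate gradient orientations around (r, c) into n_bins bins,
        weighted by gradient magnitude."""
        hist = np.zeros(n_bins)
        for y in range(r - radius, r + radius + 1):
            for x in range(c - radius, c + radius + 1):
                dx = float(L[y, x + 1]) - float(L[y, x - 1])
                dy = float(L[y + 1, x]) - float(L[y - 1, x])
                magnitude = np.sqrt(dx * dx + dy * dy)
                angle = np.degrees(np.arctan2(dy, dx)) % 360.0   # 0..360 degrees
                hist[int(angle // (360 // n_bins))] += magnitude
        return hist

    # The keypoint's orientation is the centre of the biggest bin; any peak above
    # 80% of the maximum would spawn an additional keypoint.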
The Technical Details
Magnitudes
Saw the gradient magnitude image above? In SIFT, you need to blur it by an amount of 1.5*sigma.

Size of the window
The window size, or the "orientation collection region", is equal to the size of the kernel for a Gaussian blur of amount 1.5*sigma.

Summary
To assign an orientation we use a histogram over a small region around the keypoint. Using the histogram, the most prominent gradient orientation(s) are identified. If there is only one peak, it is assigned to the keypoint. If there are multiple peaks above the 80% mark, each is converted into a new keypoint (with its respective orientation). Next, we generate a highly distinctive "fingerprint" for each keypoint. Here's a little teaser: this fingerprint, or "feature vector", has 128 different numbers.
SIFT: Generating a feature
Now for the final step of SIFT. Till now, we have scale and rotation invariance. Now we create a fingerprint for each keypoint, to identify it. If an eye is a keypoint, then using this fingerprint we'll be able to distinguish it from other keypoints, like ears, noses, fingers, etc.

The idea
We want to generate a very distinctive fingerprint for the keypoint. It should be easy to calculate. We also want it to be relatively lenient when it is being compared against other keypoints; things are never EXACTLY the same when comparing two different images. To do this, we take a 16x16 window around the keypoint. This 16x16 window is broken into sixteen 4x4 windows. Within each 4x4 window, gradient magnitudes and orientations are calculated, and the orientations are put into an 8-bin histogram.
Any gradient orientation in the range 0-44 degrees adds to the first bin, 45-89 to the next bin, and so on. And (as always) the amount added to a bin depends on the magnitude of the gradient. Unlike before, the amount added also depends on the distance from the keypoint, so gradients that are far away from the keypoint add smaller values to the histogram. This is done using a "Gaussian weighting function". This function is like a 2D bell curve: you multiply it with the gradient magnitudes and get weighted values. The farther away, the smaller the magnitude.

Doing this for all 16 pixels, you will have "compiled" 16 arbitrary orientations into 8 predetermined bins. You do this for all sixteen 4x4 regions, so you end up with 4x4x8 = 128 numbers. Once you have all 128 numbers, you normalize them (just like you would normalize a vector in school: divide by the root of the sum of squares). These 128 numbers form the "feature vector", and the keypoint is identified by it.

You might have noticed that in the pictures above, the keypoint lies "in between" pixels; it does not lie exactly on a pixel. The 16x16 window therefore takes orientations and magnitudes of the image "in between" pixels, so you need to interpolate the image to generate orientation and magnitude data at those positions.
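A stripped-down sketch of the 128-number descriptor. Here mag and angle are assumed to be precomputed gradient magnitude and orientation images (in degrees, already made relative to the keypoint's orientation), and the Gaussian weighting and interpolation mentioned above are left out:

    import numpy as np

    def raw_descriptor(mag, angle, r, c):
        """Build a 4x4x8 = 128-element descriptor from a 16x16 window near (r, c)."""
        desc = []
        top, left = r - 8, c - 8
        for cell_y in range(4):
            for cell_x in range(4):
                hist = np.zeros(8)                       # 8 bins of 45 degrees each
                for dy in range(4):
                    for dx in range(4):
                        y = top + cell_y * 4 + dy
                        x = left + cell_x * 4 + dx
                        hist[int(angle[y, x] % 360 // 45)] += mag[y, x]
                desc.extend(hist)
        desc = np.array(desc)
        return desc / (np.linalg.norm(desc) + 1e-12)     # normalize to unit length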
Problems
This feature vector introduces a few complications. We need to get rid of them before finalizing the fingerprint.

1. Rotation dependence The feature vector uses gradient orientations. Clearly, if you rotate the image, everything changes: all gradient orientations change too. To achieve rotation independence, the keypoint's orientation is subtracted from each gradient orientation. Thus each gradient orientation becomes relative to the keypoint's orientation.

2. Illumination dependence If we threshold numbers that are big, we can achieve illumination independence. So any number (of the 128) greater than 0.2 is changed to 0.2, and the resultant feature vector is normalized again. Now you have an illumination-independent feature vector!

Summary
You take a 16x16 window of "in-between" pixels around the keypoint. You split that window into sixteen 4x4 windows. From each 4x4 window you generate an 8-bin histogram, each bin corresponding to 0-44 degrees, 45-89 degrees, and so on. Gradient orientations from the 4x4 window are put into these bins. This is done for all sixteen blocks. Finally, you normalize the 128 values you get. To solve a few problems, you subtract the keypoint's orientation from each gradient orientation and also threshold each element of the feature vector at 0.2 (and normalize again).

The End!
Once you have the features, you go play with them! I'll get to that in a later post (or posts :P). Read up on how the Hough transform works; it will be used a lot.
Harris Corner Detector
The Harris Corner Detector is a mathematical operator that finds features (see "Features: What are they?" below) in an image. It is simple and fast to compute. It is popular because it is largely invariant to rotation and to illumination changes (though, unlike SIFT, it is not scale invariant). The Shi-Tomasi corner detector, the one implemented in OpenCV, is an improvement on this corner detector.

The mathematics
To define the Harris corner detector, we have to go into a bit of math: a bit of calculus and some matrix math. But trust me, it won't be tough; I'll make everything easy to understand! Our aim is to find little patches of the image (or "windows") that generate a large variation when moved around. Have a look at this image:
Marked areas have a lot of variation

The red square is the window we've chosen. Moving it around doesn't show much variation; that is, the difference between the window and the original image underneath it is very low. So you can't really tell if the window "belongs" to that position. Of course, if you move the window too much, like onto the reddish region, you're bound to see a big difference. But then we've moved the window too much. Not good. Now have a look at this:
Regions with extremely high variation

See? Even a little movement of the window produces a noticeable difference. This is the kind of window we're looking for. Here's how it translates mathematically:

The equation
E(u, v) = Σ over (x, y) of w(x, y) [ I(x + u, y + v) − I(x, y) ]²

 E is the difference between the original and the moved window.
 u is the window's displacement in the x direction.
 v is the window's displacement in the y direction.
 w(x, y) is the window at position (x, y). It acts like a mask, ensuring that only the desired window is used.
 I is the intensity of the image at a position (x, y).
 I(x+u, y+v) is the intensity of the moved window.
 I(x, y) is the intensity of the original.

We're looking for windows that produce a large E value. To do that, we need high values of the term inside the square brackets. (Note: there's a little error in these equations. Can you figure it out? Answer below!) So, we maximize this term:

Σ over (x, y) of [ I(x + u, y + v) − I(x, y) ]²

Then we expand this term using the Taylor series. What's that? It's just a way of rewriting the term using its derivatives.
See how the I(x+u, y+v) changed into a totally different form ( I(x,y) + u·Ix + v·Iy )? That's the Taylor series in action. And because the Taylor series is infinite, we've ignored all terms after the first three. It gives a pretty good approximation, but it isn't the exact value. Next, we expand the square. The I(x,y) cancels out, so it's just two terms we need to square. It looks like this:

E(u, v) ≈ Σ over (x, y) of ( u·Ix + v·Iy )² = Σ ( u²Ix² + 2uv·IxIy + v²Iy² )

Now this messy equation can be tucked up into a neat little matrix form like this:

E(u, v) ≈ [u v] ( Σ over (x, y) of [ Ix²  IxIy ; IxIy  Iy² ] ) [u v]ᵀ

See how the entire equation gets converted into a neat little matrix! (The error: there's no w(x, y) in these equations :P ) Now we rename the summed matrix and call it M:

M = Σ over (x, y) of w(x, y) [ Ix²  IxIy ; IxIy  Iy² ]

So the equation now becomes:

E(u, v) ≈ [u v] M [u v]ᵀ

Looks so neat after all the clutter above.

Interesting windows
It was figured out that the eigenvalues of the matrix can help determine the suitability of a window. A score, R, is calculated for each window:

R = det(M) − k · (trace(M))²

where det(M) = λ1·λ2, trace(M) = λ1 + λ2 (λ1 and λ2 being the eigenvalues), and k is a small empirically chosen constant.
All windows that have a score R greater than a certain value are corners. They are good tracking points.

Summary
The Harris Corner Detector is just a mathematical way of determining which windows produce large variations when moved in any direction. With each window, a score R is associated. Based on this score, you can figure out which ones are corners and which ones are not. OpenCV implements an improved version of this corner detector, called the Shi-Tomasi corner detector.
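In OpenCV, the Harris response is available directly; a short usage sketch, where the file name, blockSize and k value are illustrative only:

    import cv2
    import numpy as np

    img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
    gray = np.float32(img)

    # blockSize: neighbourhood considered, ksize: Sobel aperture, k: Harris constant
    R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

    # Keep windows whose score is above a fraction of the strongest response
    corners = np.argwhere(R > 0.01 * R.max())                # (row, col) pairs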
Features: What are they?
Several computer vision tasks require finding matching points across several frames or views. With that info, you can really do a lot of stuff. An example: when doing stereo imaging, you want to know a few corresponding points between the two views. Once you do, you can triangulate almost all points in the image (just like the brain does!).

The first approach: Patches
Intuitively, you'd be tempted to match small "patches" between the two images. Something like this: you want to find the left image's green patch in the right image. And you can do that quite easily; the white marking makes the patch quite unique. Even something as trivial as template matching would be able to find it. But not all patches are so uniquely recognizable. Check the patches below:
There are no unique "features" to identify on the wall, so you'll have problems finding corresponding points. The patches approach isn't that great.

Corners
Corners in an image seem to be perfect for such tracking tasks. Here's an example image: these corners are perfect! Why?
Uniquely identifiable
These points are uniquely identifiable. What do I mean by that? Let's say you're trying to find the green corner in the right image (in the image below). You know that it'll be somewhere around the same location, so you can narrow down the "search region". And within this search region, there will be only one point that resembles the corner. Of course, the assumption here is that there isn't a massive difference between the two images, and this is usually a reasonable assumption.

Stable
These points usually don't keep moving around in the image. This helps tracking. And any motion of such a point, even a little one, produces a large variation; you can clearly see the point moving around.
A bad feature
I'll try to make the idea of a "corner" more concrete. We'll use some math to do this. How do you identify a bad feature? Something that doesn't have a lot of variation... like the example of the wall above. There are no edges or corners in the feature. So the first derivative is flat in both directions, x and y. Or rather, the first derivative is flat in all directions (every direction is some combination of the x and y components). So the second derivative also does not change in any direction.

An edge
An edge is a bad feature as well. If you move in the direction of the edge, you won't even know you're moving. For example, if you move along the edge between the top of the building and the sky, you won't even realize it.
But if you move from the building to the sky (perpendicular to the edge) you'll "see" motion. This is the only direction in which you can accurately tell how fast the object is moving. So edges aren't that useful as features: the first derivative changes in only one direction (perpendicular to the edge). Some progress, but not that good. And the second derivative also changes in only one direction.

A corner
A corner is an awesome feature! There's variation all around a corner, so the derivative changes in all directions, and the second derivative changes in all directions too! Great! And you can write pretty efficient programs to calculate that at all points. So there you have it. Identifying good features: if the first derivative keeps changing around a point, you know you have a corner. And you also know you have a good feature to track!
So what exactly is a feature?
By now, you hopefully have an idea of what a feature is, at least intuitively. If not, go read the post again... because there is no formal definition of a feature so far. :P

Convolutions
Convolution is a technique from general signal processing. People studying electrical/electronics engineering will tell you about the near-infinite sleepless nights these convolutions have given them. Entire books have been written on the topic, and the questions and theorems that need to be proved can seem insurmountable. But for computer vision, we'll just deal with some simple things.

The Kernel
A convolution lets you do many things: calculate derivatives, detect edges, apply blurs, and more. A very wide variety of things. And all of this is done with a "convolution kernel". The convolution kernel is a small matrix. This matrix has a number in each cell and has an anchor point:
This kernel slides over an image and does its thing. The "anchor" point is used to determine the position of the kernel with respect to the image.

The transformation
The anchor point starts at the top-left corner of the image and moves over each pixel sequentially. At each position, the kernel overlaps a few pixels of the image. Each overlapping pair of numbers is multiplied, and the products are added up. Finally, the value at the current position is set to this sum. Here's an example:
The matrix on the left is the image and the one on the right is the kernel. Suppose the kernel is at the highlighted position. The '9' of the kernel overlaps with the '4' of the image, so you calculate their product: 36. Next, the '3' of the kernel overlaps the '3' of the image; multiply to get 9 and add it to 36, giving 36+9=45. You do the same for all the remaining 7 overlapping values and accumulate the total sum. This sum is stored in place of the '2' (in the image).

Speed optimizations
The most direct way to compute a convolution would be to use nested for loops. But that involves a lot of repeated calculation, and as the size of the image and kernel increases, the time to compute the convolution increases too (quite drastically). Techniques have been developed to calculate convolutions rapidly. One such technique uses the Discrete Fourier Transform, which converts the entire convolution operation into a simple multiplication. Fortunately, you don't need to know the math to do this in OpenCV; it automatically decides whether to do the work in the frequency domain (after the DFT) or not.
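In OpenCV, this multiply-and-sum operation is what cv2.filter2D does; a sketch with an arbitrary kernel and a hypothetical input file:

    import cv2
    import numpy as np

    img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

    # An arbitrary 3x3 kernel; the anchor defaults to the kernel's centre
    kernel = np.array([[0, 1, 0],
                       [1, 2, 1],
                       [0, 1, 0]], dtype=np.float32)

    # ddepth=-1 keeps the output in the same depth as the input.
    # (Strictly, filter2D computes correlation; for symmetric kernels it is
    # identical to convolution.)
    result = cv2.filter2D(img, -1, kernel)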
Problematic corners and edges
The kernel is two-dimensional, so you have problems when the kernel is near the edges or corners of the image. Here's an example: if the kernel (in the above example) is at the top-right position, the '0' of the kernel will be over the '3' in the image, but the '1' will be outside the image. So we have no idea what to do with it. Two things are possible:
 Ignore those values, or
 Do something about the edges.
Usually people choose to do something about it: they create extra pixels near the edges. There are a few ways to create extra pixels:
 Set a constant value for these pixels
 Duplicate the edge pixels
 Reflect the edges (like a mirror effect)
 Wrap the image around (copy pixels from the other end)
This usually fixes the problems that might arise.

Summary
You learned a powerful technique that can be used for a lot of different purposes. We'll see a few of those next.
Image convolution examples
Convolution is very useful for signal processing in general, and there is a lot of complex mathematical theory behind it. For digital image processing, you don't have to understand all of that. You can use a simple matrix as an image convolution kernel and do some interesting things!

Simple box blur
Here's the first and simplest one. This convolution kernel has an averaging effect, so you end up with a slight blur. The kernel is a small matrix of equal weights (for example, a 3x3 matrix with every element equal to 1/9). Note that the sum of all elements of this matrix is 1.0. This is important: if the sum is not exactly one, the resultant image will be brighter or darker. Here's a blur that I got on an image:

A simple blur done with convolutions
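A sketch of the box blur with filter2D; the 3x3 averaging kernel shown here is the common choice, though the slide's kernel may be a different size, and the file name is a placeholder:

    import cv2
    import numpy as np

    img = cv2.imread("house.jpg", cv2.IMREAD_GRAYSCALE)     # hypothetical input image

    box_kernel = np.ones((3, 3), dtype=np.float32) / 9.0    # elements sum to 1.0
    blurred = cv2.filter2D(img, -1, box_kernel)

    # cv2.blur(img, (3, 3)) is a built-in shortcut for the same operation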
Gaussian blur
The Gaussian blur has certain mathematical properties that make it important for computer vision, and you can approximate it with an image convolution. The image convolution kernel for a Gaussian blur is:

Here's a result that I got:

Line detection with image convolutions
With image convolutions, you can easily detect lines. Here are four convolution kernels to detect horizontal lines, vertical lines, and lines at 45 degrees in both directions:
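The kernel images aren't reproduced here, but for reference, commonly used 3x3 versions look like this (a sketch; the slide's exact values may differ). Each one can be applied with cv2.filter2D exactly as shown earlier:

    import numpy as np

    # Approximate 3x3 Gaussian blur kernel (elements sum to 1.0)
    gaussian_3x3 = np.array([[1, 2, 1],
                             [2, 4, 2],
                             [1, 2, 1]], dtype=np.float32) / 16.0

    # Line-detection kernels: horizontal, vertical, +45 and -45 degrees
    horizontal = np.array([[-1, -1, -1],
                           [ 2,  2,  2],
                           [-1, -1, -1]], dtype=np.float32)
    vertical = horizontal.T
    diag_45 = np.array([[-1, -1,  2],
                        [-1,  2, -1],
                        [ 2, -1, -1]], dtype=np.float32)
    diag_135 = np.flip(diag_45, axis=1)   # the other 45-degree direction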
I looked for horizontal lines on the house image. The result I got for this image convolution was:

Edge detection
The above kernels are, in a way, edge detectors. The only thing is that they have separate components for horizontal and vertical lines. A way to "combine" the results is to merge the convolution kernels. The new image convolution kernel looks like this:

Below is the result I got with edge detection:
The Sobel Edge Operator
The above operators are very prone to noise. The Sobel edge operators have a smoothing effect, so they're less affected by noise. Again, there's a horizontal component and a vertical component. On applying this image convolution, the result was:
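In code, rather than building the kernels by hand, the Sobel operator is usually applied with OpenCV's built-in function; a sketch (file name and aperture size are illustrative):

    import cv2
    import numpy as np

    img = cv2.imread("house.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

    # Horizontal and vertical components (first derivatives in x and y)
    grad_x = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
    grad_y = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)

    # Combine the two components into a single edge-strength image
    magnitude = cv2.magnitude(grad_x, grad_y)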
The Laplacian operator
The Laplacian is the second derivative of the image. It is extremely sensitive to noise, so it isn't used as much as other operators, unless of course you have specific requirements. Here's the result with the convolution kernel without diagonals:

The Laplacian of Gaussian
The Laplacian alone has the disadvantage of being extremely sensitive to noise, so smoothing the image before applying the Laplacian improves the results we get. This is done with a 5x5 image convolution kernel.
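One way to get the same effect in code is to smooth first and then apply the Laplacian; a sketch with OpenCV (kernel sizes and file name are illustrative):

    import cv2

    img = cv2.imread("house.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

    # Smooth first, then take the second derivative: Laplacian of Gaussian
    smoothed = cv2.GaussianBlur(img, (5, 5), 0)
    log = cv2.Laplacian(smoothed, cv2.CV_32F, ksize=3)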
The result of applying this image convolution was:

Summary
You got to know about some important operations that can be approximated using an image convolution. You learned the exact convolution kernels used and also saw an example of how each operator modifies an image. I hope this helped!

Optical flow
Optical flow (or optic flow) is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene.

Meanshift
The intuition behind meanshift is simple. Consider that you have a set of points (it could be a pixel distribution such as a histogram backprojection).
You are given a small window (maybe a circle) and you have to move that window to the area of maximum pixel density (or maximum number of points). It is illustrated in the simple image given below:

The initial window is shown as the blue circle named "C1". Its original center is marked with the blue rectangle named "C1_o". But if you find the centroid of the points inside that window, you will get the point "C1_r" (marked with a small blue circle), which is the real centroid of the window. Surely they don't match. So move your window such that the circle of the new window is centred on the previous centroid, and find the new centroid again. Most probably, it won't match either. So move it again, and continue the iterations until the center of the window and its centroid fall on the same location (or within a small desired error). What you finally obtain is a window with maximum pixel distribution. It is marked with the green circle named "C2". As you can see in the image, it contains the maximum number of points. The whole process is demonstrated on a static image below:
So we normally pass the histogram-backprojected image and the initial target location. When the object moves, the movement is obviously reflected in the histogram-backprojected image. As a result, the meanshift algorithm moves our window to the new location of maximum density.
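OpenCV exposes this as cv2.meanShift; a sketch of typical tracking usage, where the file name, initial window and histogram settings are all placeholders:

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("video.mp4")                    # hypothetical video
    ok, frame = cap.read()

    x, y, w, h = 300, 200, 100, 50                         # hypothetical initial target window
    roi = frame[y:y + h, x:x + w]

    # Hue histogram of the target region, used for backprojection
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
        # Meanshift moves the window towards the area of maximum backprojection density
        _, (x, y, w, h) = cv2.meanShift(back_proj, (x, y, w, h), criteria)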
The Shi-Tomasi Corner Detector
The Shi-Tomasi corner detector is based entirely on the Harris corner detector. However, one slight variation in the "selection criterion" made this detector much better than the original. It works quite well even where the Harris corner detector fails. So here's the minor change that Shi and Tomasi made to the original Harris corner detector.

The change
The Harris corner detector has a corner selection criterion: a score is calculated for each pixel, and if the score is above a certain value, the pixel is marked as a corner.
The score is calculated using the two eigenvalues. That is, you give the two eigenvalues to a function; the function manipulates them and gives back a score. You can read more above about how interesting windows in the Harris corner detector are selected. Shi and Tomasi suggested that the function should be done away with: only the eigenvalues should be used to check whether the pixel is a corner or not.

The score for the Harris corner detector was calculated like this (R is the score):

R = λ1·λ2 − k·(λ1 + λ2)²

For Shi-Tomasi, it's calculated like this:

R = min(λ1, λ2)

In their paper, Shi and Tomasi demonstrated experimentally that this score criterion is much better. If R is greater than a certain predefined value, the pixel can be marked as a corner. Thus, the region (in the λ1–λ2 plane) in which a point is accepted as a corner looks something like this:
 Green: both λ1 and λ2 are greater than a certain value. This region is for pixels "accepted" as corners.
 In the blue and gray regions, either λ1 or λ2 is less than the required minimum.
 In the red region, both λ1 and λ2 are less than the required minimum.
Compare the above with a similar graph for the Harris corner detector... you'll see the blue and gray areas are equivalent to the "edge" areas, the red region is for "flat" areas, and the green is for corners.

Summary
The Shi-Tomasi corner detector is a complete ripoff of the Harris corner detector, except for the minor change they made :P However, it is much better than the original corner detector, so people use it a lot more. Also, OpenCV implements the Shi-Tomasi corner detection algorithm.
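In OpenCV, the Shi-Tomasi detector is exposed as cv2.goodFeaturesToTrack; a small usage sketch (parameter values and file name are illustrative):

    import cv2
    import numpy as np

    img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

    # maxCorners, qualityLevel and minDistance are tuning parameters, not fixed values
    corners = cv2.goodFeaturesToTrack(img, maxCorners=100, qualityLevel=0.01, minDistance=10)

    for corner in np.int32(corners).reshape(-1, 2):
        center = (int(corner[0]), int(corner[1]))
        cv2.circle(img, center, 3, 255, -1)                   # mark each detected corner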