Vision is one of the information-rich sensor system both humans and robots have available. Processing the wealth of information that is generated by vision sensors is still a key challenge, however. Key primitives in a vision system are low-level filters, such as convolution with Gaussian filters, and feature extraction. In “Introduction to Robotics”, we learned how to perform basic filtering and how to construct simple feature detectors that are trained/hard-coded for detecting simple patterns in the environment. Deploying such markers is not always feasible, however, and it is impossible to detect the markers when rotated or partially occluded – all things possible for humans. The goals of this lecture are to
- introduce Scale-Invariant Feature Transforms
- show how resulting scale, rotation (and to some extent brightness) invariant features can be used for object recognition
Scale-Invariant Feature Transforms
Scale-invariant feature transforms are a class of algorithms/signal-processing techniques that allow to extract features that are easily detectable across different scales (or distances to an object), independent of their rotation, and to some extent robust to affine transformations, i.e., views of the same object from different perspectives, and illumination changes. The most prominent in this class is the SIFT algorithm, which however has lost popularity due to closed-source and licensing cost, and has been replaced in the past with SURF (Speed-Up Robust Feature) , which is faster and freely available. As the math behind SURF is more involved, we focus on the intuition behind SIFT and encourage the reader to download and play with the various open-source implementations of SURF.
SIFT proceeds in multiple steps. Descriptions of the algorithm often include its application to object recognition, but these algorithms are independent of feature generation (see below).
- Generate multiple scaled versions of the same image by re-sampling every 2nd, 4th and so on pixel.
- Filtering each scaled picture with various Gaussian filters of different variance.
- Calculating the difference between pairs of filtered images. This is equivalent to a Difference-of-Gaussian (DoG) filter.
- Detecting local minima and maxima in the DoG images across different scales.
- Reject extrema that are along edges by looking at the second derivative (Hessian Matrix) around each extrema. Edges have a much larger principal curvature across them than along them.
- Assign a “magnitude” and “orientation” to each remaining extrema (keypoint). The magnitude is the squared difference between neighboring pixels and the orientation is the angle between magnitude along the y-axis vs. magnitude along the x-axis. These calculations are made for all pixels in a fixed neighborhood around the initial keypoint, e.g., in a 16×16 pixel neighborhood.
- Collect orientations of neighboring pixels in a histogram, e.g., 36 bins each covering 10 degrees. Maintain the orientation corresponding to the strongest peak and associate it with the keypoint.
- Repeat step 7, but for four 4×4 pixel areas around the keypoint in the image scale that has the most extreme minima/maxima. Here, only 8 bins are used for the orientation histogram. As there are 16 histograms in a 16×16 pixel area, the feature descriptor has 128 dimensions.
- The feature descriptor vector is normalized, tresholded, and again normalized to make it more robust against illumination changes.
Object Recognition using SIFT features
SIFT features of training images can be stored in a database and can be used to identify these objects in the future. This is done by finding all SIFT features in an image and comparing them with those in the database. This comparison is done by using the Euclidian distance as metric and search a k-d tree (with d=128). In order to make this approach robust, each object needs to be identified by at least 3 independent SIFT features. For this, each SIFT descriptor stores the location, scale and orientation of it relative to some common point on the object. This allows each detected feature to “vote” for the position of the object that it is most closely associated with in the database. This is done using a Hough-transform. For example, position (2 dimensions) and orientation (1 dimension) can be discretized into bins (30 degree width for orientation); bright spots in Hough-space then correspond to an object pose that has been identified by multiple features.
As the Hough-transform clustering is very rough,  suggests an iterative approach that consists of finding an affine transformation that is consistent with all features belonging to a cluster in the Hough transform. This affine transformation is found by a least-square fit. These averaged pose values can then be used to identify outliers, which can then be excluded from the process. A new affine transformation is then found until all features are in agreement.
- SIFT are powerful methods to extract scale, rotation, and brightness invariant features from images.
- SIFT features have 128 dimensions and are compared via their Euclidian distance. Nevertheless, object recognition can be very reliable by combining at least 3 features’ votes.
- SIFT has been a revolutionary method, but is in the process of being replaced by more advanced methods such as SURF.
 Lowe, D. G., “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 60, 2, pp. 91-110, 2004.
 Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool “SURF: Speeded Up Robust Features”, Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346-359, 2008.