So far, we have employed the following approach to moving an arm toward an object: identifying the object and its pose using SIFT, ICP, or a combination thereof, importing its 3D model into a representation of the world, and planning a path using this representation. This approach is not only computationally expensive, but also not very robust against dynamic changes in the environment. Today we will introduce a complementary approach that uses feedback control to drive a robot arm into a desired position. This approach is complementary in that it can make up for small errors in positioning, but is not necessarily complete. In particular, we will be treating

• formulating the inverse kinematics problem as a control problem that uses visual feedback
• visual servo control in image space
• visual servo control in position space

This lecture is complemented by a tutorial on Visual Servo Control by Francois Chaumette and Seth Hutchinson [Cha06] and uses the same notation.

# Visual Servoing: Basics

The key idea in visual servoing is to represent the difference between the desired and the actual location of a robot's gripper as an error in image space. This requires the robot to have both the target and the gripper in its field of view at all times. An example is shown below:

A Puma manipulator with eye-in-hand camera configuration looking at a target (JaViSS simulator)

Desired target position.

The robot’s goal is now to move its camera/gripper so that the target is, for example, centered in the image and upright (right). The question is how the positions of features (such as the colored dots) in the image can be related to the arm’s joint angles. (These examples have been created using the JaViSS simulator, a Java applet.) To understand this, we will begin by relating the pose of the camera frame to the position of features in the image.

Let $s$ be a set of features in image space. This could be a column vector of all the pixel positions of the colored dots in the top left image. Let $s^*$ be the vector of the desired values for these features. This could be a column vector of all the pixel positions of the colored dots in the image to the right. We can now express the “error” between desired and actual position as

$e(t)=s(t)-s^*$

This error is a function of the camera pose, which has 6 dimensions: 3 translations and 3 rotations.

A simple proportional controller that will drive the error to zero eventually could look like

$\dot{e}=-\lambda e$

where $\lambda$ is the proportional gain. The underlying theory is a lecture on its own (“Control Theory”), but here is some intuition: (1) As soon as $e=0$, $\dot{e}=0$. (2) As long as $e>0$, $\dot{e}<0$. (3) Assuming there is a way to translate a desired $\dot{e}$ into a motion of the robot, $e$ will therefore become smaller. (4) Eventually, $e$ will reach zero. (5) If $\lambda$ is too large, the robot’s motions might overshoot and even worsen the error; if $\lambda$ is too small, converging to zero error might take too long.
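The exponential decay this law produces can be seen in a minimal discrete simulation (the gain, time step, and initial error below are arbitrary illustration values):

```python
import numpy as np

# Sketch: discrete simulation of the proportional law e_dot = -lambda * e,
# integrated with a simple Euler step.
lam = 0.5                    # proportional gain lambda (illustrative value)
dt = 0.1                     # integration time step
e = np.array([10.0, -4.0])   # initial feature error (e.g., in pixels)

for _ in range(100):
    e_dot = -lam * e         # control law
    e = e + e_dot * dt       # Euler integration step

print(np.linalg.norm(e))     # error norm has decayed close to zero
```

Running this with a larger gain (e.g., `lam = 25`) makes the discrete update overshoot, illustrating point (5) above.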

Thus, if we knew how changes in the camera’s pose affect changes in the error, we could design a controller that drives this error to zero. This relationship is known as the feature Jacobian, also known as the interaction matrix. Let the feature Jacobian be denoted $L_e$ and be a $k \times 6$ matrix, where $k$ is the dimension of the feature vector (e.g., 3 points in an image result in 6 values/dimensions). We can then write

$\dot{e}=L_ev_c$

where $v_c=(v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)$ is the camera's translational and angular velocity. As usual, we expect $L_e$ to contain the partial derivatives of the entries of the error vector with respect to the components of the camera motion.

We can now plug the control law $\dot{e}=-\lambda e$ into $\dot{e}=L_ev_c$ and obtain

$v_c=-\lambda L_e^+e$.

where $L_e^+=(L_e^TL_e)^{-1}L_e^T$ is the pseudo-inverse of $L_e$. As it is usually impossible to calculate $L_e$ exactly (but relatively easy, for example, to estimate it by performing a couple of motions in all possible directions and noting the changes to the error, or to derive analytical expressions based on a simplified camera model), we write $\hat{L_e^+}$ to denote that we are using an estimate of the feature Jacobian.
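The control law itself is one line of linear algebra. The sketch below assumes a made-up $8 \times 6$ estimate $\hat{L_e}$ (four image points, i.e., $k=8$) and a random error vector, just to show the shapes involved:

```python
import numpy as np

# Sketch: computing the camera velocity command v_c = -lambda * L_e^+ * e.
# L_hat and e are random placeholders for illustration only.
rng = np.random.default_rng(0)
L_hat = rng.standard_normal((8, 6))   # estimated feature Jacobian (k x 6)
e = rng.standard_normal(8)            # current feature error (k,)
lam = 0.4                             # proportional gain

# np.linalg.pinv computes the Moore-Penrose pseudo-inverse, which equals
# (L^T L)^{-1} L^T when L has full column rank.
v_c = -lam * np.linalg.pinv(L_hat) @ e
print(v_c.shape)   # (6,): (v_x, v_y, v_z, w_x, w_y, w_z)
```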

## The Feature Jacobian

Using a perspective projection (with unit focal length), we can assume that points in image space $\bold{x}=(x,y)$ relate to 3D coordinates $\bold{X}=(X,Y,Z)$ in the camera frame as follows:

$x=X/Z$

$y=Y/Z$

as both horizontal and vertical distances shrink with growing distance $Z$. By taking the derivative of $\bold{x}$ (note: use both chain and product rule) and employing $\bold{\dot{X}}=-v_c-\omega_c \times \bold{X}$, we obtain

$L_x=\left[\begin{array}{cccccc}-\frac{1}{Z} & 0 & \frac{x}{Z} & xy & -(1+x^2) & y\\0 & -\frac{1}{Z} & \frac{y}{Z} & 1+y^2 & -xy & -x\end{array}\right]$

for

$\dot{x}=L_x v_c$

and therefore an expression of how motions of the camera frame affect changes in the image frame. (For details on this derivation, please refer to [Cha06], equations 6–11.) Note that $L_x$ contains $Z$, the depth of the point, which is not available from standard cameras. When doing visual servoing, one therefore needs to either estimate this depth, e.g., using the perceived size of an object as a reference, or extract it from depth data such as RGB-D or stereo vision.
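The interaction matrix for a single point is straightforward to write down. A small sketch, with illustrative coordinates and depth:

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction matrix L_x for one image point (x, y) at depth Z,
    following the 2x6 matrix above (cf. [Cha06])."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

# Example: a point at normalized image coordinates (0.1, -0.2), 2 m deep.
L = interaction_matrix(0.1, -0.2, 2.0)

# Pure translation along the optical axis (v_z > 0, camera approaching):
v_c = np.array([0.0, 0.0, 0.5, 0.0, 0.0, 0.0])
x_dot = L @ v_c
print(x_dot)   # [0.025, -0.05]: the point moves radially away from center
```

As the comment notes, approaching the scene makes image points expand outward, which matches the $x/Z$ and $y/Z$ entries in the third column.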

If one can express $e$ in image coordinates and can directly control $v_c$, the feedback controller will generate a trajectory that drives $e$ to zero.  As $L_x$ can become singular for some configurations and there exist local minima, usually more than 3 points are used.

There are multiple ways of estimating $\hat{L_e^+}$ that depend on how depth is extracted from the scene. If depth is not available, a popular method is to calculate $\hat{L_{e^*}^+}$ at the desired position, where $e=e^*=0$. As can be seen in Figures 2–4, the choice of $\hat{L_e^+}$ drastically affects the trajectories of the system in image and joint space. For example, whereas points move in straight lines in the image, the arm might perform a wide curve that does not necessarily correspond to the shortest path. This can only be mitigated by expressing the error not in image coordinates, but in 3D coordinates, which can be either estimated from the 2D image or extracted from a stereo or RGB-D sensor. This approach is known as position-based visual servoing.

Now, the inverse velocity kinematics of the arm can be used to calculate the joint velocities necessary to achieve the desired motion of the camera frame.
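This last step is again a (pseudo-)inversion, this time of the robot's own velocity Jacobian. A minimal sketch, assuming a 6-DOF arm whose Jacobian $J_r$ is known (the matrix below is a random placeholder):

```python
import numpy as np

# Sketch: mapping the camera velocity v_c from the visual controller to
# joint velocities. J_r is the robot's velocity Jacobian, which maps joint
# velocities to camera-frame velocities; here it is a random placeholder
# standing in for a known 6-DOF arm model.
rng = np.random.default_rng(1)
J_r = rng.standard_normal((6, 6))                  # robot Jacobian (assumed known)
v_c = np.array([0.0, 0.0, 0.05, 0.0, 0.0, 0.01])  # desired camera twist

q_dot = np.linalg.pinv(J_r) @ v_c                  # joint velocity command
print(np.allclose(J_r @ q_dot, v_c))               # the command reproduces v_c
```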

# Visual Servo Control for Robots with Unknown Kinematics

As the relationship between the velocity of the camera frame and the velocities of the joints is again a Jacobian (the velocity Jacobian), the feature and the velocity Jacobians can be multiplied into a single matrix, giving an expression of the form

$\dot{e}=J\dot{j}$

where $\dot{j}$ are the joint velocities. This is a particularly interesting formulation when the kinematics of the robot are unknown. Let’s assume (for simplicity) that we have a robot for which we can control the velocities of “joint” 1 and “joint” 2. Let these velocities be $\Delta j_1$ and $\Delta j_2$. Let’s assume (also for simplicity) that we can observe two values in an image, $e_1$ and $e_2$. The relationship between $e$ and $j$ is then given by

$\left(\begin{array}{c}e_1\\e_2\end{array}\right)=\left(\begin{array}{cc}J_{11} & J_{12}\\J_{21} & J_{22}\end{array}\right)\left(\begin{array}{c}\Delta j_1 \\ \Delta j_2\end{array}\right)$

We can now rewrite this expression as

$\left(\begin{array}{c}e_1\\e_2\\ \vdots\end{array}\right)=\left(\begin{array}{cccc}\Delta j_1 & \Delta j_2 & 0 & 0\\0 & 0 & \Delta j_1 & \Delta j_2\\\vdots & \vdots & \vdots & \vdots\end{array}\right)\left(\begin{array}{c}J_{11} \\ J_{12} \\ J_{21} \\ J_{22}\end{array}\right)$

which can contain an arbitrary number of $e$ to $\Delta j$ pairs. Here, the $\Delta j$ can be random (or systematic) motions and the $e$ are the corresponding errors observed in image space. This equation can then be solved for the vector containing the entries of the Jacobian (again using the pseudo-inverse), which corresponds to an optimal least-squares fit.
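A small sketch of this estimation procedure, using a made-up "true" 2x2 Jacobian only to generate synthetic observations:

```python
import numpy as np

# Sketch: estimating the entries of an unknown 2x2 Jacobian J from
# observed (delta_j, e) pairs via the stacked least-squares system above.
# J_true is a placeholder used only to generate synthetic, noiseless data.
rng = np.random.default_rng(42)
J_true = np.array([[1.0, -0.5],
                   [0.3, 2.0]])

A_rows, b_rows = [], []
for _ in range(20):                      # 20 random exploratory motions
    dj = rng.standard_normal(2)          # random joint motion (dj1, dj2)
    e = J_true @ dj                      # observed image-space change
    # Each observation contributes two rows of the stacked system:
    A_rows.append([dj[0], dj[1], 0.0, 0.0])
    A_rows.append([0.0, 0.0, dj[0], dj[1]])
    b_rows.extend(e)

A = np.array(A_rows)
b = np.array(b_rows)
J_flat, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares fit
J_est = J_flat.reshape(2, 2)
print(J_est)                             # recovers J_true
```

With noiseless data the fit is exact; with sensor noise, adding more exploratory motions averages it out, which is exactly why the stacked formulation is useful.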

Note that this approach, estimating the matrix that best matches a series of observations, is very generic and can equally well be used, for example, when calculating the affine transformation that a set of SIFT features votes for.

# References

[Cha06] F. Chaumette and S. Hutchinson. Visual Servo Control Part I: Basic Approaches. IEEE Robotics and Automation Magazine 13(4):82-90, 2006.
