Kiretu: Reconstruction

To reconstruct a point cloud with the Kinect, it is important to understand how reconstruction works in theory. Several steps are necessary, but almost all of them are based on one fundamental formula, which is derived below. The derivation follows [1] and [2] (German).

Because the derivation can be hard for beginners to follow, I have tried to explain every step in detail.

Pinhole camera

We start with the model of a pinhole camera [3]. Let $P = (X,Y,Z)^T \in \mathbb{R}^3$ be a point in space which is mapped on the point $p = (x,y)^T \in \mathbb{R}^2$ in the image plane.

To simplify the model, we mirror the image plane along the $Z$-axis so that it lies in front of the camera, between the optical center and the point $P$:

There are two coordinate systems: the camera coordinate system $XYZ$ and the image coordinate system $xy$. Note that the coordinates of $P = (X,Y,Z)^T$ and $p = (x,y)^T$ are arbitrary but fixed, so don’t confuse them with the names of the coordinate systems.

We look at the scene above from the side:

$f$ is the distance between the optical center and the image plane and is called the focal length. By the intercept theorem, we get the following equations:

$\frac{y}{f} = \frac{Y}{Z} \quad\Leftrightarrow\quad y = \frac{fY}{Z} \qquad\text{and}\qquad \frac{x}{f} = \frac{X}{Z} \quad\Leftrightarrow\quad x = \frac{fX}{Z}$

We can combine these two equations into a vector:

$\begin{equation} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} fX/Z \\ fY/Z \end{pmatrix} \end{equation}$
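Equation (1) can be tried out directly. The following is a minimal sketch in plain Python; the function name `project_pinhole` and the numeric values are made up for illustration and are not part of Kiretu.

```python
def project_pinhole(point, f):
    """Project a point (X, Y, Z) in camera coordinates onto the image plane.

    Implements x = f*X/Z and y = f*Y/Z from equation (1).
    """
    X, Y, Z = point
    if Z == 0:
        # The ray through the optical center is parallel to the image plane.
        raise ValueError("Z must be non-zero")
    return (f * X / Z, f * Y / Z)


# Hypothetical values: focal length 2, point at depth Z = 4.
x, y = project_pinhole((1.0, 2.0, 4.0), f=2.0)
print(x, y)  # 0.5 1.0
```

Doubling $Z$ halves both image coordinates, which is the perspective effect the intercept theorem describes.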

Homogeneous coordinates

We now take a quick look at the most important characteristics of homogeneous coordinates. If you’ve never heard of homogeneous coordinates, you should read up on this topic first.

Map a point to its homogeneous coordinates:

$\begin{equation} P = \underbrace{\begin{pmatrix} x \\ y \end{pmatrix}}_{\in \mathbb{R}^2} \mapsto \underbrace{\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}}_{\in \mathbb{R}^3} = \tilde{P} \end{equation}$

Equivalence of homogeneous coordinates:

$\begin{equation} \tilde{P} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} \lambda x \\ \lambda y \\ \lambda z \end{pmatrix} = \lambda \tilde{P} \quad(\lambda \in \mathbb{R}\setminus\{0\}) \end{equation}$

Recover a point from its homogeneous coordinates $\tilde{P}$ (with $z \neq 0$):

$\begin{equation} \tilde{P} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x/z \\ y/z \\ 1 \end{pmatrix} \mapsto \begin{pmatrix} x/z \\ y/z \end{pmatrix} = P \end{equation}$

You should keep these relations in mind.
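The three relations (2), (3) and (4) can be checked with a few lines of Python. This is only an illustrative sketch; the helper names `to_homogeneous` and `from_homogeneous` are hypothetical.

```python
def to_homogeneous(point):
    """Relation (2): append a 1 to the coordinates."""
    return tuple(point) + (1.0,)


def from_homogeneous(h):
    """Relation (4): divide by the last component and drop it."""
    *coords, w = h
    if w == 0:
        raise ValueError("last component must be non-zero")
    return tuple(c / w for c in coords)


p = (3.0, 4.0)
h = to_homogeneous(p)                 # (3.0, 4.0, 1.0)
scaled = tuple(2.0 * c for c in h)    # (6.0, 8.0, 2.0), here lambda = 2

# Relation (3): both triples describe the same point.
print(from_homogeneous(h), from_homogeneous(scaled))  # (3.0, 4.0) (3.0, 4.0)
```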

We now map our points to their homogeneous coordinates as in (2):

$P = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \mapsto \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \tilde{P},\quad p = \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \tilde{p}.$

This lets us write our equation (1) in homogeneous coordinates:

$\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} fX/Z \\ fY/Z \end{pmatrix} \quad\Rightarrow\quad \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = s \begin{pmatrix} fX/Z \\ fY/Z \\ 1 \end{pmatrix} \stackrel{\text{(3)}}{=} s \begin{pmatrix} fX \\ fY \\ Z \end{pmatrix},$

where $s \in \mathbb{R} \setminus \{0\}$ represents the factor $\lambda$ of (3). Because $s$ is arbitrary, the factor $Z$ introduced in the last step is simply absorbed into it. We can write this equation in the following way:

$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = s \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$
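To see that the matrix form gives the same result as equation (1), here is a small sketch in plain Python. The helper names and the numbers are assumptions made for this example only. Dividing by the third component at the end corresponds to choosing $s = 1/Z$.

```python
def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def projection_matrix(f):
    """The 3x4 matrix from the equation above."""
    return [[f, 0.0, 0.0, 0.0],
            [0.0, f, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]


def project(f, point):
    """Map (X, Y, Z) to (x, y) via homogeneous coordinates."""
    X, Y, Z = point
    u, v, w = matvec(projection_matrix(f), [X, Y, Z, 1.0])  # (fX, fY, Z)
    if w == 0:
        raise ValueError("Z must be non-zero")
    return (u / w, v / w)  # normalise so that the third component is 1


print(project(2.0, (1.0, 2.0, 4.0)))  # (0.5, 1.0)
```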

Principal point offset

So far, we have assumed that the origin of the image coordinate system lies at the image’s center. Often that is not the case, so we add an offset to the image point:

$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = s \begin{pmatrix} fX/Z + \hat{c}_x \\ fY/Z + \hat{c}_y \\ 1 \end{pmatrix} = s \begin{pmatrix} fX + \hat{c}_xZ \\ fY + \hat{c}_yZ \\ Z \end{pmatrix} = s \begin{pmatrix} f & 0 & \hat{c}_x & 0 \\ 0 & f & \hat{c}_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}.$

Pixels as unit

Until now we have used $f$, $\hat{c}_x$ and $\hat{c}_y$ in units of length, which is not appropriate for pixel-based digital images. Hence, we introduce $k_x$ and $k_y$, the number of pixels per unit of length ([pixel/length]) in the $x$- and $y$-direction. This gives us pixels as the unit:

$\begin{align*} f_x &= k_x \cdot f & c_x &= k_x \cdot \hat{c}_x \\ f_y &= k_y \cdot f & c_y &= k_y \cdot \hat{c}_y \end{align*}$

This leads to

$\begin{equation} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = s \begin{pmatrix} f_x & 0 & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \end{equation}$

with the camera matrix

$\mathbf{C} = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}.$
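The following sketch converts metric parameters to pixel units and applies equation (5). It also reads (5) backwards for a known depth $Z$, which is the step needed to turn a pixel back into a 3D point. All function names and numbers are hypothetical (a 4 mm focal length and 128 pixels per mm are chosen only to give round results); they are not calibration values of a real Kinect.

```python
def intrinsics_in_pixels(f, c_hat_x, c_hat_y, k_x, k_y):
    """Convert f and the principal point offset from length units to pixels."""
    return (k_x * f, k_y * f, k_x * c_hat_x, k_y * c_hat_y)


def project_to_pixel(point, f_x, f_y, c_x, c_y):
    """Equation (5), normalised so that the third component is 1."""
    X, Y, Z = point
    if Z == 0:
        raise ValueError("Z must be non-zero")
    return (f_x * X / Z + c_x, f_y * Y / Z + c_y)


def backproject(pixel, Z, f_x, f_y, c_x, c_y):
    """Solve equation (5) for X and Y, assuming the depth Z is known."""
    x, y = pixel
    return ((x - c_x) * Z / f_x, (y - c_y) * Z / f_y, Z)


# Assumed values: lengths in mm, k in pixels per mm.
f_x, f_y, c_x, c_y = intrinsics_in_pixels(
    f=4.0, c_hat_x=2.5, c_hat_y=1.875, k_x=128.0, k_y=128.0)
print(f_x, f_y, c_x, c_y)  # 512.0 512.0 320.0 240.0

pixel = project_to_pixel((0.5, 0.25, 2.0), f_x, f_y, c_x, c_y)
print(pixel)  # (448.0, 304.0)

point = backproject(pixel, 2.0, f_x, f_y, c_x, c_y)
print(point)  # (0.5, 0.25, 2.0)
```

Note that a pixel alone only determines a ray; the depth $Z$ has to come from somewhere else to fix the point on that ray.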

Transformation

As a last step, we have to consider the different positions and orientations of the depth and RGB cameras. To combine the two resulting coordinate systems, we use a transformation that consists of a rotation and a translation. It can be written as the following matrix:

$\mathbf{T} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{pmatrix}$

We can now extend our equation (5) to:

$\begin{align*} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} &= s \underbrace{ \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} } \underbrace{ \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{pmatrix} } \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \\ \nonumber\tilde{p} \hspace{0.9em} &= s \hspace{2.8em} \mathbf{C} \hspace{6.8em} \mathbf{T} \hspace{5.0em} \tilde{P} \end{align*}$
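As a sketch of the complete formula, the following plain-Python example multiplies $\mathbf{C}$, $\mathbf{T}$ and $\tilde{P}$ and then normalises the result. The matrices are made-up example values (identity rotation, a shift of 0.5 along $X$), not real calibration data.

```python
def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def project(C, T, point):
    """Apply C * T to the homogeneous point and return the pixel (x, y)."""
    X, Y, Z = point
    cam = matvec(T, [X, Y, Z, 1.0])   # rotated and translated point (3 components)
    u, v, w = matvec(C, cam)
    if w == 0:
        raise ValueError("transformed depth must be non-zero")
    return (u / w, v / w)             # dividing by w fixes the scale factor s


# Hypothetical intrinsics and extrinsics.
C = [[512.0, 0.0, 320.0],
     [0.0, 512.0, 240.0],
     [0.0, 0.0, 1.0]]
T = [[1.0, 0.0, 0.0, 0.5],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]

print(project(C, T, (0.0, 0.25, 2.0)))  # (448.0, 304.0)
```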

Summary

The fundamental formula

$\begin{align} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} &= s \underbrace{ \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} } \underbrace{ \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{pmatrix} } \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \\ \nonumber\tilde{p} \hspace{0.9em} &= s \hspace{2.8em} \mathbf{C} \hspace{6.8em} \mathbf{T} \hspace{5.0em} \tilde{P} \end{align}$

describes the relation between a three-dimensional point in space $P = (X,Y,Z)^T \in \mathbb{R}^3$ captured by a camera and its corresponding two-dimensional point $p = (x,y)^T \in \mathbb{R}^2$ in the image plane.

The parameters are called:

$\mathbf{C}$ : intrinsic parameters or intrinsics
$\mathbf{T}$ : extrinsic parameters or extrinsics

It is important to understand that our model (except for the transformation) has been derived for one general camera. In the context of the Kinect, you have separate intrinsics for the depth and RGB cameras!

In addition, we used the transformation to combine the coordinate systems of the depth and RGB cameras. This implies that there is only one transformation matrix.

The application of our model/formula in the context of the Kinect is explained in the class description.
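To illustrate how two sets of intrinsics and a single transformation fit together, here is a rough sketch of one possible pipeline: back-project a depth pixel, move the point into the RGB camera's coordinate system, and project it there. It assumes that the depth $Z$ of the pixel is known and that $\mathbf{T}$ maps depth-camera coordinates to RGB-camera coordinates; the direction of $\mathbf{T}$, all names and all numbers are assumptions for this example and are not taken from Kiretu's code.

```python
def backproject(pixel, Z, intr):
    """Depth pixel plus known depth -> 3D point in depth-camera coordinates."""
    x, y = pixel
    f_x, f_y, c_x, c_y = intr
    return ((x - c_x) * Z / f_x, (y - c_y) * Z / f_y, Z)


def transform(R, t, P):
    """Apply the rotation R and translation t (the two parts of T) to P."""
    return tuple(sum(r * c for r, c in zip(row, P)) + t_i
                 for row, t_i in zip(R, t))


def project(P, intr):
    """3D point in camera coordinates -> pixel, using that camera's intrinsics."""
    X, Y, Z = P
    f_x, f_y, c_x, c_y = intr
    if Z == 0:
        raise ValueError("Z must be non-zero")
    return (f_x * X / Z + c_x, f_y * Y / Z + c_y)


# Hypothetical calibration: (f_x, f_y, c_x, c_y) for each camera.
depth_intr = (512.0, 512.0, 320.0, 240.0)
rgb_intr = (520.0, 520.0, 330.0, 250.0)
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (0.25, 0.0, 0.0)

P_depth = backproject((448.0, 304.0), 2.0, depth_intr)  # (0.5, 0.25, 2.0)
P_rgb = transform(R, t, P_depth)                        # (0.75, 0.25, 2.0)
print(project(P_rgb, rgb_intr))                         # (525.0, 315.0)
```

The resulting pixel tells us where to look up the colour for the reconstructed point in the RGB image.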

[1]

Hartley, Richard and Zisserman, Andrew: Multiple View Geometry. Slides CVPR-Tutorial, 1999. http://users.cecs.anu.edu.au/~hartley/Papers/CVPR99-tutorial/tutorial.pdf

[2]

Kläser, Alexander: Kamerakalibrierung und Stereo Vision. Written report MIBI-Seminar, FH Bonn-Rhein-Sieg, 2005. http://www2.inf.fh-bonn-rhein-sieg.de/mi/lv/smibi/ss05/stud/klaeser/klaeser_ausarbeitung.pdf

[3]

Wikipedia (English): Pinhole camera model. http://en.wikipedia.org/wiki/Pinhole_camera_model [2012-01-22]

Author: Daniel Wunderlich (d.wunderlich@stud.uni-heidelberg.de)

Date: 2012-01-26