Kiretu

To reconstruct a point cloud with the Kinect, it is important to understand how reconstruction works in theory. Several steps are necessary, but almost all of them are based on one fundamental formula, which is derived below. The derivation follows [1] and [2] (German).
Because the derivation can be hard for beginners to follow, I have tried to explain every step in detail.
We start with the model of a pinhole camera [3]. Let $P = (X, Y, Z)^T$ be a point in space which is mapped onto the point $p = (x, y)^T$ in the image plane.
To simplify the model, we mirror the image plane along the $Z$ axis so that it lies in front of the camera, between the optical center $C$ and the point $P$:
There are two coordinate systems: the camera coordinate system (axes $X$, $Y$, $Z$) and the image coordinate system (axes $x$, $y$). Note that the coordinates of $P$ and $p$ are arbitrary but fixed, so don't mix them up with the axes of the coordinate systems.
We look at the scene above from the side:
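$f$ is the distance between the optical center and the image plane and is called the focal length. By the intercept theorem we get the following equations:

$$x = \frac{f X}{Z}, \qquad y = \frac{f Y}{Z}$$

We can combine these two equations into a vector:

$$\begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z} \begin{pmatrix} X \\ Y \end{pmatrix} \tag{1}$$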
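The intercept-theorem relation can be tried out numerically. The following is a minimal sketch, assuming the point is given in camera coordinates as `(X, Y, Z)` with `Z > 0` and that the focal length `f` uses the same unit of length; the function name and the sample values are made up for illustration.

```python
def project_pinhole(point, f):
    """Map a 3D point (X, Y, Z) in camera coordinates to (x, y) on the image plane."""
    X, Y, Z = point
    if Z <= 0:
        # The model only covers points in front of the optical center.
        raise ValueError("point must lie in front of the camera (Z > 0)")
    # Intercept theorem: x / f = X / Z and y / f = Y / Z
    return (f * X / Z, f * Y / Z)


# Hypothetical values: a point 2 units in front of the camera, focal length 0.5
x, y = project_pinhole((0.2, -0.1, 2.0), f=0.5)
print(x, y)
```

Doubling `Z` while keeping `X` and `Y` fixed halves the image coordinates, which is the familiar effect of distant objects appearing smaller.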
We now take a quick look at the most important properties of homogeneous coordinates. If you have never heard of homogeneous coordinates, you should read up on the topic first.
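Map a point to its homogeneous coordinates:

$$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{2}$$

Equivalence of homogeneous coordinates:

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} w x \\ w y \\ w \end{pmatrix}, \qquad w \neq 0 \tag{3}$$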
You should keep these relations in mind.
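These two relations are easy to check in code. The sketch below is an illustration only; the helper names `to_homogeneous`, `from_homogeneous` and `equivalent` are invented here and do not come from the referenced sources.

```python
def to_homogeneous(point):
    """Append a 1 as the last component: (x, y) -> (x, y, 1)."""
    return tuple(point) + (1.0,)


def from_homogeneous(h):
    """Divide by the last component: (w*x, w*y, w) -> (x, y)."""
    *coords, w = h
    if w == 0:
        raise ValueError("last component must be non-zero")
    return tuple(c / w for c in coords)


def equivalent(a, b, tol=1e-9):
    """Two homogeneous vectors are equivalent if they describe the same point."""
    pa, pb = from_homogeneous(a), from_homogeneous(b)
    return len(pa) == len(pb) and all(abs(u - v) <= tol for u, v in zip(pa, pb))


h = to_homogeneous((3.0, 4.0))            # (3.0, 4.0, 1.0)
print(equivalent(h, (6.0, 8.0, 2.0)))     # same point, scaled by w = 2
```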
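We now map our points to their homogeneous coordinates as in (2):

$$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \mapsto \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

This leads us to our equation (1) in homogeneous coordinates

$$Z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f X \\ f Y \\ Z \end{pmatrix}$$

where $Z$ represents the factor $w$ of (3). We can write this equation in the following way:

$$Z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$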
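To see that the matrix form says the same thing as the two scalar equations, one can multiply it out. This is a minimal sketch in plain Python; `matvec`, `projection_matrix` and `project` are illustrative names, and the 3x4 matrix is the one for an image origin at the image center.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def projection_matrix(f):
    """3x4 matrix of the pinhole model with focal length f."""
    return [
        [f, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]


def project(point, f):
    X, Y, Z = point
    # Homogeneous result (f*X, f*Y, Z); its last component is the scale factor.
    hx, hy, hw = matvec(projection_matrix(f), [X, Y, Z, 1.0])
    return (hx / hw, hy / hw)


print(project((0.2, -0.1, 2.0), f=0.5))
```

Dividing by the last component is exactly the step that removes the factor in front of the image point.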
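So far we have assumed that the origin of the image coordinate system is located at the image's center. Often that is not the case. Therefore, we add an offset $(x_0, y_0)^T$ to the image point:

$$Z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & x_0 & 0 \\ 0 & f & y_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

Until now we used $x$ and $y$ in units of length, which is not appropriate for pixel-based digital images. Hence, we introduce $k_x$ and $k_y$, the number of pixels per unit of length ([pixel/length]) in the $x$ and $y$ direction. With $u = k_x x$ and $v = k_y y$ we then get pixels as the unit:

$$Z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} k_x f & 0 & k_x x_0 & 0 \\ 0 & k_y f & k_y y_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

This leads to

$$Z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \, (I \mid 0) \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \tag{5}$$

with the camera matrix

$$K = \begin{pmatrix} k_x f & 0 & k_x x_0 \\ 0 & k_y f & k_y y_0 \\ 0 & 0 & 1 \end{pmatrix}$$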
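As an illustration of how the camera matrix turns a point in camera coordinates into pixel coordinates, here is a minimal sketch. It assumes a camera matrix without skew, built from a focal length `f`, pixel densities `kx`, `ky` and an image-center offset `(x0, y0)` in units of length; all names and numbers are hypothetical and not taken from a real Kinect calibration.

```python
def camera_matrix(f, kx, ky, x0, y0):
    """3x3 camera matrix (no skew) in pixel units."""
    return [
        [kx * f, 0.0, kx * x0],
        [0.0, ky * f, ky * y0],
        [0.0, 0.0, 1.0],
    ]


def to_pixels(K, point):
    """Project (X, Y, Z) in camera coordinates to pixel coordinates (u, v)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    # Rows of K applied to (X, Y, Z), then division by the scale factor Z.
    u = (K[0][0] * X + K[0][2] * Z) / Z
    v = (K[1][1] * Y + K[1][2] * Z) / Z
    return (u, v)


# Hypothetical camera: f = 2, 100 pixels per unit length, offset (3.2, 2.4)
K = camera_matrix(f=2.0, kx=100.0, ky=100.0, x0=3.2, y0=2.4)
print(to_pixels(K, (1.0, 0.5, 4.0)))  # approximately (370.0, 265.0)
```

A point on the optical axis (`X = Y = 0`) always lands on the pixel given by the offset, here roughly (320, 240).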
As a last step, we have to account for the different positions and orientations of the depth and RGB cameras. To combine the two resulting coordinate systems, we use a transformation which consists of a rotation $R$ and a translation $t$. This can be written as the following matrix:

$$\begin{pmatrix} R & t \\ 0^T & 1 \end{pmatrix}, \qquad R \in \mathbb{R}^{3 \times 3}, \quad t \in \mathbb{R}^{3}$$
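We can now extend our equation (5) to:

$$w \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \, (I \mid 0) \begin{pmatrix} R & t \\ 0^T & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

Here the factor $w$ is the depth of the point after the transformation has been applied.
The fundamental formula

$$w \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \, (R \mid t) \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

describes the relation between a three-dimensional point in space captured by a camera and its equivalent two-dimensional point in the image plane.
The parameters are called:
Intrinsic parameters (intrinsics): the focal length, the pixels per unit of length and the image-center offset, collected in the camera matrix $K$.
Extrinsic parameters (extrinsics): the rotation $R$ and the translation $t$.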
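The complete chain (rigid transformation first, then projection with the camera matrix) can be sketched as follows. This is an illustration under stated assumptions: the camera matrix `K` has no skew, the rotation is a plain 3x3 list of rows, and all values are hypothetical. `rotation_y` is only a helper for building an example rotation.

```python
import math


def rotation_y(angle):
    """Rotation matrix for a rotation by `angle` (radians) about the Y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def project_full(K, R, t, point):
    """Apply the rotation R and translation t, then project with K."""
    # Rigid transformation: point_cam = R * point + t
    Xc, Yc, Zc = (
        sum(r * p for r, p in zip(row, point)) + ti for row, ti in zip(R, t)
    )
    if Zc <= 0:
        raise ValueError("transformed point must lie in front of the camera")
    # The scale factor of the homogeneous result is the transformed depth Zc.
    u = (K[0][0] * Xc + K[0][2] * Zc) / Zc
    v = (K[1][1] * Yc + K[1][2] * Zc) / Zc
    return (u, v)


K = [[200.0, 0.0, 320.0], [0.0, 200.0, 240.0], [0.0, 0.0, 1.0]]  # hypothetical
print(project_full(K, rotation_y(0.0), (-1.0, 0.0, 0.0), (1.0, 0.5, 4.0)))
```

With the identity rotation and a zero translation the function reduces to the projection of a single camera.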
It is important to understand that our model (except for the transformation) has been derived for one general camera. In the context of the Kinect, you have separate intrinsics for the depth camera and the RGB camera!
In addition, we used the transformation to combine the coordinate systems of the depth and RGB cameras. This implies that there is only one transformation matrix.
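To make the roles of the two sets of intrinsics and the single transformation concrete, the following sketch maps one depth pixel to a 3D point and then into the RGB image. It is an illustration only and not necessarily how Kiretu implements it: it assumes the depth value is the Z coordinate in the depth camera's frame, that both camera matrices have no skew, and that the rotation and translation take depth-camera coordinates to RGB-camera coordinates. All calibration numbers are invented.

```python
def back_project(u, v, depth, K):
    """Invert the projection for a known depth: pixel (u, v) -> (X, Y, Z)."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(R, t, point):
    """Rigid transformation R * point + t."""
    return tuple(
        sum(r * p for r, p in zip(row, point)) + ti for row, ti in zip(R, t)
    )


def project(K, point):
    """Project (X, Y, Z) to pixel coordinates with a camera matrix K."""
    X, Y, Z = point
    return ((K[0][0] * X + K[0][2] * Z) / Z, (K[1][1] * Y + K[1][2] * Z) / Z)


# Hypothetical calibration values, for illustration only
K_depth = [[580.0, 0.0, 320.0], [0.0, 580.0, 240.0], [0.0, 0.0, 1.0]]
K_rgb = [[520.0, 0.0, 320.0], [0.0, 520.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.025, 0.0, 0.0)

point_depth = back_project(320.0, 240.0, 2.0, K_depth)  # 3D point, depth frame
point_rgb = transform(R, t, point_depth)                 # same point, RGB frame
print(project(K_rgb, point_rgb))                         # pixel in the RGB image
```

Repeating this for every depth pixel yields a point cloud, and the RGB pixel found for each point can be used to look up its color.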
The application of our model/formula in the context of the Kinect is explained in the class description.