This post is about my undergraduate final-year project: an augmented reality application implemented in Java. According to Wikipedia, augmented reality is the augmentation of a real-world view with computer-generated information such as sound or graphics. Here, the idea is to place 3D graphics on a live video feed according to the position and orientation of a specific object, called a marker, in the camera's field of view, creating the illusion of an object that isn't actually there. In short, the program captures images from a webcam, detects the marker, renders a 3D object “on” the marker and shows the new image. Screenshots of the program's output are at the end of the article. Multiple templates (markers) can be used, with a 3D object chosen for each template, so that a specific 3D object is rendered for a specific template.

The program consists of 11 classes. Three third-party libraries are used: one for getting images from the webcam, one for solving mathematical equations and one for rendering 3D objects. These are the Web Cam Capture API, EJML and Java 3D, respectively. As shown in the figure, the main workflow of the program consists of the following steps, all carried out in real time: capture an image from the camera, search the image for each template and detect whether one is present, solve for the 3D transformation that maps the position and orientation of the template to those of the one found in the image, place the 3D object that corresponds to the template and transform it according to that 3D transformation, render the 3D object on the image and finally display the new image. There are also classes for saving and loading templates and for loading 3D objects.

The program mainly runs in the class named **LiveCam**. A separate thread is defined here so that the AR operations don't block the user interface and the buttons remain clickable. When the **Augmented Reality** button is clicked, the main loop in the **run** method starts. The same button also sets the variable named **recognitionStarted** to true. This variable ensures that no 3D object is rendered while no template is recognized. When recognition starts, the first thing done is to search for templates in the captured image. If a template is found, the variable named **templateRecognized** is set to true. If no template is found, recognitionStarted is set back to false and the template search starts again. If a template (marker) is recognized, we have the four outermost corners of the template (assuming the templates are rectangular). To achieve real-time performance, these four corners are tracked across successive frames instead of re-detecting the templates in every frame. Of course, this trades robustness for speed. If we lose even one of the corners, we remove the 3D object from the scene and restart the template search. Next, the corners in the captured frame that correspond to the four outermost corners of the recognized template are marked. This is for debugging purposes and can be omitted. After finding the corresponding corners we have four pairs of points, meaning we know which corner in the captured frame matches the upper-left corner of the template, which matches the upper-right corner, and so on. With four pairs of points, we can calculate the homography matrix. The homography matrix determines the perspective projection that maps our template onto the captured frame.
That is, if we applied this transformation to the untransformed template, it would be translated and rotated so that it sits exactly at the position and orientation it has in the captured frame. So, we now have the homography matrix and need to extract translations and rotations from it so that we can transform our 3D object and place it where the template appears in the captured frame. Finally, we transform the 3D object accordingly. That's a summary of what the class LiveCam does. Now for the details.
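The control flow above can be sketched as a small state machine. The flag and method names follow the post, but the camera, detection and tracking calls are hypothetical stubs (injected as suppliers) since the real LiveCam code is not shown here:

```java
// Minimal sketch of the LiveCam control flow: search for a template,
// switch to corner tracking on success, fall back to searching on failure.
// Frame capture, detection and rendering are stubbed out as placeholders.
import java.util.function.Supplier;

public class LiveCamSketch {
    boolean recognitionStarted = false;
    boolean templateRecognized = false;

    // Stub hooks, injected so the flow can be exercised without a webcam.
    Supplier<Boolean> detectTemplate;   // true if a template is found in the frame
    Supplier<Boolean> trackCorners;     // true while all four corners are still tracked

    LiveCamSketch(Supplier<Boolean> detect, Supplier<Boolean> track) {
        this.detectTemplate = detect;
        this.trackCorners = track;
    }

    // One iteration of the run() loop, after the Augmented Reality button
    // has set recognitionStarted = true.
    void step() {
        if (!templateRecognized) {
            if (detectTemplate.get()) {
                templateRecognized = true;   // corners found: switch to tracking
            } else {
                recognitionStarted = false;  // nothing found: restart the search
            }
        } else if (!trackCorners.get()) {
            // Lost a corner: remove the 3D object and search again.
            templateRecognized = false;
        }
        // else: corners tracked, solve homography and render (omitted here).
    }
}
```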

Template recognition is done in the class **TemplateRecognition**. When the method recognizeTemplateInImage is called, the edges and corners of the current frame are detected. Then the templates serialized earlier are loaded from the file system. This causes a small delay at the start of each template recognition run; it could be changed so that the templates are loaded only once, at the start of the program, simply by calling the method loadTemplates in the constructor. Next, each template is compared with the current frame to check whether it is present. If a template is present, an Object array containing the four outer corners of the template, the four corresponding corners in the image and the name of the matched template is returned.

The method **recognizeTemplateInImage** calls the method **findMatchingPoints** in the class **ObjectRecognition** to detect the current template. Here, object recognition is achieved by matching the corners of the untransformed template to the corners of the template in the current frame. A natural way to do this would be to randomly select 3 points from both the untransformed template and the template in the frame, find the affine transformation that maps one point set onto the other, apply this transformation to the remaining points of the untransformed template and count how many points coincide. However, finding the affine transformation that matches the majority of the points this way would take a long time, and since the application must run in real time, we can't afford that delay. A better way is to select the points from a set of points that are connected by edges; that is, the 3 points are selected from a connected component. This is shown in the following figure, in which connected edge regions share the same color. If we select the 3 points from a larger set, say 5 points that are connected together, we increase our chance of finding the correct match, because there won't be many connected components with exactly 5 points in both the untransformed template and the frame, assuming there isn't much clutter in the background. The variable named **cornersInAGroup** holds the current number of points in the component(s) we're interested in. Of course, there could be more than one connected component with exactly five points, but this technique still speeds up the process. Extracting connected corners from the labeled edge regions is done in the method **findConnectedCorners**; labeling of the connected edge regions is done in the class **EdgeDetection**.

Obviously we shouldn't select these 3-point sets completely at random, because some points may be selected more than once. Also, the order of the selected points matters. Assume we randomly selected 3 points from a connected component with 5 points in the untransformed template, then randomly selected 3 points from a connected component in the frame that has the same number of points. The probability of them being corresponding point sets is **1/(C(5,3) × P(3,3)) = 1/60**, which is a very low probability and not affordable in real time. However, if we never choose the same triple more than once and try all of its permutations, we increase our chances. This is essentially the same as choosing 3 points from the 5 points in order. The purely random option would also require remembering which triples we had already tried, so the systematic option is preferred here. In short, we first select 3 points from a connected component of the untransformed template that has cornersInAGroup points; these 3 points are fixed. Then we try all possible combinations and permutations of 3 points from a connected component of the template in the frame that also has cornersInAGroup points, and calculate the affine transformation that maps each candidate triple onto the fixed one. If this transformation matches the remaining points of the template in the frame to the ones in the untransformed template as well, we continue with the next step; otherwise, we try new combinations. If no match is found, we do nothing.
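The systematic enumeration described above can be sketched as follows. This is my own reconstruction, not the original code: it generates every combination of 3 distinct corner indices and expands each into its 6 permutations, giving the C(n,3) × P(3,3) ordered triples to try against the fixed template triple:

```java
// Enumerate every ordered triple of distinct corner indices from a connected
// component with n corners: all C(n,3) combinations, each expanded into its
// 3! = 6 permutations. Each triple would then be fed to the affine check.
import java.util.ArrayList;
import java.util.List;

public class TripleEnumerator {
    public static List<int[]> orderedTriples(int n) {
        List<int[]> triples = new ArrayList<>();
        for (int a = 0; a < n; a++)
            for (int b = a + 1; b < n; b++)
                for (int c = b + 1; c < n; c++) {
                    // All 6 orderings of the combination {a, b, c}.
                    int[][] perms = {
                        {a, b, c}, {a, c, b}, {b, a, c},
                        {b, c, a}, {c, a, b}, {c, b, a}
                    };
                    for (int[] p : perms) triples.add(p);
                }
        return triples;
    }
}
```

For a component with 5 corners this yields 60 candidate triples, matching the 1/60 probability of a single random ordered guess being correct.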

Once we've selected our triple, we must find the corresponding affine transformation. An affine transformation is a linear mapping that preserves points, straight lines and planes; sets of parallel lines remain parallel after an affine transformation. It consists of translation, scale, shear and rotation. The reason for using an affine transformation instead of a projective transformation (a homography) in the template detection phase is simply performance. I didn't try calculating a homography here, but I expect that computing one for every new combination or permutation of 3 points would slow the process down. It could also mean that, in the first frame after the detection phase, the points we should track are lost because of the delay between frames. In any case, the affine transformation can be represented by a 2×3 matrix which, when applied to the 3×3 matrix holding the homogeneous coordinates of the 3 points, gives a 2×3 matrix with the coordinates of the transformed points.

\(\begin{bmatrix} a& b& c\\ d& e& f\end{bmatrix}\times \begin{bmatrix} x_1& x_2& x_3\\ y_1& y_2& y_3\\ 1& 1& 1\end{bmatrix}=\begin{bmatrix} u_1& u_2& u_3\\ v_1& v_2& v_3\end{bmatrix}\)

This is just the matrix form of the equations below.

\(\begin{matrix} ax_1 + by_1 + c = u_1& dx_1 + ey_1 + f = v_1\\ ax_2 + by_2 + c = u_2& dx_2 + ey_2 + f = v_2\\ ax_3 + by_3 + c = u_3& dx_3 + ey_3 + f = v_3\end{matrix}\)

Here, \(\left \{ \left ( x_1,y_1 \right ),\left ( x_2,y_2 \right ),\left ( x_3,y_3 \right ) \right \}\) and \(\left \{ \left ( u_1,v_1 \right ),\left ( u_2,v_2 \right ),\left ( u_3,v_3 \right ) \right \}\) are known: they are the image coordinates of the points from the untransformed template and from the template in the frame, respectively. So we solve these equations for the coefficients **a,…,f**, which are the elements of the affine matrix. This is done in the method **solve3x3Equation**, which uses Cramer's rule. Since the coefficients **a,b,c** and **d,e,f** are independent, we can calculate them separately. Once we have the affine matrix, we use it to transform the remaining points of the template in the image. Finally, we count the number of points of the untransformed template and of the template in the frame that coincide, using Euclidean distance as the metric, and either continue with the next step or try a new combination or permutation.

Now that we have point correspondences, we can calculate the homography. A homography is a transformation between two planes in projective space. We need it to get the position and orientation of our marker in the 3D world with respect to the untransformed marker. One can think of the homography as several transformations applied one after another to the points of interest: a rotation and translation in 3D world coordinates that moves the plane (the marker) to its new position and orientation, a perspective projection that maps 3D points on the plane to 2D points on the film of the camera, and an affine transformation that maps those 2D points on the film to the final 2D points we see on the screen. These can all be collected into one transformation. However, what we really want is not to project points from 3D world space to the 2D screen; to be precise, we want to map points lying on a plane in 3D space to the 2D screen. We then need only two axes to represent our points, because they lie on a plane, even though the plane sits in 3D. The figure below shows the overall process.

Having 2D points, we can omit one column of the rotation matrix. Finally, we can collect the perspective projection and the internal parameters of the camera into one matrix, called the **intrinsic matrix**. The intrinsic matrix represents the internal parameters of the camera we used: the focal lengths (fx and fy), the skew parameter (s) and the optical center (cx and cy). More detailed information can be found here.

\(\begin{bmatrix} f_x& s& c_x\\ 0& f_y& c_y\\ 0& 0& 1\end{bmatrix}\)

We express the homography transformation in matrix form below, where **x(w)** represents points on the plane in the world coordinate system, **x(c)** represents points on the image plane of the camera and **K** is the intrinsic matrix of the camera.

\(\begin{aligned} x^{(c)} &= K*\begin{bmatrix} r_1& r_2& r_3& t\end{bmatrix}*\begin{bmatrix} x^{(w)}& y^{(w)}& 0& 1\end{bmatrix}^T\\ x^{(c)} &= K*\begin{bmatrix} r_1& r_2& t\end{bmatrix}*\begin{bmatrix} x^{(w)}& y^{(w)}& 1\end{bmatrix}^T\\ x^{(c)} &= H*\begin{bmatrix} x^{(w)}& y^{(w)}& 1\end{bmatrix}^T \end{aligned}\)

Finding the intrinsic camera matrix is called **camera calibration**. It is an offline process and can be performed using Matlab. Generally, a checkerboard is photographed in different orientations, either holding the camera fixed and moving the checkerboard or moving the camera and holding the checkerboard fixed. I used the webcam of my notebook, so the camera is fixed. The figure below shows 6 of the 26 images used for camera calibration.

All of the images should be provided to Matlab; in fact, the more checkerboard images, the better. The code below can be used to calibrate the camera. Here, the variable **squareSize** is the length of one side of a checkerboard square in millimeters. After the calibration, the intrinsic matrix can be obtained from the output variable params. However, Matlab gives the intrinsic matrix in a transposed format, so one may want to transpose it back before using it.

```matlab
% Get file names of the images as a cell array.
images = dir('*.jpg');
imageFileNames = {images(:).name}';

% Detect checkerboard points in the images.
[imagePoints, boardSize] = detectCheckerboardPoints(imageFileNames);

% Generate world coordinates of the corners of the squares.
squareSize = 36; % millimeters
worldPoints = generateCheckerboardPoints(boardSize, squareSize);

% Calibrate the camera.
[params, ~] = estimateCameraParameters(imagePoints, worldPoints);
```

After obtaining both the homography matrix and the intrinsic matrix, we can estimate the position and orientation of the marker in 3D world coordinates by simply multiplying the inverse of the intrinsic matrix with the homography matrix.

\(\begin{aligned} H &= K*[R|T]\\ K^{-1}*H &= K^{-1}*K*[R|T] = [R|T] \end{aligned}\)

Estimation of the homography and recovery of the rotation and translation of the marker are performed in the method **solveHomography**. Homography estimation is a fairly straightforward process, apart from some optimizations that are not adopted here. We build a matrix **A** and a vector **b** from the point correspondences and solve for the vector **x** that gives the vector b when multiplied with the matrix A. We subtract 274 and 189 from the x and y coordinates of the untransformed template, respectively, to make the 3D object appear in the middle of the marker: since we take the top-left corner of the untransformed marker as the origin of the world coordinate system, subtracting these values moves the origin to the center of the template. The 8 elements of the vector x form the 3×3 homography matrix, with a 1 as the last element. We then obtain the orientation matrix using the procedure mentioned above. This matrix includes rotation around the x and y axes and translation in x, y and z. Because the axes must be orthogonal, the rotation axis for z is simply the cross product of the vectors rx and ry. We then normalize our rotation and translation vectors to remove the scale ambiguity that arises when solving the linear system. Finally, we return the matrix \([R|T]\).
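The decomposition step can be sketched as follows. This mirrors the procedure described above (multiply K⁻¹ by H, normalize the first two columns to unit rotation axes, take their cross product as the third axis, and scale the translation by the same factor); it is my reconstruction with plain arrays, while the original solveHomography uses EJML:

```java
// Recover the 3x4 pose matrix [r1 r2 r3 t] from the homography H and the
// inverse intrinsic matrix K^-1. K^-1 * H gives [r1 r2 t] up to scale;
// the scale is the norm of the first column, and r3 = r1 x r2.
public class PoseFromHomography {
    // kInv and h are 3x3 row-major matrices. Returns 3x4 [R|T].
    public static double[][] decompose(double[][] kInv, double[][] h) {
        double[][] m = multiply(kInv, h);          // [r1 r2 t], up to scale
        double lambda = norm(column(m, 0));        // shared scale factor
        double[] r1 = scale(column(m, 0), 1 / lambda);
        double[] r2 = scale(column(m, 1), 1 / lambda);
        double[] t  = scale(column(m, 2), 1 / lambda);
        double[] r3 = cross(r1, r2);               // axes must be orthogonal
        double[][] rt = new double[3][4];
        for (int i = 0; i < 3; i++) {
            rt[i][0] = r1[i]; rt[i][1] = r2[i]; rt[i][2] = r3[i]; rt[i][3] = t[i];
        }
        return rt;
    }
    static double[][] multiply(double[][] a, double[][] b) {
        double[][] c = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 3; k++) c[i][j] += a[i][k] * b[k][j];
        return c;
    }
    static double[] column(double[][] m, int j) {
        return new double[]{m[0][j], m[1][j], m[2][j]};
    }
    static double norm(double[] v) {
        return Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    }
    static double[] scale(double[] v, double s) {
        return new double[]{v[0] * s, v[1] * s, v[2] * s};
    }
    static double[] cross(double[] a, double[] b) {
        return new double[]{a[1] * b[2] - a[2] * b[1],
                            a[2] * b[0] - a[0] * b[2],
                            a[0] * b[1] - a[1] * b[0]};
    }
}
```

Note that normalizing by the first column is the simplest choice; averaging the norms of the first two columns is a common refinement.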

Now, we apply the rotation and translation to the object. Java 3D units are meters, so we convert our centimeter values to meters by dividing the translation vector by 100. Next, we extract Euler angles from the rotation matrix (we could also use the rotation matrix itself). Finally, we apply the rotations and translations to the 3D object and render it on the screen.
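Extracting the Euler angles can be sketched as below. This assumes one common convention, R = Rz · Ry · Rx, and ignores the gimbal-lock case (|r20| close to 1) for brevity; which convention applies depends on how the angles are fed to Java 3D, so treat this as an illustration rather than the project's exact code:

```java
// Extract Euler angles (rotations around x, y, z, in radians) from a
// 3x3 row-major rotation matrix, assuming R = Rz * Ry * Rx.
public class EulerAngles {
    public static double[] fromRotationMatrix(double[][] r) {
        double rotY = Math.asin(-r[2][0]);          // -sin(beta) at (2,0)
        double rotX = Math.atan2(r[2][1], r[2][2]); // cos(beta)*sin/cos(alpha)
        double rotZ = Math.atan2(r[1][0], r[0][0]); // cos(beta)*sin/cos(gamma)
        return new double[]{rotX, rotY, rotZ};
    }
}
```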

The program is begging for optimization. Here are some major improvements that could be made:

- Rendering 3D objects and tracking should run on separate threads.
- Templates should be loaded from the file system only once.
- The GPU, SIMD instructions or JNI could be used for vector-matrix operations (EJML may already do this, I haven't checked).
- Faster methods, like HOG or SC, could be used for template recognition.
- For robustness, image patches could be defined around the corners of the marker and used for tracking, as in KLT.

The classes **EdgeDetection** and **CornerDetection** are not covered here, to keep the post from getting too long. They will probably be covered in a later post about other projects.

The figure below shows some of the results.