With the rapid development of robotics, many computer vision systems have been developed. Stereo vision is a computer vision technique that uses two aligned cameras to extract depth information from a scene: a depth map can be estimated by comparing the shift of corresponding pixels between the two cameras. In a real-time stereo system, however, the images have to be processed at a reasonable frame rate. Real-time stereo vision allows a mobile robot platform to navigate autonomously and interact with the environment by providing the depth of the scene. Many solutions to the stereo correspondence problem currently exist, but few were developed for real-time mobile robot platforms, where computational power and hardware are very limited. In this paper, we propose a low-cost stereo vision system able to extract the depth of moving objects at a frame rate of 30 Hz. The proposed method is based on a local dense stereo correspondence algorithm that uses prior foreground extraction to simplify the disparity computation. Additionally, this paper proposes a method based on a weighted mean to improve the stability of the disparity image. Experiments show the robustness and good time performance of the proposed approaches.
Keywords: stereo vision, distance measurement, foreground segmentation
As the world progresses towards human comfort and automation, distance perception forms an integral part of automated navigation and motion. There are currently many techniques to determine the distance to an object; popular ones include ultrasonic range finding, radar range finding and laser range finding. Most of these techniques involve recording the time between transmission and reception of a signal. Another way to measure distance is to use a pair of cameras, called a stereo vision camera system.
Stereo vision is an imaging technique that allows the reconstruction of point coordinates in three-dimensional space based on images acquired from two cameras. By detecting the same object in the corresponding frames, its (x, y, z) coordinates can be precisely determined. Stereo vision systems are widely used in many applications, such as navigation of autonomous mobile robots [1], 3D measurements [2], object tracking [3], the movie industry [4], augmented reality or people tracking and identification systems [5]. In most of these applications, the high accuracy of stereo vision is crucial for further processing steps.
Recently, several research works have been carried out to develop multiple vision sensors for object distance measurement and 3D reconstruction. One good example is the work proposed in [6], where the authors present a stereo vision system for intelligent vehicles that measures distance in order to avoid collisions with forward obstacles. The distance measurement is based on the disparity of the forward-facing obstacle in the two images acquired by the intelligent vehicle. Later, in [7], the authors proposed a method for enhancing the disparity map so that a more accurate object distance can be estimated. In their work, the authors combined three different cost metrics to acquire a reliable combined cost volume. The resulting cost volumes are merged into a combined one, which is further refined by exploiting scanline optimization. Its accuracy makes this methodology rank first among published methods.
Many solutions to the stereo correspondence problem currently exist, but few were developed for real-time mobile robot platforms, where computational power and hardware are very limited. In this paper, we propose a low-cost stereo technique able to extract the depth of moving objects at a frame rate of 30 Hz. The proposed method is based on a local dense stereo correspondence algorithm that uses prior foreground extraction to simplify the disparity computation. Additionally, this paper proposes a method based on a weighted mean to improve the stability of the disparity image.
Background
The 3D reconstruction process, which employs stereo vision, consists of determining the disparity between a pair of images generated by two cameras placed at different locations in space. Knowing these locations, we can compute a 3D map of the scene by calculating disparities and generating a disparity map. Stereo vision, just like human vision, deduces distance from two images taken from different views. It relies on the fact that the same object appears at slightly different locations in the left and the right eye, which is known as disparity. The overlap of the two views is used in biological vision to perceive depth, and stereo vision uses the same idea to perceive the three-dimensional structure of the world. A simplified stereo imaging system is shown in Fig. 1.
Fig. 1. A Simplified Stereo Imaging System
– f : Focal length of the cameras (the focal length must be the same for both cameras);
– P : Point in the real world;
– x_l : Coordinate of the point in the left image;
– x_r : Coordinate of the point in the right image;
– b : Baseline, the distance between the optical center of one camera and that of the other;
The goal of stereo vision research is to estimate depth information from a pair of stereo images. In order to improve the current techniques, we must understand the theory required to gather depth information from a calibrated stereo vision pair.
Camera Model and Calibration
The real world is three-dimensional. Although monocular depth estimation has been studied, it is complicated and difficult to obtain the corresponding depth information from only one image. However, we can calculate the depth map easily from two images obtained with a calibrated stereo camera. In order to analyze images with geometric theory, we need to model the imaging system and then process the images with geometric methods. Four coordinate systems [8] are used in stereo calibration: the world coordinate system, the camera coordinate system, the image plane coordinate system and the pixel coordinate system. Information in the world coordinate system is transformed into the camera coordinate system through an extrinsic parameter matrix 'W' (comprising a rotation matrix 'R' and a translation matrix 'T'), and then into the image plane coordinate system through the intrinsic parameter matrix 'K'. In this paper, the pinhole model is assumed. This model can be seen in Fig. 2 and maps a 2D point 'p' on the image to a point 'P' in 3D space. This mapping is expressed mathematically as,
p = K [R | T] P    (1)
Where 'K' is a 3x3 camera matrix, 'R' is a 3x3 orthonormal matrix representing the camera's orientation and 'T' is a column vector representing the camera's position.
Assuming that P = (X, Y, Z, 1) is a point in the world and p = (x, y, 1) is the corresponding point in pixel coordinates, we can form the projection equation ('s' is a skew parameter, usually taken as zero) [9]. The point in the camera coordinate system, P_c, can be expressed as,
P_c = R · P_w + T    (2)
Where 'R' represents a rotation matrix and 'T' represents a translation matrix. These two matrices identify the transformation between the known camera reference frame and the known world reference frame. The intrinsic parameter matrix consists of the camera focal length 'f', the optical center 'c' and the skew coefficient. The extrinsic parameters depend on the camera position and orientation and, unlike the intrinsic parameters, their values change when the camera pose changes.
Given a point p = (x, y) in the image plane, we can obtain the equation:
Z_c · (x, y, 1)^T = K [R | T] (X, Y, Z, 1)^T    (3)
In this way, we can find the correspondence between image plane coordinates and world coordinates according to the two matrices. One of the most important purposes of calibration is to calculate these matrices; the other is to obtain the distortion coefficients of the cameras. In this paper we use Zhang's calibration method [10] with VS2013 and the OpenCV library to perform stereo calibration. First, we designed a stereo system using two identical webcams to capture images of a chessboard pattern target from different angles. Then we used the captured images to calibrate the system.
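Assuming the pinhole model described above, the world-to-pixel mapping of equations (1)-(3) can be sketched in plain C++. The intrinsic and extrinsic values used with this helper are illustrative placeholders, not calibration results from our system:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Project a 3D world point P = (X, Y, Z) to pixel coordinates (u, v)
// via the pinhole model: camera coordinates Pc = R*P + T (equation (2)),
// then homogeneous pixel h = K*Pc, divided by the depth h[2].
std::array<double, 2> projectPoint(const std::array<std::array<double, 3>, 3>& K,
                                   const std::array<std::array<double, 3>, 3>& R,
                                   const std::array<double, 3>& T,
                                   const std::array<double, 3>& P) {
    std::array<double, 3> Pc{};
    for (int i = 0; i < 3; ++i) {
        Pc[i] = T[i];
        for (int j = 0; j < 3; ++j) Pc[i] += R[i][j] * P[j];
    }
    std::array<double, 3> h{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) h[i] += K[i][j] * Pc[j];
    return {h[0] / h[2], h[1] / h[2]};  // perspective division
}
```

With identity rotation, zero translation and a focal length of 500 px centered at (160, 120), a point on the optical axis projects to the principal point, as expected.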
Depth Calculation
Consider a point 'P' in the world coordinate system, as shown in Fig. 1. The point is projected onto the left and right 2D images at points x_l and x_r. From Fig. 1 it can be deduced that,
x_l = f · X / Z    (4)
x_r = f · (X − b) / Z    (5)
By combining equation (4) and equation (5) we get,
Z = f · b / (x_l − x_r) = f · b / d    (6)
Where 'f' represents the focal length and 'b' is the baseline, the distance between the two camera centers. The quantity d = x_l − x_r, known as the disparity, represents the pixel separation between corresponding points in the two images. In order to obtain the depth value 'Z', we need to find these parameters through camera calibration.
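Equation (6) can be expressed as a small helper. The focal length and baseline values used with it below are illustrative placeholders, not our calibration results:

```cpp
#include <cassert>
#include <cmath>

// Depth from disparity (equation (6)): Z = f * b / d, where f is the focal
// length in pixels, b is the baseline (here in cm) and d = x_l - x_r is the
// disparity in pixels. A non-positive disparity means no valid match.
double depthFromDisparity(double focalPx, double baselineCm, double disparityPx) {
    if (disparityPx <= 0.0) return -1.0;  // sentinel: no depth for this pixel
    return focalPx * baselineCm / disparityPx;
}
```

For example, with a 500 px focal length and a 12 cm baseline, a disparity of 100 px corresponds to a depth of 60 cm.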
Fig. 2. Pinhole camera model
Object Distance Measurement
The flow of the proposed real-time object measurement system starts with the acquisition of a stereo image pair. Then preprocessing is applied to both images and foreground extraction is performed: we segment the images to create a mask that isolates the target object being observed. After that, a block matching algorithm that uses the SAD (sum of absolute differences) method to find matching points is used to compute the disparity. Since the disparity provided by the block matching algorithm is not stable, in order to enhance the disparity quality and obtain a stable disparity map we implement a method that uses a weighted-mean image memory; more details are discussed in the next section. Fig. 3 shows the flow of the real-time object measurement system.
Fig. 3. Flow of the object measurement system
Image Acquisition
The stereo image acquisition is done using two identical webcams, set up in parallel and perfectly aligned, such that their sensors are coplanar and corresponding points are located on the same horizontal lines. The object distance can be measured once the object enters the field of view of both cameras.
Image Preprocessing
Image preprocessing is a common step in computer vision systems. Preprocessing increases image quality and improves computational efficiency. In [11], after acquisition of the image pair, the images are downscaled in order to improve computational speed; experiments show that the reduction of image resolution does not significantly affect the accuracy of the system.
Another important preprocessing step that improves computation speed is converting the images from the RGB color space to gray scale.
Image Segmentation Based Foreground Extraction
For a real-time stereo vision system, processing time is the most important consideration. Normally, the complete dense disparity map contains all the depth information of the scene captured by the cameras, including information that is not relevant for the system. This complicates post-processing steps such as distance estimation or 3D reconstruction of the target object. For this reason, this research implements a method that extracts foreground objects without noticeably increasing the processing load.
There is a vast number of techniques for image segmentation. The fastest approach is to find a threshold that divides the image intensities into two parts: pixels whose intensity is lower than the threshold are rejected. This method assumes that the foreground object has a distinctive color compared to the background.
Since the threshold is an image-dependent value that cannot be universally defined, an adaptive approach to computing it needs to be introduced. In this research, Otsu's method is selected [12]. This method reduces the gray level image to a binary image by minimizing the weighted within-class variance, which is equivalent to maximizing the between-class variance. According to Otsu's method, the binary image is divided into two phases: background and foreground. The method works best for images whose histograms contain two well-separated classes of pixels (bimodal histograms).
In order to get good results the following steps have to be considered:
(a) Apply Gaussian smoothing in order to reduce the noise in the original image. This can be done by convolving each pixel in the image with Gaussian Kernel.
(b) Compute threshold using Otsu’s method and produce a binary image. This is done by exhaustively searching for the threshold that maximizes the inter-class variance.
(c) Apply a morphological closing operation to eliminate downward outliers (small holes) in the image. This can be done by first dilating and then eroding the image.
(d) Extract obstacles by applying the binary image as a mask to the original image.
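Step (b) above, Otsu's exhaustive search for the threshold that maximizes the between-class variance, can be sketched as:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Otsu's threshold: exhaustively search the gray level t that maximizes the
// between-class variance w_bg * w_fg * (mean_bg - mean_fg)^2 of the histogram.
// Pixels with value <= t form the background class, pixels > t the foreground.
int otsuThreshold(const std::vector<uint8_t>& gray) {
    std::array<int, 256> hist{};
    for (uint8_t v : gray) ++hist[v];

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int t = 0; t < 256; ++t) sumAll += t * static_cast<double>(hist[t]);

    double sumBg = 0.0, wBg = 0.0, bestVar = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        wBg += hist[t];                 // background weight up to level t
        if (wBg == 0.0) continue;
        double wFg = total - wBg;       // foreground weight
        if (wFg == 0.0) break;
        sumBg += t * static_cast<double>(hist[t]);
        double meanBg = sumBg / wBg;
        double meanFg = (sumAll - sumBg) / wFg;
        double betweenVar = wBg * wFg * (meanBg - meanFg) * (meanBg - meanFg);
        if (betweenVar > bestVar) { bestVar = betweenVar; bestT = t; }
    }
    return bestT;
}
```

On a bimodal image the returned threshold falls between the two intensity clusters, so applying it as a mask separates foreground from background.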
Stereo Matching
Matching a 3D point between the two camera views can be computed only over the visual area in which the views of both cameras overlap. Once the physical geometry of the cameras is known, the depth of matching points in the two views can be recovered by triangulating their disparity. Since our system is geared toward depth perception, the segmentation-based foreground extraction technique is applied as a preprocessing step for stereo correspondence. In this paper, the local dense stereo correspondence method called the block matching algorithm is chosen due to its fast implementation. The block matching algorithm is a local dense stereo correspondence method first developed by [13]. It uses the SAD (sum of absolute differences) method to find matching points between the left and right rectified stereo images and keeps only strongly matching points. Although this algorithm produces a less accurate disparity map, it is very fast and thus ideal for real-time purposes.
There are three steps for the block matching algorithm [14], dealing with undistorted, rectified stereo pairs:
Prefiltering: a pre-filter is applied to the input images to normalize image brightness and enhance texture.
Matching: this process is carried out by sliding a SAD window. For every feature in one image, the best matching point is searched for along the same row of the other image. Since the images are rectified, every row is an epipolar line; hence, matching points must lie in the same row in both images.
Postfiltering: since the disparity map suffers from false matches, this step eliminates incorrect matching points induced by shadow or noise.
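The core of the matching step, the SAD search along an epipolar row, can be illustrated for a single pixel. This is a didactic sketch, not the optimized OpenCV implementation used in our system:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Minimal SAD block matcher for one pixel of a rectified pair: slide a
// (2*halfWin+1)^2 window along the same row of the right image and return
// the horizontal shift d with the smallest sum of absolute differences.
int sadDisparity(const std::vector<std::vector<uint8_t>>& left,
                 const std::vector<std::vector<uint8_t>>& right,
                 int row, int col, int halfWin, int maxDisp) {
    int best = 0;
    long bestCost = -1;
    for (int d = 0; d <= maxDisp && col - d - halfWin >= 0; ++d) {
        long cost = 0;
        for (int dy = -halfWin; dy <= halfWin; ++dy)
            for (int dx = -halfWin; dx <= halfWin; ++dx)
                cost += std::abs(int(left[row + dy][col + dx]) -
                                 int(right[row + dy][col + dx - d]));
        if (bestCost < 0 || cost < bestCost) { bestCost = cost; best = d; }
    }
    return best;
}
```

A full block matcher repeats this search for every pixel (and adds the pre- and postfiltering steps above); restricting it to the foreground mask is what makes our per-frame cost small.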
Although the disparity image can now be estimated in real time using stereo block matching, the results provided by the standard block matching algorithm are very unstable. Therefore, in order to estimate a disparity image with stable and accurate results while keeping the real-time computation requirement, a new approach based on a memory image was implemented. This approach generates a new disparity image: for each new block matching pass, the weighted mean of each element of the block matching disparity image and the previous memory image is copied to a new memory image, but only if a match was found for the target pixel and the disparity is not null. We can express this mathematically as:
S[k+1] = (α · S[k] + β · D[k]) / (α + β)    (7)
Where D[k] is the estimated disparity matrix at iteration k, S[k] denotes the memory matrix and S[k+1] is the succeeding iteration of the memory matrix after being combined with the newly estimated disparity matrix D[k]. α and β denote the input weights of the memory and new disparity elements, respectively. Finally, we can replace the block matching disparity with the memory image and use it to project 3D information.
Depth Estimation
Stereo vision tries to recover the 3D information of objects based on disparity and triangulation. Once the disparity map is obtained by stereo matching, the depth map used for 3D reconstruction can also be computed. In this paper, we use the re-projection matrix to compute 3D coordinates. To compute the depth information, the camera calibration matrix is used along with the disparity value of each pixel in order to estimate the corresponding 3D spatial coordinate of each pixel. This process is carried out using Eqs. (5), (7) and (8), which use the previously computed disparity map and calibration matrix. The computed 3D matrix is the same size as the original image, but each element contains a vector of length three corresponding to that pixel's x, y and z coordinates in the calibrated world coordinate system.
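A simplified version of this re-projection can be sketched as follows. It assumes both rectified cameras share the same principal point (cx, cy), which collapses the usual 4x4 re-projection matrix [X Y Z W]^T = Q [x y d 1]^T into four scalar lines; the constants used with it are illustrative:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Re-project a pixel (x, y) with disparity d to a 3D point, assuming both
// cameras share the principal point (cx, cy): X = x - cx, Y = y - cy,
// Z = f, W = d / b, then divide by W. Note Z/W = f*b/d, equation (6).
std::array<double, 3> reproject(double x, double y, double d,
                                double f, double b, double cx, double cy) {
    double X = x - cx;
    double Y = y - cy;
    double Z = f;
    double W = d / b;
    return {X / W, Y / W, Z / W};
}
```

Applied to every foreground pixel of the disparity map, this yields the dense 3D matrix described above; the Z component is the object distance reported in the experiments.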
Result and Discussion
The algorithm was developed in the C++ programming language with the OpenCV library and run on a laptop with a 3.2 GHz Core i7 processor. Experiments were conducted to test the accuracy and speed of the proposed real-time object distance measurement system, as well as the impact of the foreground extraction technique. Test results of image segmentation show that our technique is very efficient in terms of time performance: the total processing time of object extraction is approximately 2 ms at a 320×240 image resolution. This makes it ideal for real-time stereo correspondence, which requires a frame rate of 30 frames per second. Table 1 shows some distances computed by the system. Figure 4 shows results of the disparity map with prior foreground extraction compared with the standard disparity map. Table 2 shows the average segmentation processing time in different environments, followed by Table 3 with the processing time for each disparity cycle.
Table 1
Object Distance Measurement Results
Actual Distance (cm) | Measured Distance (cm) | Average Error (cm)
60.00 | 60.31 | -0.31
80.00 | 80.20 | -0.20
100.00 | 100.34 | -0.34
123.50 | 123.83 | -0.33
150.00 | 149.72 | 0.28
187.80 | 187.65 | 0.15
307.00 | 306.42 | 0.58
357.00 | 356.06 | 0.94
407.00 | 405.34 | 1.66
Fig. 4. Simple scene (left) and complex scene (right) disparity map computed by the proposed method.
Table 2
Processing time for segmentation tests
Test Scene | Processing time (ms)
Simple Scene | 1.5
Complex Scene | 2.0
Table 3
Average processing time per disparity cycle
Test Scene | Processing time (ms)
Simple Scene | 22.60
Complex Scene | 31.12
From the results shown in Table 1, we can see that the approach used to estimate the distance from the system to target objects provides very accurate results when the target objects are located less than 3 meters away. For objects placed beyond this distance, the accuracy of the algorithm decreases significantly. This problem can be mitigated by obtaining more accurate camera calibration results, by increasing the baseline separating the cameras, or by increasing the image resolution. Increasing the image resolution would provide a more precise measurement, but since our aim was to obtain depth in real time we opted for a lower resolution so that processing can be performed in real time. From Table 3 we can observe that the average depth calculation takes 31.12 ms, which allows the system to work in real time at a frame rate of 30 fps.
Conclusion
In this paper, the development of a low-cost stereo vision system for distance measurement has been described. The depth estimation results of this system have been characterized with an emphasis on the accuracy and real-time performance required for mobile robot platforms. A segmentation method applied prior to the block matching algorithm was proposed to extract the target object from the scene by creating a mask for every block matching pass. The mask is then used to simplify the disparity by ignoring irrelevant background depth information. Finally, a method that enhances the accuracy of the disparity map was proposed: to improve the disparity quality, it uses a weighted-mean image memory in which a new disparity "image memory" keeps the weighted mean of each element of each block matching computation. The combination of the memory approach and the prior segmentation provides more accurate and stable results for our real-time stereo system. Nevertheless, there is still room to improve the accuracy of the depth map, which remains a subject for further exploration and research.
References:
- Marron-Romera, M. and Garcia, J. Stereo vision tracking of multiple objects in complex indoor environments // Sensors — 2010 — № 10 — pp. 8865–8887.
- Samper, D. and Santolaria, J. A stereo-vision system to automate the manufacture of semitrailer chassis // International Journal of Advanced Manufacturing Technology — 2013 — Vol. 67 — Issue 9–12 — pp. 2283–2292.
- Cai, L. and He, L. Multi-object detection and tracking by stereo vision // Pattern Recogn. — 2010 — Vol. 43 — Issue 12 — pp. 4028–4041.
- Perek, P. and Makowski, D. Automatic Calibration of Stereoscopic Video Systems // 22nd International Conference on Mixed Design of Integrated Circuits and Systems — 2015 — pp. 134–137.
- Menze, M. and Muhle, D. Using Stereo Vision to Support the Automated Analysis of Surveillance Videos // XXII ISPRS Congress — Intern. Archives of the Photogrammetry, Remote Sensing and Spatial Information Science — Vol. XXXIX-B3 — 2012 — pp. 47–51.
- Kumari A., Dharsenda and Joshi, Sanjay Real Time Stereo Vision System for Safe Driving Distance Measurement // Scientific Journal of Impact Factor — 2015 — Vol. 2 — Issue 2 — pp. 181–184.
- Kordelas, G. A., Alexiadis, D. S. Enhanced disparity estimation in stereo images // Image and Vision Computing — 2015 — Vol. 35 — pp. 31–49.
- Liu, F., Xie, M. and Wang, W. Stereo calibration method of binocular vision // Comput. Eng. Des. — 2011 — Vol. 32 — Issue 4 — pp. 1508–1512.
- Wilczkowiak, M., Boyer, E. and Sturm, P. Camera calibration and 3D reconstruction from single images using parallelepipeds // Eighth IEEE International Conference on Computer Vision — 2001 — Vol. 1 — pp. 142–148.
- Zhang, Z. A flexible new technique for camera calibration // IEEE Transactions on Pattern Analysis and Machine Intelligence — 2000 — Vol. 22 — pp. 1330–1334.
- Hsu, Tsung-Shiang and Wang, Ta-Chung An Improvement Stereo Vision Images Processing for Object Distance Measurement // International Journal of Automation and Smart Technology — 2015 — Vol. 5 — No. 2 — pp. 85–90.
- Otsu, N. A Threshold Selection Method from Gray-Level Histograms // Transactions on Systems, Man and Cybernetics — 1979 — Vol. SMC-9 — No. 1 — pp. 62–66.
- Konolige, K. Small vision system: Hardware and implementation // Proceedings of the International Symposium on Robotics Research — 1998 — pp. 203–212.
- Bradski, G. and Kaehler, A. Learning OpenCV: Computer Vision with the OpenCV Library — 2008 — O'Reilly Media, Inc., California — ISBN-13: 9780596516130.