Abstract
This master thesis works towards a new approach to obstacle detection using stereo vision in mobile
robotics. The robot for which the results of this work are intended faces an unstructured natural
environment with big differences from one scenario to another. The main idea is the use of three-dimensional
representations of the surroundings via point clouds instead of the usual two-dimensional images.
To accomplish this purpose, the thesis introduces two filters that enhance each other to improve the
quality of the data acquired from the stereo vision system. Moreover, it proposes a solution to extract
the ground surface and, based on a simple region-growing technique, a novel method to decompose the
space into distinct objects. Building on the good results of these steps, a classification procedure
identifies the obstacles and provides a semantic labelling.
All the contributions presented throughout the chapters are successfully tested on real image datasets,
with empirical evaluation of the results on a mobile robot navigating in outdoor environments. The thesis
finally discusses the problems connected to non-homogeneous surroundings, remarking on the main
difficulties it tackles.
Chapter 1
Introduction
Robust obstacle detection has always been a very important goal for intelligent robots. For example,
adaptive cruise control and forward collision warning systems are nowadays considered standard options
on the vehicle market. Obstacle detection is approached in various manners and with different sensors such as
vision systems, radar, sonar, etc. Though the data quality of the other sensors is sometimes better,
vision sensors have many advantages: low cost, small dimensions, the large quantity of information they
capture and, last but not least, they reproduce the way the human body detects distant objects. The question
arises spontaneously: is our body the perfect machine, or not? Would we like to develop a system able to
sense and react as well as our eyes? Of course it is not easy, but we can get really close to that
goal. Recently, stereo vision systems have been used for autonomous off-road robots, with great interest from
different fields, among which the military, the automotive sector mentioned before, gaming and civilian
applications, as well as planetary exploration.
A second big technological challenge is the ability for these systems to perceive the environment
and use such information for control. Indeed, we could think of using the vision system to correct
data coming from other sensors such as a Global Positioning System (GPS), an Inertial Measurement Unit (IMU),
a laser scanner and odometric sensors.
It is clear that driving in outdoor environments, and particularly in non-urban environments, is much
more challenging than driving in structured indoor scenarios. In fact, in a structured environment it is easy and
feasible to classify objects and therefore to divide the problem into simple sensing and action strategies.
For example, in urban situations it is expected that the ground surface in front of the robot is planar.
Instead, a robot in a forest must constantly determine what lies in front of it, because the small bumps
and holes continuously change the perspective of the cameras. Other challenging situations include
the recognition and classification of objects. What is an obstacle from a geometric point of view (for
instance a small bush or some tall grass) may not really be an obstacle and may not represent a danger for
the robot. Furthermore, a color classification could be useful to recognize a certain number of distinctive
classes of objects, like bushes, trees, stones and poles. Unfortunately, color-based classification
introduces a new challenge, since one has to avoid ambiguities and noise sources like sun reflections,
effects of the illumination spectrum, chromatic shifts due to atmospheric effects and seasonal changes.
Though obstacle detection and avoidance is well studied in robotics, the existing algorithms are
mostly designed for urban and indoor environments. This leads to poor results in off-road scenarios
because the typical simplifying assumptions (such as a flat ground and a clear distinction between two
objects) are not valid there.
1.1 The World Representation
Looking at the future, the human being wants to know more about the ecosystem in which he
lives. Environment exploration, scientific experiments, rescue operations and object
deliveries are just a few of the ambitions he would like to accomplish. Because of the limits of the
human body, these dreams cannot always come true. But research goes on and the motivations
are really strong. One of the ultimate goals in the mobile robotics field is the navigation and mapping of
unstructured surroundings. To this end, this thesis tackles the problem of obstacle
detection, determining where the robot can safely move in a scenario; moreover, it tries to determine
whether the detected objects represent a real threat to the robot's integrity.
Human perception takes the environment in as images, and describes the world in terms of what
it sees and of what its experiences are. From this simple point of view, one may suggest that we could
solve any kind of problem in a 2D context. But, as shown in Figure 1.1, the true semantic meaning of a
detected object can easily be misinterpreted in different scenarios. A first reason is that monocular
computer vision depends on the quality of the camera, so there is a strict relation with the camera's
construction deficiencies and with the data stream itself. One case is when the image is underexposed, as in
Figure 1.1b; this makes it very hard for a 2D image processing application to recover the missing information.
This problem can probably be mitigated with the use of better camera sensors.

Figure 1.1: Template matching problem. (a) Original image and the selected template. (b) Underexposed image. (c) Mismatch examples.
A second reason is linked to the fact that the computer vision system uses 2D images, which are just a
projection of the 3D world onto a plane. As shown in Figure 1.1c, a good match depends on the context.
In the first case, while the system could apparently match the model template (a tree), after a zoom-out
we observe that the tree was in fact another picture, lying on a different 3D geometric shape (an
advertisement panel in this case). In the second case, it could be really hard for the system to recognize a
single template among objects with similar characteristics.
1.2 Acquisition methods
The ability to build sensors that can provide three-dimensional (3D) data has always aroused
considerable interest in both scientific and industrial fields because of the many applications in which
these sensors could be used. Interestingly, three-dimensional passive sensors based on traditional
cameras are potentially cheap and less invasive in the environment in which they are used, compared
with sensors based on active technologies.
So, though there are many methods to obtain distance information and convert it into a 3D representation,
in the context of mobile robots the most used approaches are the following:
Time of Flight (TOF): a system which estimates the distance from the sensor to a surface by measuring the
time that an emitted signal takes to hit and return from a target object. The usual technologies are:
• Pulsed light source with digital time counters: it consists of a pulsed laser and an imaging integrated
circuit for a fast computation of the distance of every pixel.
• Range gated imagers: a shutter in front of the camera sensor is opened and closed at the same
frequency with which the light pulses are emitted. According to the arrival time, part of the returned
light is blocked by the shutter closing, making it possible to distinguish between far and near
objects.
• RF-modulated light sources with phase detectors: it works by modulating the outgoing beam with a
radio-frequency carrier; the distance is inferred by measuring the phase shift of that carrier at the
receiver side.
Knowing the propagation speed of the signal and using a precise system to measure the time of flight, a simple
approximation of the distance d is given by:

d = (c · t) / 2

where c is the speed of the signal (the speed of light in the case of laser sensors) and t is the flight time.
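For example, a laser pulse (c ≈ 3 · 10^8 m/s) received back after t = 2 μs corresponds to a distance
d = (3 · 10^8 · 2 · 10^-6) / 2 = 300 m.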
On one side, the advantages of this system are its simplicity and speed; on the other, some of
its disadvantages are multiple reflections, the cost, and the disturbance caused by the sun or by other
sensors of the same type running at the same moment.
Triangulation Techniques: the distance of a point is estimated by processing the same information
seen from two different sensors at the same moment. It requires a calibration of the devices and
the use of algorithms to estimate the depth of the points.
Though many different triangulation systems exist, the stereo camera is the most used in
mobile robotics applications. Apart from the acquisition speed, its major advantage is that it is passive,
which means that it does not need to emit any kind of light or sound source to retrieve the information.
On the other hand, its major disadvantage is the computational complexity of the stereo matching
algorithms. In that sense there are many techniques, but most of the time they produce good results to
the detriment of speed, making them unsuitable for real-time systems.
The triangulation technique in stereo vision will be tackled in depth in Chapter 3.
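As a brief preview of that chapter, for a calibrated and rectified stereo pair the depth Z of a point
follows from similar triangles:

Z = (f · b) / d

where b is the baseline between the two cameras, f is the focal length expressed in pixels and d is the
disparity, i.e. the horizontal shift of the same point between the left and right images. Small disparities
thus correspond to far points, which is why the depth resolution degrades with distance.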
1.3 Thesis Outline
Figure 1.2 depicts the roadmap of the thesis. Chapter 1 introduces the problem
and gives the motivation for the work. Chapter 2 describes and analyses the problem in
order to develop a proper solution. Chapter 3 introduces the basic methods and mathematics used to acquire the
3D information; it also gives an introduction to the libraries used to test all the assumptions. Chapter 4
describes the representation of 3D points and the features derivable from them; in particular, it discusses
some algorithms for the estimation of geometric shapes from a set of three-dimensional points.
Chapter 5 introduces the filtering techniques used to improve the description of the surroundings and filter out
the noise. Chapter 6 investigates the basic clustering techniques derived from the Euclidean description of the
space; based on them, it proposes an advanced clustering method and produces an occupancy map
useful for the robot navigation. Chapter 7 classifies the objects recognized in the previous
chapter, based on the color histograms and on the sample second moment.
Figure 1.2: Thesis Roadmap.
Chapter 2
System Overview
The project started as a support for the navigation of the iRobot ATRV Jr. rover shown in Figure 2.1.
This is a 4-wheel skid-steer robot and an excellent test bench: it has various types of sensory
support (such as GPS, gyro, encoders, laser scanner and a Crossbow Inertial Measurement Unit), its structure is
very sturdy, it has efficient motors and a differential drive, and its wheels are relatively high and thick,
ensuring the ability to operate on rough and rocky terrain.
Figure 2.1: The iRobot ATRV Jr. used for the project.
The general idea is to exploit these features to support the autonomy of the robot and to plan a
motion path while avoiding obstacles. For this purpose, it will be crucial to establish the disposition of the
objects in the surrounding environment and to classify as many types of obstacles as possible in order to
determine their risk factor.
As already mentioned, this thesis tackles these last features. The presented solution therefore
represents only a subcomponent of a larger integrated architecture, and it will provide good support for the
future development of motion planning and navigation routines, such as a visual odometry system.
2.1 Problem Requirements
The tools available to achieve the objective are essentially two: an on-board computer and a stereo
vision system composed of two Guppy cameras. The design conditions are:
Camera Support Characteristics
  Baseline: 130 mm
  Position (X/Y/Z, from the robot coordinate system): 0.4 m / 0 m / 0.63 m
  Tilt angle: 0.15 rad

Camera Characteristics
  Interface: IEEE 1394a, 400 Mb/s, 1 port
  Resolution (Raw8): 752 x 480 pixels
  Sensor type: CMOS Progressive
  Sensor size: Type 1/3 (4.80 x 3.60 mm)
  Max frame rate: 64 fps
  Dimensions (L x W x H): 48.2 x 30 x 30 mm

Lens Characteristics
  Lens aperture: F/1.8 - Close
  Focal length: 6.5 mm

Table 2.1: Stereo device characteristics.
• Implementation in RTAI-Linux (Real-Time Application Interface for Linux), which is a patch to the
Linux kernel introducing a hardware abstraction layer and a variety of services meant
to simplify real-time programming.
• The use of API modules and libraries which can be integrated with the current framework, DTU Mobot-
Ware (Mobile Robot software). The latter is a plug-in based real-time software platform developed
at the Automation and Control Department of DTU with the purpose of easy integration
with sensors, fast porting to different robots and easy extensibility to new hardware platforms and
configurations.
• Responses within strict time constraints, to guarantee the real-time processing of the overall system.
To achieve this goal, certain algorithms and functions will be used which might not guarantee the
best results in terms of solution quality, but which represent the best compromise regarding execution
time.
The stereo rig is made of two Allied Guppy F-036C cameras with Tamron 23FM65 lenses (Figure 2.2a)
mounted on a metal base and parallel to each other (Figure 2.2b). The choice of these cameras comes from
their relatively cheap price by industrial standards, their good-quality lenses and optical sensors, and the
option to synchronize the acquisition of both cameras with a trigger signal.

Figure 2.2: The support used for the stereo vision. (a) Allied Guppy F-036C camera. (b) Stereo configuration with Guppy cameras.
The cameras are connected to the on-board computer with two FireWire cables (IEEE 1394, 8-pin)
and to each other with an I/O connector for the external triggering. The output of each
camera is a raw image with a resolution of 752 x 480 pixels in Bayer pattern format (the BGGR variant). This
pattern samples each color with four components composed of 50% green, 25% red and 25% blue, as shown
in Figure 2.3.

Figure 2.3: The Bayer pattern (BGGR).
This color distribution arises from the requirement to mimic the human eye retina,
which is more sensitive to green light during daylight vision. The filtering and interpolation of these
four components yields one color pixel, giving a final resolution of 376 x 240 for the color image. Finally,
the maximum acquisition speed is 64 fps, which is perfectly suitable for a real-time application.
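As an illustration of this conversion, the following minimal sketch uses the OpenCV library (the file
names are placeholders, and the exact Bayer conversion constant depends on the sensor's mosaic layout):

    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>

    int main() {
        // Load a raw 752 x 480 single-channel Bayer frame (placeholder file name).
        cv::Mat raw = cv::imread("guppy_raw.png", cv::IMREAD_GRAYSCALE);

        // Demosaic the BGGR mosaic into a 3-channel BGR color image.
        cv::Mat color;
        cv::cvtColor(raw, color, cv::COLOR_BayerBG2BGR);

        // Halve the resolution (752 x 480 -> 376 x 240), matching the setup
        // above where each 2 x 2 Bayer cell contributes one color pixel.
        cv::Mat half;
        cv::resize(color, half, cv::Size(raw.cols / 2, raw.rows / 2),
                   0, 0, cv::INTER_AREA);

        cv::imwrite("guppy_color.png", half);
        return 0;
    }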
The next chapter will investigate all the limitations and capabilities that we can expect from this
stereo system.
2.2 Related Works
The most typical representations of the world in robotics use 2D maps, such as occupancy maps,
to plan the robot motion (Konolige et al. (2006)). Occupancy maps define the obstacles in a binary
way, which oversimplifies the complex interaction between the robot and the environment. For instance,
obstacles depend on the context in which the robot is operating: some tall grass could be reported as
a problem while it actually behaves as traversable ground during the motion.
Another method consists of the acquisition of hybrid 2D-3D maps, with the use of laser sensors for
the obstacles and a vision system for the semantic labelling (e.g. Iocchi and Pellegrini (2007)).
A complete obstacle detection with classification is introduced in Nuechter et al. (2003), in which
the objects are shaped using their similarities with geometric features. A similar method was introduced
in Vandapel et al. (2004), where LIDAR (Light Detection And Ranging) data was also used to extract an
eigenvalue decomposition of the environment. Despite the good results, all these algorithms are not
suitable for stereo vision, in which the level of noise is too high to acquire acceptable data.
Bright results were definitely obtained in Manduchi et al. (2005), which uses algorithms for the
recognition of different kinds of terrain for cross-country autonomous navigation. The stereo recognition
is based on range measurements and is moreover supported by a color-based classification
system.
Another promising research effort was conducted in Rusu et al. (2009); they implemented a visual odometry
system in which they introduced for the first time the concept of point clouds in robot vision. The idea
was to process the points taken at various instants, interpolate the data to reduce the noise and recover
more details, and label the surrounding scenario based only on the stereo vision information.
2.3 Problem Breakdown
The problem should be solved by developing programs that are easy to reuse and adaptable to contexts
different from those considered in the thesis. In addition, full Linux compatibility is required to
ensure the possibility of creating a plug-in for the MobotWare platform.
The problem has been divided into these tasks:
a) The images produced by the cameras are initially converted from the raw Bayer (BGGR) format to RGB
in order to improve the visualization and the processing.
b) After assessing the limits of the stereo vision system and the cameras, the calibration is performed
to obtain the information needed to correct lens imperfections and align the camera planes. To do
this, many images of a chessboard are acquired from different distances and perspectives and then
processed using the OpenCV library (a sketch of this step is given after this list).
c) The stereo matching takes place by comparing two methods: the block matching algorithm and
the semi-global block matching algorithm. Although the second provides more detailed results and
is able to achieve a better sub-pixel matching, it is presented only as an example. The reason is
its slow processing speed, and therefore the loss of the fulfilment of the real-time
requirement.
d) Once the disparity map is obtained, the triangulation of the points is accomplished and the information
is saved as data describing the environment in 3D (a point cloud); a sketch combining steps (c) and (d)
is given after this list.
e) To speed up all the subsequent operations, the point cloud is processed at this stage to provide
a more compact representation. The literature in this field presents two main ways to design
the sampling approximation: the octree and the kd-tree structures. Both provide really fast search
methods, but the first is preferred because it naturally implements techniques for expanding and
reducing the dimensional complexity. Another approach is geometric modelling, in which an
agglomeration of points is approximated with a geometric shape. The fastest and easiest method is
RANSAC (random sample consensus) combined with a convex hull extraction. To be precise,
better solutions exist (like the concave hull or the PFH¹) but they usually increase the complexity
of the solution, ending up with the loss of real-time processing. To test the various methods, the PCL
library is used as an aid to the implementation; a sketch of this step is given after this list.
f) The information is full of noise and needs to be improved. A good idea is to use a pass-through
filter to limit the data to a bounding box of interest. Then, another filter is designed to highlight the
features of the objects in the point cloud. The procedure is to delete the data that is not dense enough
(for instance, points without other close points around them) and the areas of space that are too bright
(as they are often part of the sky or of solar reflections); a sketch of these filters closes the section.
¹The Point Feature Histogram represents each point with its neighbourhood features. It therefore allows sampled
points to be easily grouped into classes depending on the surface on which they lie. Though the method is really powerful
and provides a good segmentation of the space, it is computationally expensive during the estimation of the features.
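To make steps (b) to (f) more concrete, the following minimal sketches show how each of them can be
approached with the OpenCV and PCL libraries mentioned above. They are illustrative outlines with
indicative parameter values and hypothetical function names, not the final implementation. For the
calibration of step (b), each chessboard view contributes one set of 2D-3D correspondences:

    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Detect the chessboard corners in one grayscale view and accumulate the
    // correspondences later consumed by cv::calibrateCamera.
    bool addChessboardView(const cv::Mat& gray, cv::Size board, float squareSize,
                           std::vector<std::vector<cv::Point2f>>& imagePoints,
                           std::vector<std::vector<cv::Point3f>>& objectPoints) {
        std::vector<cv::Point2f> corners;
        if (!cv::findChessboardCorners(gray, board, corners))
            return false;                 // chessboard not visible in this view

        // Refine the corner locations to sub-pixel accuracy.
        cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                          30, 0.01));

        // The 3D model of the board: a planar grid lying at Z = 0.
        std::vector<cv::Point3f> model;
        for (int y = 0; y < board.height; ++y)
            for (int x = 0; x < board.width; ++x)
                model.emplace_back(x * squareSize, y * squareSize, 0.0f);

        imagePoints.push_back(corners);
        objectPoints.push_back(model);
        return true;
    }

Once enough views are collected for both cameras, cv::calibrateCamera estimates the intrinsic parameters
and the distortion coefficients, and cv::stereoRectify aligns the two image planes.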
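For steps (c) and (d), a sketch of the faster block matching variant followed by the reprojection to 3D;
the matrix Q is the 4 x 4 reprojection matrix produced by cv::stereoRectify, and the disparity parameters
are only indicative:

    #include <opencv2/calib3d.hpp>

    // Compute a disparity map with block matching and reproject it to 3D.
    cv::Mat disparityTo3D(const cv::Mat& leftGray, const cv::Mat& rightGray,
                          const cv::Mat& Q) {
        cv::Ptr<cv::StereoBM> bm =
            cv::StereoBM::create(/*numDisparities=*/64, /*blockSize=*/15);

        cv::Mat disp16;
        bm->compute(leftGray, rightGray, disp16);    // fixed-point disparities (x16)

        cv::Mat disp;
        disp16.convertTo(disp, CV_32F, 1.0 / 16.0);  // back to pixel units

        cv::Mat xyz;                                 // 3-channel image of (X, Y, Z)
        cv::reprojectImageTo3D(disp, xyz, Q, /*handleMissingValues=*/true);
        return xyz;
    }

Replacing cv::StereoBM with cv::StereoSGBM in this sketch yields the semi-global variant discussed
above, at the cost of the execution time.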
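For the compact representation of step (e), the sketch below uses PCL's voxel-grid downsampling (a
simpler stand-in for the octree structure preferred in the thesis) followed by a RANSAC plane fit; the
leaf size and the distance threshold are indicative:

    #include <pcl/point_types.h>
    #include <pcl/point_cloud.h>
    #include <pcl/filters/voxel_grid.h>
    #include <pcl/segmentation/sac_segmentation.h>
    #include <pcl/ModelCoefficients.h>
    #include <pcl/PointIndices.h>

    using Cloud = pcl::PointCloud<pcl::PointXYZRGB>;

    // Downsample the cloud and fit a plane with RANSAC (e.g. a first guess
    // of the ground surface).
    void downsampleAndFitPlane(const Cloud::Ptr& input, Cloud::Ptr& reduced,
                               pcl::ModelCoefficients& plane,
                               pcl::PointIndices& inliers) {
        reduced.reset(new Cloud);

        // Replace every 5 cm voxel with the centroid of its points.
        pcl::VoxelGrid<pcl::PointXYZRGB> grid;
        grid.setInputCloud(input);
        grid.setLeafSize(0.05f, 0.05f, 0.05f);
        grid.filter(*reduced);

        // RANSAC plane fit with a 3 cm tolerance around the model.
        pcl::SACSegmentation<pcl::PointXYZRGB> seg;
        seg.setModelType(pcl::SACMODEL_PLANE);
        seg.setMethodType(pcl::SAC_RANSAC);
        seg.setDistanceThreshold(0.03);
        seg.setInputCloud(reduced);
        seg.segment(inliers, plane);
    }

The convex hull of the inliers can then be extracted with pcl::ConvexHull to obtain the geometric model
of the fitted surface.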
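Finally, for the filtering of step (f), the bounding box and the density condition map naturally onto
PCL's PassThrough and RadiusOutlierRemoval filters (the axis limits and density parameters are
indicative; the brightness criterion is left to the filter designed in Chapter 5):

    #include <pcl/point_types.h>
    #include <pcl/point_cloud.h>
    #include <pcl/filters/passthrough.h>
    #include <pcl/filters/radius_outlier_removal.h>

    using Cloud = pcl::PointCloud<pcl::PointXYZRGB>;

    void filterCloud(const Cloud::Ptr& input, Cloud::Ptr& output) {
        output.reset(new Cloud);

        // Keep only the points inside the depth range of interest.
        pcl::PassThrough<pcl::PointXYZRGB> pass;
        pass.setInputCloud(input);
        pass.setFilterFieldName("z");
        pass.setFilterLimits(0.5, 10.0);   // 0.5 m to 10 m in front of the robot
        Cloud::Ptr boxed(new Cloud);
        pass.filter(*boxed);

        // Delete the points that are not dense enough: require at least
        // 8 neighbours within a 10 cm radius.
        pcl::RadiusOutlierRemoval<pcl::PointXYZRGB> ror;
        ror.setInputCloud(boxed);
        ror.setRadiusSearch(0.10);
        ror.setMinNeighborsInRadius(8);
        ror.filter(*output);
    }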