Chapter 1
Introduction
Since the early years of computing, the keyboard and the mouse have been the
most popular means of human-computer interaction; in recent years, however,
a new trend has emerged in the search for more natural and immersive kinds
of interfaces, which let users interact with their whole body or simply by
touching or moving input devices.
Touchscreen displays are a well-known technology: the first examples were
introduced on the market in the second half of the 1960s [1], but their
massive consumer adoption started with the introduction of the iPhone by
Apple Inc. in 2007 [2]. After this event, touchscreen devices became more
and more popular, and other interesting applications were launched on the
market, such as Microsoft Surface from Microsoft Corporation, an interactive
table developed as a combination of software and hardware that allows
multiple users to manipulate digital content through gestures performed by
touching the table, and that can interface with physical devices resting on
its surface. A revolution in the Human-Computer Interaction (HCI) field was
introduced by Nintendo Co., Ltd., which launched the Wii in late 2006 [3],
a latest-generation gaming console whose set of controller devices lets the
user interact with the system through body movements, thanks to motion
sensors and infrared transmitters and receivers that allow the system to
estimate the 3-D position of the controller.

[1] http://en.wikipedia.org/wiki/Touchscreen#History
[2] The iPhone is a new-generation smartphone whose main means of interaction is a multitouch interface. More information at http://en.wikipedia.org/wiki/IPhone
[3] http://en.wikipedia.org/wiki/Wii
A further step in this direction was announced by Microsoft, which in June
2009 revealed its Project Natal, whose aim is to enable users to interact
with the system in a natural way, without any kind of physical device or
colored marker, using only gestures performed with the whole body and vocal
commands [4].

[4] http://en.wikipedia.org/wiki/Project_Natal
In this context, the analysis of video streams to infer information about
the captured scene has become an attractive option, as video capture devices
and related technologies, like storage devices and internet access, have
become cheaper and cheaper in recent years; the falling prices enable the
development of advanced interaction systems on low-cost hardware. In the HCI
field, for example, the position and configuration of the user's hands can
be a highly informative piece of knowledge, making the system able to
respond to a specific configuration or to a certain gesture trajectory.
However, implementing such an interaction method is more challenging than
implementing interaction through an ad-hoc input device: in the latter case
the information is provided directly by the device, while in the former case
no information about hand position or configuration is available. To
estimate this knowledge, a huge quantity of noisy data from image
observations needs to be analyzed, filtered and interpreted; in addition,
device-dependent limits, like sensor noise or poor image quality, can
degrade the usability of such a system, which does not require the user to
wear any colored marker such as gloves.
The general problem addressed in this thesis is the discovery of data
patterns embedded in a larger set of data; more specifically, two kinds of
patterns are searched for:

- specific trajectories performed by the user with his hands;
- specific hand positions, or configurations.

Both problems require first localizing the user's hands, so the overall task
involves the analysis of low-level features as well as the interpretation of
higher-level information. The goal of this thesis work is to implement a
human-computer interface based on hand gestures, using algorithms that
represent the state of the art and proposing new ones where needed, and
finally to test the developed implementation; hands must be located by
analyzing images taken from a webcam, without requiring the user to wear any
kind of device or colored marker.
The overall framework is composed of the following modules:
- hand configuration model learning: the system needs to be trained on the
configurations it must recognize;
- trajectory model learning: as for configurations, the system must learn
the trajectories it is requested to spot;
- hand detection: the system must detect the location of all hand instances
that appear in the processed images;
- hand tracking: in case of multiple detections, each hand occurrence
located in a frame needs to be correctly associated with one detected in the
next frame;
- fingertip detection: given a hand image, this module locates the
fingertips;
- hand feature extraction: given a fingertip map, this module builds the
features needed in the configuration matching process;
- hand configuration matching: detected hand configurations are evaluated
against the stored configuration models in order to find the best match;
- trajectory feature extraction: hand locations are used to build the
features needed by the matching and spotting modules;
- on-line trajectory matching (with pruning): the observed hand trajectory
is compared with those previously learned, in order to find the most similar
model;
- trajectory spotting: this module states whether a known trajectory,
reported by the matching module, actually occurred.
Figure 1.1: System activity diagram
Figure 1.1 shows the activity diagram of the system. The system takes as
input a video stream that is processed according to the stages listed above.
If a particular hand configuration is recognized or a trajectory is spotted,
the system can raise an event that will be processed at a later stage. The
precise tracking of hands makes it possible to attach "analog" commands,
like moving the mouse or dragging elements in a virtual environment, in
addition to "spotted" commands, like a mouse click.
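As a concrete illustration of this flow, the skeleton below organizes the stages into a single processing loop. This is only a minimal Python sketch: every class, method and parameter name is a hypothetical placeholder for the modules described above, not the interface of the actual implementation.

```python
class GesturePipeline:
    """Skeleton of the activity flow of figure 1.1.

    Every stage below is a stub standing in for the corresponding
    module described above; the names are hypothetical.
    """

    def __init__(self, config_models, trajectory_models):
        self.config_models = config_models          # learned off-line
        self.trajectory_models = trajectory_models  # learned off-line
        self.tracks = []  # hand instances followed across frames

    # --- stage stubs -----------------------------------------------
    def detect_hands(self, prev, curr):
        return []      # moving skin-colored blobs in `curr`

    def track_hands(self, detections):
        return []      # detections coupled with existing tracks

    def match_configuration(self, track):
        return None    # configuration label, or None if no match

    def spot_trajectory(self, track):
        return None    # spotted gesture, or None

    def run(self, capture):
        """Process a stream; `capture` needs a read() -> (ok, frame)
        method, e.g. a cv2.VideoCapture."""
        prev = None
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if prev is not None:
                detections = self.detect_hands(prev, frame)
                self.tracks = self.track_hands(detections)
                for track in self.tracks:
                    label = self.match_configuration(track)
                    gesture = self.spot_trajectory(track)
                    # a recognized configuration or a spotted gesture
                    # would raise an application event here
            prev = frame
```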
In order to recognize the hand configuration, the system matches features
extracted from the on-line video stream against models learned during an
off-line phase. To learn the models, the system is fed with hand images in
which the hands assume the desired configuration and the fingertips are
marked manually.
In a similar way, trajectories are matched against models learned off-line:
here the models are learned from recorded video streams in which the user
performs a gesture trajectory wearing a colored glove against a neutral
background (neutral relative to the color of the glove), enabling an easy
detection of the gesturing hand.
While processing the on-line input stream, the user is not required to wear
any kind of marker to track hands; the hand detection module searches for
"skin that is moving", while the hand tracking module couples the detections
of one frame with those of the next. Once a hand is segmented, the fingertip
detection module returns the possible locations of the fingertips, which are
used as features to assign a configuration label.
Hand locations are known frame by frame, so it is possible to extract
position and motion information and start the trajectory matching process;
during this phase, unlikely matching hypotheses are rejected by pruning
classifiers learned during the off-line training; thanks to pruning,
performance is enhanced both in accuracy and in speed. Finally, the spotting
module states whether a gesture has actually been performed by the user and,
if so, the gesture class label, the matching cost, the start frame and the
end frame are returned.
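The spotting result described here can be pictured as a small record; a minimal sketch with illustrative field names, mirroring the four values listed above:

```python
from dataclasses import dataclass

@dataclass
class SpottedGesture:
    """Result reported when the spotting module confirms a gesture."""
    label: str            # gesture class label of the best-matching model
    matching_cost: float  # dissimilarity between observation and model
    start_frame: int      # frame where the gesture was judged to begin
    end_frame: int        # frame where the gesture was judged to end
```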
Once the gesture labels are known (for both trajectory gestures and hand
configuration gestures), it is straightforward to attach system directives
to them, or to pack the information into an event in order to build complex
client-server architectures.
As an application example, gestures were used to control a multimedia
player: the user must be able to choose the next or previous media file from
a playlist, increase or decrease the sound volume, seek within the file,
play or stop it, and close the application. The multimedia player used is
VLC (VideoLAN Client) [5].

[5] http://www.videolan.org/vlc/
Chapter 2
Hands tracking
As an introduction to this chapter, a general, high-level formulation of the
object tracking problem is provided:

Definition 1. Given an ordered input stream of images (frames) and a target
class of objects, locate each instance of such an object frame by frame and,
for each detection of the target class in a given frame, couple that
occurrence with a corresponding one in the following frame, once evidence
has been acquired that they represent the same object instance.
From this definition it is possible to distinguish two subproblems, with a
problem dimension for each of them:
- detection of objects in a single frame (spatial dimension);
- tracking of a single object along the entire input stream (temporal
dimension).
Processing an image in order to solve the spatial problem requires defining
some features that characterize the object class to be searched for; typical
features used in such a task are color distribution, shape, edges, motion
information and so on. These are low-level features, easier to encode in a
computer than higher-level information, such as the context in which the
object can be found or its interactions with the rest of the world.
The objects tracked in this thesis work are those that can be classified as
"human hands".
The human hand is a highly articulated object with many degrees of freedom
that can assume a large number of shape configurations: some works [17],
[24], [5] model the hand in a 3-D Euclidean space and then estimate its
posture by synthesizing the model and varying its parameters until the 2-D
projection and the real hand image appear similar enough. However, these
approaches work when the hand has already been segmented, and due to the
non-rigid nature of the human hand it is not possible to rely on shape
features to efficiently locate it in a complex scene. Another feature
commonly used in the literature ([4], [26]) is the skin color distribution,
which is also adopted in this work. However, skin color detection methods
are not precise enough to guarantee that only real skin regions will be
located: these methods are very sensitive to ambient light variations, and
it is quite common that wooden doors, pink flowers and other similarly
skin-colored objects are included as well.
There is another cue that helps to distinguish inanimate skin-colored
objects from human hands (and faces): doors and flowers usually appear
still, while hands quickly change their appearance. It follows that if only
moving skin-colored objects are searched for, the detector becomes more
robust. In this way the face will typically be detected as well (even though
hands usually move faster than the head) but, if necessary, its location can
be rejected during further processing.
In order to locate objects that quickly change their appearance (like
hands), [26] introduces the idea of the residue image. Such an image is
computed by partitioning the current gray-level image into blocks and
assigning to each block a scalar value proportional to how different its
area appears in the current frame compared to the appearance of the most
similar area in the next frame.
Once hands have been located in a single frame, the temporal problem still
remains: it is necessary to track all the movements of a single hand
instance along the whole video sequence, but more than one skin-colored
moving object may be detected in a single frame. Several methods are
available in the literature to properly couple the detections of one frame
with those of the following one. In [26] the authors use a probabilistic
method to find the best current match by analyzing the history of hand
locations, while in [4] the authors rely on a sufficiently high frame rate
to assume that consecutive hand locations will be very close to each other,
so that the best match is the nearest detection.
This thesis implements the approach given in [4], whose limit is that it
detects hands effectively only if very good skin detection is guaranteed;
this means that if other skin-colored objects are visible in the background,
the detection has a high probability of producing unsatisfactory results.
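A minimal sketch of this nearest-detection coupling, assuming each detection has already been reduced to its centroid; the greedy pairing and the max_dist cutoff are an illustrative reading of the idea in [4], not that paper's exact procedure:

```python
import numpy as np

def couple_detections(prev_centroids, curr_centroids, max_dist=50.0):
    """Greedily pair each detection of the previous frame with the
    nearest unused detection of the current frame.

    prev_centroids: (N, 2) array of (x, y) hand locations at frame t.
    curr_centroids: (M, 2) array of (x, y) hand locations at frame t+1.
    max_dist: rejects implausible jumps; it relies on a frame rate high
    enough that a hand moves little between consecutive frames.
    Returns a list of (prev_index, curr_index) pairs.
    """
    prev_centroids = np.asarray(prev_centroids, dtype=float)
    curr_centroids = np.asarray(curr_centroids, dtype=float)
    pairs, used = [], set()
    for i, p in enumerate(prev_centroids):
        dists = np.linalg.norm(curr_centroids - p, axis=1)
        for j in np.argsort(dists):          # nearest first
            if j not in used and dists[j] <= max_dist:
                pairs.append((i, int(j)))
                used.add(int(j))
                break
    return pairs
```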
The contribution of this thesis work is to extend the detection method
described in [4] by removing from the scene those areas that appear very
similar from one frame to the next, using the information given by residue
images, and then applying the skin detection algorithm only to this filtered
result. In this way, the detection of the areas where hands (and faces) are
located is more robust even in the presence of many skin-colored objects in
the background. A problem introduced by this modification is analyzed in
section 2.4, and a way to handle it is proposed in the same section.
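A minimal sketch of this filtering step, assuming a per-pixel residue map and a binary skin mask are already available; the threshold value and function names are illustrative:

```python
import numpy as np

def detect_moving_skin(skin_mask, residue_map, residue_thresh=10.0):
    """Keep only pixels that are both skin-colored and inside blocks
    whose appearance changed between frames, so that static
    skin-colored background objects (doors, flowers, ...) are dropped.

    skin_mask: boolean (H, W) output of the skin detector.
    residue_map: (H, W) residue values, i.e. the per-block residues
    upsampled to image resolution.
    residue_thresh: illustrative cutoff separating still areas from
    moving ones.
    """
    moving = residue_map > residue_thresh
    return np.logical_and(skin_mask, moving)
```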
2.1 Residue image
As already mentioned, the residue image, first described in [26], reports
information about how much an object changes its appearance as time flows.
To infer this kind of knowledge it is necessary to compare at least two
consecutive frames, since appearance changes happen along the time axis. The
residue image is based on the key idea that, due to the non-rigid nature of
the hand, its appearance changes more frequently in time compared to that of
other objects; it is therefore possible to exploit this property and search
for regions in a frame that do not have good matches in the next one. To
find the best matches among the block partitions of two sequential frames,
the block matching method (also applied to compute the optical flow) can be
used: for each pair of consecutive images, the first one is partitioned into
several blocks and, for each block, the best match is searched in the next
frame by translating the current block within a search area and selecting
the region that minimizes the appearance difference according to a distance
measure. The algorithm returns a matrix containing, for each block, its
motion vector.
ow" information is obtained, the residue R
B
is computed for each block B
of dimension m n and its match M
B
of the same size as
R
B
=jB M
B
j (2.1)
whereB andM
B
are the average values of pixels inB andM
B
, respectively;
in other words this is the absolute di erence between the average value of
gray level pixels of current block and the average value of pixels of the area
that as been matched in the next frame. Because of the non-rigidity of hands,
residues tends to have higher values in hand regions.
The residue image is a good choice for finding non-rigid moving objects,
because it returns a filled area that can easily be turned into a blob,
whereas typical motion estimation methods, like frame differencing or
optical flow computation, tend to have their highest values distributed
along edges. An example of a residue image can be found in figure 2.1.
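The sketch below is a direct, unoptimized reading of equation (2.1): exhaustive block matching under a mean-absolute-difference cost, with illustrative block size and search radius (real implementations typically tune both):

```python
import numpy as np

def residue_image(curr, nxt, block=16, search=8):
    """Compute per-block residues R_B = |mean(B) - mean(M_B)| between
    two consecutive gray-level frames (equation 2.1).

    curr, nxt: 2-D arrays of equal shape (gray-level frames).
    Returns a (rows, cols) array with one residue per block.
    """
    h, w = curr.shape
    rows, cols = h // block, w // block
    res = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            y, x = r * block, c * block
            B = curr[y:y + block, x:x + block].astype(float)
            best_cost, best_mean = None, 0.0
            # translate the block within the search window of the next
            # frame and keep the most similar region M_B
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        M = nxt[yy:yy + block, xx:xx + block].astype(float)
                        cost = np.abs(B - M).mean()  # appearance distance
                        if best_cost is None or cost < best_cost:
                            best_cost, best_mean = cost, M.mean()
            # residue: difference of average gray levels (eq. 2.1)
            res[r, c] = abs(B.mean() - best_mean)
    return res
```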
2.2 Skin detection
The problem of skin detection is very challenging; to identify skin, the
most natural feature that can be exploited is color, which varies over a
wide range of values and has the big disadvantage of being very sensitive to
light variations. In other words, the appearance of skin is not the same
under different illuminants.
In the literature it is possible to distinguish between two different
approaches to this problem:
- proposing color value thresholds;
- analyzing the skin color distribution.
For example, some works suggest possible bounds, depending on the adopted
color space, within which the colors assumed by skin are constrained; in
[9], genetic algorithms are used to estimate bounds for seven different
color spaces. Starting from the bounding models proposed in the literature,
the authors re-estimate the thresholds in order to pursue a precision,
recall or trade-off strategy using

$$\mathrm{fitness} = \frac{\mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}} \qquad (2.2)$$
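As an illustration of the threshold-based approach, the sketch below applies one static RGB rule widely quoted in the skin detection literature and evaluates it with the fitness of equation (2.2); the specific bounds are an example, not those estimated in [9]:

```python
import numpy as np

def skin_mask_rgb(img):
    """One classic static RGB threshold rule for skin (illustrative).

    img: (H, W, 3) uint8 array in RGB order; returns a boolean mask.
    """
    r = img[..., 0].astype(int)
    g = img[..., 1].astype(int)
    b = img[..., 2].astype(int)
    spread = img.max(axis=2).astype(int) - img.min(axis=2).astype(int)
    return ((r > 95) & (g > 40) & (b > 20) & (spread > 15)
            & (np.abs(r - g) > 15) & (r > g) & (r > b))

def fitness(pred, truth):
    """Fitness of equation (2.2), combining precision and recall."""
    tp = np.logical_and(pred, truth).sum()
    precision = tp / max(pred.sum(), 1)   # correct among detected
    recall = tp / max(truth.sum(), 1)     # detected among true skin
    if precision + recall == 0:
        return 0.0
    return recall * precision / (recall + precision)
```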