Abstract
We address the problem of understanding and imitating human behaviours in com-
puter vision and cognitive robotics applications. In this thesis we show how to design
and develop a complete real-time arm-hand gesture recognition system: starting from
videos, we want to define descriptors able to capture human gestures and complex ac-
tions. We use a motion-based approach without any prior knowledge of the subjects in
the scene. Our aim is to define gesture primitives that are as general as possible, regardless
of the subject that performs the action. Using primitives we can divide the gesture
recognition problem into two levels: a low and a high level. In the first one we recognize
gesture primitives using a Mixture of Gaussians for each primitive, while the high level
system combines sequences of primitives using Deterministic Finite Automata (DFA).
We provide general gesture descriptors, we model them, and we show how to learn new
gestures from just one-shot demonstration. Finally, we show how to implement two
different applications based on the above system. The whole thesis has been developed
and implemented at Imperial College London, within the BioART laboratory, under
the supervision of Dr. Yiannis Demiris.
Part I
Overview
Chapter 1
Introduction
The analysis of human movement, and in particular the recognition of actions, are re-
search areas that are generating growing interest and beginning to produce a variety
of applications: they require the integration of different fields, since they are based on
the application of principles and methods from areas such as Pattern Recognition and
Computer Vision, as well as Probability Theory. Examples of recent applications based
on the recognition of human activities have mainly focused on Surveillance Applications,
Control Applications or Learning by Imitation Applications. For surveillance
applications it is very important to automatically monitor and understand people's
behaviour and detect abnormal activities. A control application could be in the video-game
domain: the gamer can interact with the system through whole-body movements, so the
system should be able to recognize the commands in a robust way and hopefully without
much delay. Another important application is teaching a robot how to perform
tasks, e.g. in a factory where operators can train their robots simply from
demonstrations. The weakness of such systems is that they may require the user to
wear special suits while, from an application point of view, computer vision methods
are suitable only if they provide non-invasive solutions.
There is neurological evidence that human actions are connected to the motor
control of human body [35, 68, 69]: when a human is viewing other agents performing
actions, the visual system seems to relate the visual input to a sequence of motor prim-
itives. The neurological representation for visually perceived, learned and recognized
actions appears to be the same as the one used to drive the motor control of the body.
These conclusions have gained notable attention from the robotics community [71] in
the field of imitation learning. In fact, learning by imitation researchers are focusing
on developing robot systems able to recognize, learn and imitate or assist human be-
haviours [21]. These findings motivate our motion-based approach: to identify a set
of action primitives that allows us to represent human gestures, to combine them into
complex actions, and possibly to find a way to imitate them. In this sense, according
to the classification of human motion analysis provided by Moeslund et al. [58], our
approach falls into the category of action primitives and grammars, as no explicit
reference to a human model is used in the behaviour modelling. Here we shall focus
only on actions performed by hands and arms, although we extend the action class
beyond the concept of gestures (as specified, e.g., in the survey by Mitra et al. [57]). In fact,
potentially any general action performable by hand and arm can be included in our
approach, thus encompassing the specific interpretation of gestures as movements for
interaction and the expression of emotion.
1.1 State of the Art
In this Section we provide a short description of the current State of the Art, based
on the most recent surveys [50, 58, 65] in the field of Gesture Recognition. We present
the main ways to develop a Gesture Recognition System, then we briefly introduce our
own gesture recognition system, the contents of the following chapters and the outline
of the discussion.
In Fig. 1.1 we show a complete gesture recognition system; the following main steps
are required to successfully recognize human actions and gestures:
• Scene Segmentation and Tracking: the system segments the objects in the scene
and detects which are the relevant ones for the tracking
• Features Extraction: the system extracts robust descriptors that characterize the
actions satisfactorily
• Action Modeling and Recognition: actions are modeled by the system, which
should then recognize gestures correctly
1.1.1 Modeless vs Model-Based Approaches
The first step towards a good gesture recognition system is the detection of the relevant
subjects, which allows us to obtain a good interpretation of the current scene. Mainly we
Figure 1.1: General Gesture Recognition System
can have two different kinds of scene segmentation: the first one is modeless,
while the second one uses a model of the human body to capture gestures and actions.
In the modeless approach, systems try to learn and recognize activities by observing
the motion of objects without necessarily knowing their identity. A pioneering work
has been presented by Efros et al. [28]; they tried to recognize a set of simple actions
of people whose images in the video are very small and where the video quality is
poor. They use a set of features that are based on blurred optical flow. Robertson and
Reid [70] extended the work of Efros by developing a system in which complex actions
could be composed from the set of simple actions. A modeless approach is also possible
in a 3D representation: Kakadiaris and Metaxas [49], for instance, realized a method
to recognize body parts without using an a priori model of the human body, based on a
spatio-temporal analysis of the silhouette of the moving person.
Model-based approaches try to exploit a priori knowledge about the human body; they
require a model of the human body to segment and track the parts we are interested in. This
approach matches the image sequence with the model data. Some approaches [20]
start out with silhouettes and detect the body parts using a method inspired by the
W4-system [42], which seems to work well under the assumption of a good foreground-
background separation and a large enough number of pixels on the observed agent. Other
systems use 3D-model based body tracking approaches where the recognition of action
is used as a loop-back to support pose estimation [3, 25, 62, 74].
In our system we decided to rely on modeless approaches: our primary aim is the
definition of more general motion primitives, regardless of the subject that performs the
action; moreover, model-based approaches are often unfeasible under noisy and
imperfect conditions, while modeless techniques remain reasonably practicable.
Human Tracking
In recent years tracking algorithms have focused primarily on surveillance applications,
leading to advances in areas such as outdoor tracking, tracking through occlusion, and
detection of humans in still images. The notion of tracking is strictly correlated with the
segmentation of the scene (which can be achieved using either a model-based approach
or a modeless one). In fact, after detecting the subject of interest it is necessary to track
and predict their poses over time. Given the state of N persons in the previous frames
and the current input frame, we are looking for the states of the same persons in the
current frame. Here the state is mainly the image position of a person, but can contain
other attributes, e.g., 3D position, color, and shape. Previously tracking algorithms
were mostly tested in controlled environments and with only a few people present in
the scene. Recently, algorithms have addressed more natural outdoor scenarios where
multiple people and occlusions are present. One important problem is how to handle
multiple people that might occlude each other. Once tracking has commenced, the
problem is to find the temporal correspondences between predicted and measured
states. This has recently been approached using a correspondence matrix, which has
the predicted objects in one direction and the measured objects in the other.
For each entry in the matrix a distance between the predicted and the measured object is
calculated, which gives the likelihood that the predicted and measured object are the same
[4, 15, 40] (see the sketch at the end of this subsection). Alternatively, global optimizations
can also be applied. Polat et al. [64] use a Multiple Hypothesis Tracker to construct
different hypotheses, each of which explains all the predictions and measurements,
and choose the hypothesis which is most likely. Objects are allowed to enter and exit
the scene, meaning that the number of elements in the state vector can change. To
handle this, the particle filter is enhanced with a trans-dimensional Markov chain
Monte Carlo approach [39], which allows new objects to
enter and other objects to leave the scene. Advances in human tracking are motivated
by the increased focus on surveillance applications. For example, in order to have
fully autonomous systems operating in uncontrolled environments the segmentation
methods have to be adaptive. This has to some extent been achieved within background
subtraction where analysis of video sequences of several hours has been reported [31].
However, for 24h operation special cameras (and algorithms) are required. Work in
this direction has started [10, 19] but no one has so far been able to report a truly
autonomous system.
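As a concrete illustration of the correspondence-matrix idea described above, the following minimal sketch associates predicted and measured person positions frame by frame. It is not code from the cited works: the Euclidean distance, the greedy assignment and the gating threshold max_dist are illustrative assumptions.

```python
import numpy as np

def associate(predicted, measured, max_dist=50.0):
    """Greedy frame-to-frame association through a correspondence matrix.

    predicted: (N, 2) array of predicted person positions (image coordinates)
    measured:  (M, 2) array of measured person positions in the current frame
    Returns a list of (prediction_index, measurement_index) matches.
    """
    if len(predicted) == 0 or len(measured) == 0:
        return []
    # Correspondence matrix: one row per prediction, one column per measurement,
    # each entry holding the distance between the two states.
    dist = np.linalg.norm(predicted[:, None, :] - measured[None, :, :], axis=2)
    matches, used_p, used_m = [], set(), set()
    # Greedily take the closest remaining pair, skipping pairs beyond the distance gate.
    for flat in np.argsort(dist, axis=None):
        p, m = np.unravel_index(flat, dist.shape)
        if p in used_p or m in used_m or dist[p, m] > max_dist:
            continue
        matches.append((p, m))
        used_p.add(p)
        used_m.add(m)
    return matches
```

In a real tracker the state would also include colour, shape or 3D position, and the greedy step could be replaced by a global optimization such as the Hungarian algorithm.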
Figure 1.2: Examples of Motion History Images
1.1.2 Features Extraction
We describe here existing approaches for feature extraction from image sequences. In
an ideal world, features should be invariant to viewpoint, background, person and action
execution. At the same time they must allow a robust classification of the action. We
can divide image representations into global and local representations.
In the global representation we first localize the subject using the modeless or model-based
approaches described above, then the region of interest (ROI) is encoded as a whole,
which results in the image descriptor. Usually global descriptors are obtained from the
silhouette, the contours, the optical flow, etc. One of the most used kinds of descriptors
is the Hu moment, through which two templates can be compared; Bobick and Davis [6] for instance used a
binary motion energy image (MEI), which indicates where the motion is, and a motion
history image (MHI), where pixel intensities are a function of the temporal history of
the motion at that point; model shapes are then matched against the MEI and MHI by Hu
moment comparison. Wang et al. [77] computed a transform of the silhouette to obtain
translation- and scale-invariant descriptors. Silhouettes and contours are also used by
Li [52], who extracted information such as sampled points from the contours of the silhouette
and the motion of the gravity center. Silhouettes can also be obtained from multiple
cameras: Weinland et al. [78], for instance, combined silhouettes from multiple cameras
into a 3D model and used motion history volumes (an extension of the MHI) as the main
descriptor. Completely different descriptors can be extracted from motion information,
such as the optical flow; for instance Ali and Shah [2] derived several kinematic features
from the optical flow, like symmetry, divergence, etc.
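To make the MEI/MHI idea above concrete, the following is a minimal NumPy sketch of how a motion history image can be updated frame by frame; it is an assumed textbook-style formulation, not the implementation of Bobick and Davis [6].

```python
import numpy as np

def update_mhi(mhi, motion_mask, timestamp, duration=1.0):
    """Update a Motion History Image (MHI).

    mhi:         (H, W) float array holding, per pixel, the time of the last motion
    motion_mask: (H, W) boolean array, True where motion was detected in this frame
    timestamp:   current time in seconds
    duration:    how long (in seconds) past motion remains visible
    """
    mhi = mhi.copy()
    mhi[motion_mask] = timestamp              # moving pixels get the current time
    mhi[mhi < timestamp - duration] = 0.0     # forget motion older than `duration`
    return mhi

def mei_from_mhi(mhi):
    """The Motion Energy Image (MEI) is the binarised MHI: where motion occurred."""
    return (mhi > 0).astype(np.uint8)
```

The Hu moments of the resulting MEI/MHI templates (e.g. cv2.HuMoments(cv2.moments(mei)) in OpenCV) can then be compared between observation and model, for instance with a Mahalanobis distance, as discussed in Section 1.1.3.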
An intermediate approach between global and local representations is the global grid-
based representation. The ROI is divided into a fixed spatial or temporal grid; this provides
a partial solution to occlusions and changes in viewpoint. A grid-based approach on
the silhouette is used by Ragheb et al. [67], who compute for each spatial location the
silhouette transformation in the frequency domain. A grid-based approach on the optical
flow was instead used by Danafar [16], who extends Efros' work [28] by dividing the ROI into
horizontal slices that contain the head, body and legs.
Local representations describe the image as a collection of local descriptors. They
do not require accurate localization and background subtraction, and they are often
invariant to viewpoint and person appearance. Patches are sampled either densely or at space-
time interest points. Usually these points correspond to changes of movement that
occur in the video, so it is assumed that these locations are the relevant ones for the
recognition of human action. Laptev and Lindeberg [51] extended the Harris corner detec-
tor [43] to 3D, while Scovanner et al. [73] extended the SIFT descriptor [54] to 3D. One
drawback of these methods is the small number of stable interest points. This issue is
addressed by Dollàr et al. [26] who applied Gabor filtering on the spatial and temporal
dimensions individually. The spatial and temporal size of a patch is usually determined
by the scale of the interest points. Schüldt et al. [72] calculate patches of normalized
derivatives in space and time. Dollàr et al. [26] experiment with both image gradients
and optical flow. Few works have tried to address the viewpoint effect on human action
recognition: Farhadi and Tabrizi [29] explicitly address the correlations between actions
observed from different views using a split-based representation to describe clusters of
codewords in each view. The transfer of these splits between views is learned from
multi-view action sequences.
In contrast to these more general descriptors, there are a number of works that
use representations strictly motivated by the application domain. Joint angles are a rich
representation, but it is challenging to derive them from video sequences. In 3D the
representation is completely view-invariant, but often it is not possible to provide a
multi-view camera system.
1.1.3 Action Modeling and Recognition
When an image descriptor is available for the whole video sequence, human gesture
recognition becomes a classification problem: we have to assign a label to each frame
or sequence.
The first approach to model and classify an action is called Direct Classification.
With this method there is no explicit modeling of actions: all the frames of an
observed sequence are merged into one single representation or the recognition is per-
formed for each frame individually. An example is the k-Nearest Neighbor (NN) clas-
sifier: it uses the distance between the image representation of an observed sequence and
those in the data-set; the label is chosen among the k closest training sequences. For a
large data-set this approach can be computationally expensive. NN classification can
be performed either at frame level or for whole sequences, for example by using major-
ity voting over all frames in a sequence. Crucial is the choice of the distance metric
between frames or sequences: Blank et al. [38] use 1-NN with Euclidean distance, but
often the Euclidean distance might not be the most suitable choice. Bobick and Davis
[6] use Hu moments of different orders of magnitude and then Mahalanobis distance is
used to take into account the variance of each dimension. It has also been observed
that many actions can be represented by key poses or prototypes: Sullivan and Carlsson
[76] recognize forehand and backhand tennis strokes by matching edge representations
to labeled key poses.
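As an illustration of the direct classification scheme just described, the sketch below labels each frame with a k-NN vote over the training frames and then classifies the whole sequence by majority voting; the Euclidean metric and the function names are illustrative assumptions, not a specific published implementation.

```python
import numpy as np
from collections import Counter

def knn_frame_label(frame_desc, train_descs, train_labels, k=5):
    """Label a single frame descriptor with the majority label of its k nearest
    training frames (Euclidean distance)."""
    dists = np.linalg.norm(train_descs - frame_desc, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

def classify_sequence(seq_descs, train_descs, train_labels, k=5):
    """Classify a whole sequence by majority voting over its per-frame labels."""
    votes = [knn_frame_label(f, train_descs, train_labels, k) for f in seq_descs]
    return Counter(votes).most_common(1)[0][0]
```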
Another approach is the temporal state-space model, which consists of states con-
nected by edges; each edge models the probabilities of transitions between states and between
states and observations. The most used methods in the temporal state-space family
are Hidden Markov Models (HMMs) [66]: they are a generative approach that tries to
model action classes using the observations in the training set, unlike discriminative
techniques, which do not model a class but rather focus on the differences between classes.
Probably the first publication using HMMs in this field is the celebrated paper by
Yamato et al. [79]; since then, HMMs have been increasingly applied to
gesture recognition: Feng and Perona [30] use a static HMM where key-poses corre-
spond to states. They effectively train the dynamics at the cost of reduced flexibility
due to a simpler observation model; Ahmad and Lee [1] take into account multiple
viewpoints and use a multi-dimensional HMM to deal with the different observations.
Instead of modeling the human body as a single observation, one HMM can be used for
each body-part, making the training easier; in addition, composite movements that
are not in the training set can be recognized. İkizler and Forsyth [47] construct HMMs
for legs and arms individually, where 3D trajectories are the observations. For each
limb, states of different action models with similar emission probabilities are bound,
which allows automatic segmentation of actions.
Figure 1.3: Low Level System Overview
1.2 Low level System Description
We have given an overview of gesture recognition systems; now we briefly describe
the main parts of our system. Unlike the approaches described above, we do not
try to model whole actions, but rather work on different levels: we split an action
into its primitives, model and recognize these primitives, and then combine them
to recognize the whole gesture. Additionally, most of the approaches described in the
previous section have focused on learning particular actions, so they prefer to train
a system on a finite number of specific actions; in this way the actions that can be
recognized by these systems are bound to those in their data-set. With an incremental
approach it is possible to generalize the concept of human actions, and the system can learn
new actions in real-time from a single one-shot demonstration at a low computational
cost.
Our Low Level System in Fig. 1.3 follows the idea of the general recognition system
in Fig. 1.1, with the main difference that our system focuses on gesture primitives
rather than on the whole action.
1.2.1 Scene Segmentation and Tracking
We use probabilistic background modeling: for each pixel in the image we fit a
Gaussian, computing its mean and variance over a fixed number of frames. After obtaining
a good foreground segmentation, we use a mean-shift tracker over a probability distri-
bution given by the optical flow. In this way we are able to catch human movements
and track them in real-time without delay.
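The following sketch outlines this idea in OpenCV/NumPy terms: a per-pixel Gaussian background model and one mean-shift step over a probability map derived from the optical flow. It is a simplified illustration under our own assumptions (Farnebäck flow, a fixed 2.5-sigma threshold), not the exact implementation detailed in Part II.

```python
import numpy as np
import cv2

def fit_background(frames):
    """Fit a per-pixel Gaussian (mean, std) over a fixed number of grayscale frames."""
    stack = np.stack(frames).astype(np.float32)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def foreground_mask(gray, mean, std, k=2.5):
    """A pixel is foreground if it deviates from its background Gaussian by > k sigmas."""
    return (np.abs(gray.astype(np.float32) - mean) > k * std).astype(np.uint8)

def track_step(prev_gray, gray, fg, window):
    """One mean-shift step over a probability map built from the optical-flow magnitude."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    prob = np.linalg.norm(flow, axis=2) * fg            # motion energy inside the foreground
    prob = cv2.normalize(prob, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, window = cv2.meanShift(prob, window, criteria)
    return window                                       # (x, y, w, h) of the tracked ROI
```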
Figure 1.4: High Level System Overview
1.2.2 Features Extraction
Starting from the Region of Interest, we extract motion-based features exploiting the
optical flow in the image. Given the optical flow vector w = (u, v, t) for each pixel
in the image, we can map this volume into a lower-dimensional space, i.e. the 2k
principal directions in the image. A frame descriptor is thus reduced to a single vector
X(t) ∈ R^{2k}.
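The exact construction of the descriptor is given in Chapter II; the sketch below only illustrates, under our own simplifying assumptions, one way a 2k-dimensional vector X(t) could be obtained by keeping the k dominant flow directions of the ROI (the orientation-histogram binning and the function name are hypothetical).

```python
import numpy as np

def frame_descriptor(flow_roi, k=4, n_bins=36):
    """Reduce the optical-flow field of the ROI to a single vector X(t) of length 2k.

    flow_roi: (H, W, 2) array of (u, v) flow vectors inside the region of interest
    k:        number of dominant (principal) flow directions to keep
    """
    vecs = flow_roi.reshape(-1, 2).astype(np.float64)
    mags = np.linalg.norm(vecs, axis=1)
    angs = np.arctan2(vecs[:, 1], vecs[:, 0])                 # orientation of each vector
    bins = ((angs + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    energy = np.bincount(bins, weights=mags, minlength=n_bins)
    top = np.argsort(energy)[::-1][:k]                        # k most energetic directions
    desc = np.zeros((k, 2))
    for i, b in enumerate(top):
        sel = bins == b
        if sel.any():
            desc[i] = vecs[sel].mean(axis=0)                  # mean (u, v) of that direction
    return desc.flatten()                                     # X(t) in R^{2k}
```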
1.2.3 Simple Gesture Modeling and Recognition
Once each frame is encoded into the above-defined descriptor, we can obtain paramet-
ric descriptors by estimating, for each action primitive, a mixture of Gaussians. The
mixtures are then directly used for the classification of gesture primitives.
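A minimal sketch of this step, using scikit-learn's GaussianMixture as a stand-in for our own mixture estimation (the number of components and the dictionary layout are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_primitive_models(descriptors_by_primitive, n_components=3):
    """Fit one Gaussian mixture per gesture primitive.

    descriptors_by_primitive: dict mapping primitive name -> (N, 2k) array of X(t) vectors
    """
    return {name: GaussianMixture(n_components=n_components).fit(X)
            for name, X in descriptors_by_primitive.items()}

def classify_frame(x, models):
    """Assign a frame descriptor x to the primitive whose mixture yields the
    highest log-likelihood."""
    x = np.atleast_2d(x)
    return max(models, key=lambda name: models[name].score(x))
```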
1.3 High level System Description
Once we have modeled gesture primitives, we have a good real-time recognition system able to
handle them. Now it is important to find the simplest and cheapest way to combine them
in order to recognize and learn more complex gestures. A grammar-based approach
seems the most suitable. Each complex action is a sequence of primitives, so
complex actions can be represented as a language. Every regular language can be modeled by a Finite State
Machine (FSM), either deterministic or not; thus the recognition of complex actions
is reduced to the acceptance of a language. We have also found a solution, working
under proper assumptions, to the problem of learning FSMs automatically given only
a one-shot example.
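To fix ideas, the sketch below shows how a complex gesture, seen as a string of recognized primitives, is accepted or rejected by a DFA; the "wave" gesture and the primitive names are purely hypothetical examples, not gestures from our data-set.

```python
class DFA:
    """A deterministic finite automaton over the alphabet of gesture primitives."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions          # dict: (state, primitive) -> next state
        self.start = start
        self.accepting = set(accepting)

    def accepts(self, primitives):
        """True if the sequence of recognized primitives spells a known complex gesture."""
        state = self.start
        for p in primitives:
            state = self.transitions.get((state, p))
            if state is None:                   # no transition defined: reject
                return False
        return state in self.accepting

# Hypothetical example: a "wave" gesture as the primitive sequence left, right, left.
wave = DFA(transitions={(0, "left"): 1, (1, "right"): 2, (2, "left"): 3},
           start=0, accepting=[3])
assert wave.accepts(["left", "right", "left"])
assert not wave.accepts(["left", "left"])
```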
1.4 Contribution of the Thesis
It is clear that an exhaustive exploration of every part of the above system requires a
lot of time, so we do not claim that the system is beyond the state of the art at each
step. According to the schemes in Fig. 1.3 and 1.4, we have focused on Simple Gesture
Modeling/Recognition and Complex Gesture Modeling/Recognition. Many other
implementations are possible, using various solutions at the different steps; the main
contribution of this work is the implementation of a complete, real-time gesture
recognition system and the new approach of splitting the general gesture recognition
problem into two levels. It is in the steps of modeling gesture primitives and complex ges-
tures that we propose an alternative solution to existing ones, showing and proving
where and how our technique improves on the others.
1.5 Chapters Organization
Let us now briefly introduce the following parts; in order to make the reading lighter,
each chapter is self-contained. We first provide all the theoretical notions required, then
our solution and implementation are described.
In Part II we describe each single step of the low level recognition system including
preliminaries and implementation:
• Chapter II - we show how to describe videos with compact descriptors
• Chapter III - we describe how to model simple gestures using these descriptors
• Chapter IV - we implement an on-line recognition system for primitives
In Part III we show how to combine gesture primitives and how to implement a full
real-time action learning and recognition system:
• Chapter V - the high level system is presented with all preliminaries on grammars
and languages
Finally, Part IV is more experiment-oriented: we present all results, practical appli-
cations and future work:
• Chapter VI - we show experimental results both for offline and on-line gesture
recognition