Abstract
We address the problem of understanding and imitating human behaviours in com-
puter vision and cognitive robotics applications. In this thesis we show how to design
and develop a complete real-time arm-hand gesture recognition system: starting from
videos, we want to define descriptors able to capture human gestures and complex ac-
tions. We use a motion-based approach without any prior knowledge of the subjects in
the scene. Our aim is to define gesture primitives that are as general as possible, regardless
of the subject that performs the action. Using primitives we can divide the gesture
recognition problem into two levels: a low and a high level. In the first one we recognize
gesture primitives using a Mixture of Gaussians for each primitive, while the high level
system combines sequences of primitives using Deterministic Finite Automata (DFA).
We provide general gesture descriptors, we model them, and we show how to learn new
gestures from just one-shot demonstration. Finally, we show how to implement two
different applications based on the above system. The whole thesis has been developed
and implemented at Imperial College London, within the BioART laboratory, under
the supervision of Dr. Yiannis Demiris.
Part I
Overview
Chapter 1
Introduction
The analysis of human movement, and in particular the recognition of actions, are re-
search areas that are generating growing interest and beginning to produce a variety
of applications: they require the integration of different fields, since they are based on
the application of principles and methods from areas such as Pattern Recognition and
Computer Vision, as well as Probability Theory. Examples of recent applications based
on the recognition of human activities have mainly focused on Surveillance Applications,
Control Applications or Learning by Imitation Applications. For surveillance
applications it is very important to automatically monitor and understand people's
behaviour and detect abnormal activities. A control application could be in the video-game
domain: the gamer can interact with the system through whole-body movements, so the
system should be able to recognize the commands in a robust way and hopefully without
much delay. Another important application is teaching a robot how to perform
tasks, e.g. in a factory where operators can train their robots simply from
demonstrations. The weakness of such systems is that they may require the user to
wear special suits while, from an application point of view, computer vision methods
are suitable only if they provide non-invasive solutions.
There is neurological evidence that human actions are connected to the motor
control of human body [35, 68, 69]: when a human is viewing other agents performing
actions, the visual system seems to relate the visual input to a sequence of motor prim-
itives. The neurological representation for visually perceived, learned and recognized
actions appears to be the same as the one used to drive the motor control of the body.
These conclusions have gained notable attention from the robotics community [71] in
the field of imitation learning. In fact, learning by imitation researchers are focusing
on developing robot systems able to recognize, learn and imitate or assist human be-
haviours [21]. These findings motivate our motion-based approach: to identify a set
of action primitives that allows us to represent human gestures, to combine them into
complex actions, and possibly to find a way to imitate them. In this sense, according
to the classification of human motion analysis provided by Moeslund et al. [58], our
approach falls into the category of action primitives and grammars, as no explicit
reference to a human model is used in the behaviour modelling. Here we shall focus
only on actions performed by hands and arms, although we extend the action class
beyond the concept of gestures (as specified, e.g., in the survey by Mitra et al. [57]). In fact,
potentially any general action performable by hand and arm can be included in our
approach, thus encompassing the specific interpretation of gestures as movements for
interaction and the expression of emotion.
1.1 State of the Art
In this Section we provide a short description of the current State of the Art, based
on the most recent surveys [50, 58, 65] in the field of Gesture Recognition. We present
the main ways to develop a Gesture Recognition System, then we briefly introduce our
own gesture recognition system, the contents of the following chapters and the outline
of the discussion.
In Fig. 1.1 we show a complete gesture recognition system; the following main steps
are required to successfully recognize human actions and gestures:
• Scene Segmentation and Tracking: the system segments the objects in the scene
and detects which are the relevant ones for the tracking
• Features Extraction: the system extracts robust descriptors that characterize the
actions satisfactorily
• Action Modeling and Recognition: actions are modeled by the system, which
should then recognize gestures correctly
1.1.1 Modeless vs Model-Based Approaches
The first step towards a good gesture recognition system is the detection of the relevant
subjects, which allows us to obtain a good interpretation of the current scene. Mainly we
Figure 1.1: General Gesture Recognition System
can have two different kinds of scene segmentation: the first one is modeless,
while the second one uses a model of the human body to capture gestures and actions.
In the modeless approach, systems try to learn and recognize activities by observing
the motion of objects without necessarily knowing their identity. A pioneering work
has been presented by Efros et al. [28]; they tried to recognize a set of simple actions
of people whose images in the video are very small and where the video quality is
poor. They use a set of features that are based on blurred optical flow. Robertson and
Reid [70] extended the work of Efros by developing a system in which complex actions
could be composed from the set of simple actions. A modeless approach is also possible
in a 3D representation: Kakadiaris and Metaxas [49], for instance, realized a method
to recognize body parts without using an a priori model of the human body, based on a
spatio-temporal analysis of the silhouette of the moving person.
Model-based approaches try to exploit a priori knowledge about the human body; they
require a model of the human body to segment and track the parts we are interested in. This
approach matches the image sequence with the model data. Some approaches [20]
start out with silhouettes and detect the body parts using a method inspired by the
W4-system [42], which seems to work well under the assumption of a good foreground-
background separation and a large enough number of pixels on the observed agent. Other
systems use 3D-model based body tracking approaches where the recognition of action
is used as a loop-back to support pose estimation [3, 25, 62, 74].
In our system we decided to rely on modeless approaches: our primary aim is the
definition of more general motion primitives, regardless of the subject that performs the
action; moreover, model-based approaches are often unfeasible under noisy and
imperfect conditions, while modeless techniques remain reasonably practicable.
Human Tracking
In recent years tracking algorithms have focused primarily on surveillance applications,
leading to advances in areas such as outdoor tracking, tracking through occlusion, and
detection of humans in still images. The notion of tracking is strictly correlated with the
segmentation of the scene (which can be achieved using either a model-based approach
or a modeless one). In fact, after detecting the subject of interest it is necessary to track
and predict their poses over time. Given the state of N persons in the previous frames
and the current input frame, we are looking for the states of the same persons in the
current frame. Here the state is mainly the image position of a person, but can contain
other attributes, e.g., 3D position, color, and shape. Previously tracking algorithms
were mostly tested in controlled environments and with only a few people present in
the scene. Recently, algorithms have addressed more natural outdoor scenarios where
multiple people and occlusions are present. One important problem is how to handle
multiple people that might occlude each other. Once tracking has commenced, the
problem is to find the temporal correspondences between predicted and measured
states. This has recently been approached using a correspondence matrix, which has
the predicted objects in one direction and the measured objects in the other.
For each entry in the matrix a distance between the predicted and the measured object is
calculated, which gives the likelihood that the predicted and measured object are the same
[4, 15, 40] (see the sketch at the end of this subsection). Alternatively, global optimizations
can also be applied. Polat et al. [64] use a Multiple Hypothesis Tracker to construct
different hypotheses, each of which explains all the predictions and measurements,
and choose the hypothesis which is most likely. Objects are allowed to enter and exit
the scene, meaning that the number of elements in the state vector can change. To
handle this, the particle filter is enhanced with a trans-dimensional Markov chain
Monte Carlo approach [39], which allows new objects to
enter and other objects to leave the scene. Advances in human tracking are motivated
by the increased focus on surveillance applications. For example, in order to have
fully autonomous systems operating in uncontrolled environments the segmentation
methods have to be adaptive. This has to some extent been achieved within background
subtraction where analysis of video sequences of several hours has been reported [31].
However, for 24h operation special cameras (and algorithms) are required. Work in
this direction has started [10, 19] but no one has so far been able to report a truly
autonomous system.
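As a concrete illustration of the correspondence-matrix idea described above, the following minimal sketch associates predicted and measured person positions frame by frame. It is not code from the cited works: the Euclidean distance, the greedy assignment and the gating threshold max_dist are illustrative assumptions.

```python
import numpy as np

def associate(predicted, measured, max_dist=50.0):
    """Greedy frame-to-frame association through a correspondence matrix.

    predicted: (N, 2) array of predicted person positions (image coordinates)
    measured:  (M, 2) array of measured person positions in the current frame
    Returns a list of (prediction_index, measurement_index) matches.
    """
    if len(predicted) == 0 or len(measured) == 0:
        return []
    # Correspondence matrix: one row per prediction, one column per measurement,
    # each entry holding the distance between the two states.
    dist = np.linalg.norm(predicted[:, None, :] - measured[None, :, :], axis=2)
    matches, used_p, used_m = [], set(), set()
    # Greedily take the closest remaining pair, skipping pairs beyond the distance gate.
    for flat in np.argsort(dist, axis=None):
        p, m = np.unravel_index(flat, dist.shape)
        if p in used_p or m in used_m or dist[p, m] > max_dist:
            continue
        matches.append((p, m))
        used_p.add(p)
        used_m.add(m)
    return matches
```

In a real tracker the state would also include colour, shape or 3D position, and the greedy step could be replaced by a global optimization such as the Hungarian algorithm.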
Figure 1.2: Examples of Motion History Images
1.1.2 Features Extraction
We describe here existing approaches for feature extraction from image sequences. In
an ideal world, features should be invariant to viewpoint, background, person and action
execution. At the same time they must allow a robust classification of the action. We
can divide image representations into global and local representations.
In the global representation we first localize the subject using the modeless or model-based
approaches described above, then the region of interest (ROI) is encoded as a whole,
which results in the image descriptor. Usually global descriptors are obtained from the
silhouette, the contours, the optical flow, etc. One of the most used kinds of descriptors
is the Hu moment, through which two templates can be compared; Bobick and Davis [6] for instance used a
binary motion energy image (MEI), which indicates where the motion is, and a motion
history image (MHI), where pixel intensities are a function of the temporal history of
the motion at that point; model shapes are then matched against the MEI and MHI by Hu
moment comparison. Wang et al. [77] computed a transform of the silhouette to obtain
translation- and scale-invariant descriptors. Silhouettes and contours are also used by
Li [52], who extracted information such as sampled points from the contours of the silhouette
and the motion of the gravity center. Silhouettes can also be obtained from multiple
cameras: Weinland et al. [78], for instance, combined silhouettes from multiple cameras
into a 3D model and used motion history volumes (an extension of the MHI) as the main
descriptor. Completely different descriptors can be extracted from motion information,
such as the optical flow; for instance Ali and Shah [2] derived several kinematic features
from the optical flow, like symmetry, divergence, etc.
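To make the MEI/MHI idea above concrete, the following is a minimal NumPy sketch of how a motion history image can be updated frame by frame; it is an assumed textbook-style formulation, not the implementation of Bobick and Davis [6].

```python
import numpy as np

def update_mhi(mhi, motion_mask, timestamp, duration=1.0):
    """Update a Motion History Image (MHI).

    mhi:         (H, W) float array holding, per pixel, the time of the last motion
    motion_mask: (H, W) boolean array, True where motion was detected in this frame
    timestamp:   current time in seconds
    duration:    how long (in seconds) past motion remains visible
    """
    mhi = mhi.copy()
    mhi[motion_mask] = timestamp              # moving pixels get the current time
    mhi[mhi < timestamp - duration] = 0.0     # forget motion older than `duration`
    return mhi

def mei_from_mhi(mhi):
    """The Motion Energy Image (MEI) is the binarised MHI: where motion occurred."""
    return (mhi > 0).astype(np.uint8)
```

The Hu moments of the resulting MEI/MHI templates (e.g. cv2.HuMoments(cv2.moments(mei)) in OpenCV) can then be compared between observation and model, for instance with a Mahalanobis distance, as discussed in Section 1.1.3.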
An intermediate approach between global and local representations is the global grid-
based representation. The ROI is divided into a fixed spatial or temporal grid; this provides
a partial solution to occlusions and changes in viewpoint. A grid-based approach on
the silhouette is used by Ragheb et al. [67], who compute for each spatial location the
silhouette transformation in the frequency domain. A grid-based approach on the optical
flow was instead used by Danafar [16], who extends Efros' work [28] by dividing the ROI into
horizontal slices that contain the head, body and legs.
Local representations describe the image as a collection of local descriptors. They
do not require accurate localization and background subtraction, and they are often
invariant to viewpoint and person appearance. Patches are sampled either densely or at space-
time interest points. Usually these points correspond to changes of movement that
occur in the video, so it is assumed that these locations are the relevant ones for the
recognition of human action. Laptev and Lindeberg [51] extended the Harris corner detec-
tor [43] to 3D, while Scovanner et al. [73] extended the SIFT descriptor [54] to 3D. One
drawback of these methods is the small number of stable interest points. This issue is
addressed by Dollàr et al. [26] who applied Gabor filtering on the spatial and temporal
dimensions individually. The spatial and temporal size of a patch is usually determined
by the scale of the interest points. Schüldt et al. [72] calculate patches of normalized
derivatives in space and time. Dollàr et al. [26] experiment with both image gradients
and optical flow. Few works have tried to address the viewpoint effect on human action
recognition: Farhadi and Tabrizi [29] explicitly address the correlations between actions
observed from different views using a split-based representation to describe clusters of
codewords in each view. The transfer of these splits between views is learned from
multi-view action sequences.
In contrast to these more general descriptors, there are a number of works that
use representations strictly motivated by the application domain. Joint angles are a rich
representation, but it is challenging to derive them from video sequences. In 3D the
representation is completely view-invariant, but often it is not possible to provide a
multi-view camera system.
1.1.3 Action Modeling and Recognition
When an image descriptor is available for the whole video sequence, human gesture
recognition becomes a classification problem: we have to assign a label to each frame
or sequence.
The first approach to model and classify an action is called Direct Classification.
With this method there is no explicit modeling of actions: all the frames of an
observed sequence are merged into one single representation or the recognition is per-
formed for each frame individually. An example is the k-Nearest Neighbor (NN) clas-
sifier: it uses the distance between the image representation of an observed sequence and
those in the data-set; the label is chosen among the k closest training sequences. For a
large data-set this approach can be computationally expensive. NN classification can
be performed either at frame level or for whole sequences, for example by using major-
ity voting over all frames in a sequence. Crucial is the choice of the distance metric
between frames or sequences: Blank et al. [38] use 1-NN with Euclidean distance, but
often the Euclidean distance might not be the most suitable choice. Bobick and Davis
[6] use Hu moments of different orders of magnitude and then Mahalanobis distance is
used to take into account the variance of each dimension. It has also been observed
that many actions can be represented by key poses or prototypes: Sullivan and Carlsson
[76] recognize forehand and backhand tennis strokes by matching edge representations
to labeled key poses.
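As an illustration of the direct classification scheme just described, the sketch below labels each frame with a k-NN vote over the training frames and then classifies the whole sequence by majority voting; the Euclidean metric and the function names are illustrative assumptions, not a specific published implementation.

```python
import numpy as np
from collections import Counter

def knn_frame_label(frame_desc, train_descs, train_labels, k=5):
    """Label a single frame descriptor with the majority label of its k nearest
    training frames (Euclidean distance)."""
    dists = np.linalg.norm(train_descs - frame_desc, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

def classify_sequence(seq_descs, train_descs, train_labels, k=5):
    """Classify a whole sequence by majority voting over its per-frame labels."""
    votes = [knn_frame_label(f, train_descs, train_labels, k) for f in seq_descs]
    return Counter(votes).most_common(1)[0][0]
```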
Another approach is the temporal state-space model, which consists of states con-
nected by edges; each edge models the probabilities of transitions between states and between
states and observations. The most used methods in the temporal state-space family
are Hidden Markov Models (HMMs) [66]: they are a generative approach that tries to
model action classes using the observations in the training set, unlike discriminative
techniques, which do not model a class but rather focus on the differences between classes.
Probably the first publication using HMMs in this field is the celebrated paper by
Yamato et al. [79]; since then, HMMs have been increasingly applied to
gesture recognition: Feng and Perona [30] use a static HMM where key-poses corre-
spond to states. They effectively train the dynamics at the cost of reduced flexibility
due to a simpler observation model; Ahmad and Lee [1] take into account multiple
viewpoints and use a multi-dimensional HMM to deal with the different observations.
Instead of modeling the human body as a single observation, one HMM can be used for
each body-part, making the training easier; in addition, composite movements that
are not in the training set can be recognized. İkizler and Forsyth [47] construct HMMs
for legs and arms individually, where 3D trajectories are the observations. For each
limb, states of different action models with similar emission probabilities are bound,
which allows automatic segmentation of actions.
Figure 1.3: Low Level System Overview
1.2 Low level System Description
We have given an overview of gesture recognition systems; now we briefly describe
the main parts of our system. Unlike the approaches described above, we do not
try to model whole actions, but rather work on different levels: we split an action
into its primitives, model and recognize these primitives, and then combine them
to recognize the whole gesture. Additionally, most of the approaches described in the
previous section have focused on learning particular actions, so they prefer to train
a system on a finite number of specific actions; in this way the actions that can be
recognized by these systems are bound to those in their data-set. With an incremental
approach it is possible to generalize the concept of human actions, and the system can learn
new actions in real-time from a single one-shot demonstration at a low computational
cost.
Our Low Level System in Fig. 1.3 follows the idea of the general recognition system
in Fig. 1.1, with the main difference that our system focuses on gesture primitives
rather than on the whole action.
1.2.1 Scene Segmentation and Tracking
We use probabilistic background modeling: for each pixel in the image we fit a
Gaussian, computing its mean and variance over a fixed number of frames. After obtaining
a good foreground segmentation, we use a mean-shift tracker over a probability distri-
bution given by the optical flow. In this way we are able to catch human movements
and track them in real-time without delay.
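The following sketch outlines this idea in OpenCV/NumPy terms: a per-pixel Gaussian background model and one mean-shift step over a probability map derived from the optical flow. It is a simplified illustration under our own assumptions (Farnebäck flow, a fixed 2.5-sigma threshold), not the exact implementation detailed in Part II.

```python
import numpy as np
import cv2

def fit_background(frames):
    """Fit a per-pixel Gaussian (mean, std) over a fixed number of grayscale frames."""
    stack = np.stack(frames).astype(np.float32)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def foreground_mask(gray, mean, std, k=2.5):
    """A pixel is foreground if it deviates from its background Gaussian by > k sigmas."""
    return (np.abs(gray.astype(np.float32) - mean) > k * std).astype(np.uint8)

def track_step(prev_gray, gray, fg, window):
    """One mean-shift step over a probability map built from the optical-flow magnitude."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    prob = np.linalg.norm(flow, axis=2) * fg            # motion energy inside the foreground
    prob = cv2.normalize(prob, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, window = cv2.meanShift(prob, window, criteria)
    return window                                       # (x, y, w, h) of the tracked ROI
```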
Figure 1.4: High Level System Overview
1.2.2 Features Extraction
Starting from the Region of Interest, we extract motion-based features exploiting the
optical flow in the image. Given the optical flow vector w = (u, v, t) for each pixel
in the image, we can map this volume into a lower-dimensional space, i.e. the 2k
principal directions in the image. A frame descriptor is thus reduced to a single vector
X(t) ∈ R^{2k}.
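The exact construction of the descriptor is given in Chapter II; the sketch below only illustrates, under our own simplifying assumptions, one way a 2k-dimensional vector X(t) could be obtained by keeping the k dominant flow directions of the ROI (the orientation-histogram binning and the function name are hypothetical).

```python
import numpy as np

def frame_descriptor(flow_roi, k=4, n_bins=36):
    """Reduce the optical-flow field of the ROI to a single vector X(t) of length 2k.

    flow_roi: (H, W, 2) array of (u, v) flow vectors inside the region of interest
    k:        number of dominant (principal) flow directions to keep
    """
    vecs = flow_roi.reshape(-1, 2).astype(np.float64)
    mags = np.linalg.norm(vecs, axis=1)
    angs = np.arctan2(vecs[:, 1], vecs[:, 0])                 # orientation of each vector
    bins = ((angs + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    energy = np.bincount(bins, weights=mags, minlength=n_bins)
    top = np.argsort(energy)[::-1][:k]                        # k most energetic directions
    desc = np.zeros((k, 2))
    for i, b in enumerate(top):
        sel = bins == b
        if sel.any():
            desc[i] = vecs[sel].mean(axis=0)                  # mean (u, v) of that direction
    return desc.flatten()                                     # X(t) in R^{2k}
```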
1.2.3 Simple Gesture Modeling and Recognition
Once each frame is encoded into the above-defined descriptor, we can obtain paramet-
ric descriptors by estimating, for each action primitive, a mixture of Gaussians. The
mixtures are then directly used for the classification of gesture primitives.
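A minimal sketch of this step, using scikit-learn's GaussianMixture as a stand-in for our own mixture estimation (the number of components and the dictionary layout are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_primitive_models(descriptors_by_primitive, n_components=3):
    """Fit one Gaussian mixture per gesture primitive.

    descriptors_by_primitive: dict mapping primitive name -> (N, 2k) array of X(t) vectors
    """
    return {name: GaussianMixture(n_components=n_components).fit(X)
            for name, X in descriptors_by_primitive.items()}

def classify_frame(x, models):
    """Assign a frame descriptor x to the primitive whose mixture yields the
    highest log-likelihood."""
    x = np.atleast_2d(x)
    return max(models, key=lambda name: models[name].score(x))
```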
1.3 High level System Description
Once we have modeled gesture primitives, we have a good real-time recognition system able to
handle them. Now it is important to find the simplest and cheapest way to combine them
in order to recognize and learn more complex gestures. A grammar-based approach
seems the most suitable. Each complex action is a sequence of primitives, so
complex actions can be represented as a language. Every regular language can be modeled by a Finite State
Machine (FSM), either deterministic or not; thus the recognition of complex actions
is reduced to the acceptance of a language. We have also found a solution, working
under proper assumptions, to the problem of learning FSMs automatically given only
a one-shot example.
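To fix ideas, the sketch below shows how a complex gesture, seen as a string of recognized primitives, is accepted or rejected by a DFA; the "wave" gesture and the primitive names are purely hypothetical examples, not gestures from our data-set.

```python
class DFA:
    """A deterministic finite automaton over the alphabet of gesture primitives."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions          # dict: (state, primitive) -> next state
        self.start = start
        self.accepting = set(accepting)

    def accepts(self, primitives):
        """True if the sequence of recognized primitives spells a known complex gesture."""
        state = self.start
        for p in primitives:
            state = self.transitions.get((state, p))
            if state is None:                   # no transition defined: reject
                return False
        return state in self.accepting

# Hypothetical example: a "wave" gesture as the primitive sequence left, right, left.
wave = DFA(transitions={(0, "left"): 1, (1, "right"): 2, (2, "left"): 3},
           start=0, accepting=[3])
assert wave.accepts(["left", "right", "left"])
assert not wave.accepts(["left", "left"])
```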
1.4 Contribution of the Thesis
It is clear that an exhaustive exploration of every part of the above system requires a
lot of time, so we do not claim that the system is beyond the state of the art at each
step. According to the schemes in Fig. 1.3 and 1.4, we have focused on Simple Gesture
Modeling/Recognition and Complex Gesture Modeling/Recognition. Many other
implementations are possible, using various solutions at the different steps; the main
contribution of this work is the implementation of a complete, real-time gesture
recognition system and the new approach of splitting the general gesture recognition
problem into two levels. It is in the steps of modeling gesture primitives and complex ges-
tures that we propose an alternative solution to existing ones, showing and proving
where and how our technique improves on the others.
1.5 Chapters Organization
Let us now briefly introduce the following parts; in order to make the reading lighter,
each chapter is self-contained. We first provide all the theoretical notions required, then
our solution and implementation are described.
In Part II we describe each single step of the low level recognition system including
preliminaries and implementation:
• Chapter II - we show how to describe videos with compact descriptors
• Chapter III - we describe how to model simple gestures using these descriptors
• Chapter IV - we implement an on-line recognition system for primitives
In Part III we show how to combine gesture primitives and how to implement a full
real-time action learning and recognition system:
• Chapter V - the high level system is presented with all preliminaries on grammars
and languages
Finally, Part IV is more experiment-oriented: we present all results, practical appli-
cations and future work:
• Chapter VI - we show experimental results both for offline and on-line gesture
recognition