Summary
1 Summary
This thesis faces the problem of segmentation of human movement. In particular, it
focuses on investigating how observers identify ”phrases” in dance performances,
and on the development of computer algorithms to emulate such behaviour (Camurri
et al., 2004) based on a single videocamera signal.
As a preliminary step, some work has been carried out to segment movements in
motion and pause phases, using algorithms based on the Quantity of Motion (QoM)
(Trocca, 2001, Mazzarino, 2002, Volpe, 2003). An experiment with subjects
performing the same segmentation task was carried out to validate this algorithm,
whose results are described in chapter 8.
Some research about posture was also carried out to evaluate if and how it can
influence the execution of movements, and how it determines their segmentation. In
this scenario, my work consisted in finding further algorithms for performing
segmentation.
The study can be summarized in three steps: (i) individuating motion parameters that
can be suitable for the segmentation task, (ii) developing algorithms for segmentation
based on the identified parameters, (iii) validating the algorithms on a reference
archive of movements with a particular focus on dance performances.
The first motion parameter I took into account has been Equilibrium, both static and
dynamic, using different sources, from biomechanics to choreography. Static
equilibrium techniques resulted particularly useful in identifying periodic movements
involving feet (like walking, running, etc.) into strokes. Dynamic equilibrium (in our
study, the measure of equilibrium using dynamic cues in different time windows) has
been investigated to evaluate equilibrium states reached from decelerating preceding
phases. The study in this direction only gave preliminary results, as discussed in
chapter 6.4.
Several motion parameters have been considered, all extracted from a single (fixed)
videocamera signal of one dancer on stage, such as PAD (percentage of accelerations
and decelerations), zero-crossings of accelerations and decelerations (Zhao, 2001;
Bindigavanale, 2001), zero crossings of the components of velocity of the “guiding
limb” in the dance. This feature, under certain hypotheses
(see chapter 3), can be
used for segmentation, using the changes of direction followed by the dancer.
The developed algorithms have been implemented as software modules for the
EyesWeb open architecture (coded in Visual C++ language under Microsoft
Windows operating system). The validation of the algorithms has been conducted
using videos recorded in our Laboratory (dancer and choreographer Giovanni Di
Cicco), and from an archive from ZKM (Karlsruhe) on Forsythe’s ballet fragments
on contemporary dance.
Introduction
2 Introduction
The work described in this thesis is in the framework of a more general project on
modelling and analysis of non-verbal communication with a particular focus on
expressiveness in full-body human movement. In particular, we focused on the
development of automated techniques and real-time algorithms for human movement
segmentation.
The problem faced in this thesis originated from the more general problem of
segmenting human movement into coherent “units”, both from the point of view of
the performer of the movement and from the point of view of any observer.
With motion segmentation, in fact, we mean the division of movements’ streams into
phases according to some features or criteria perceived by observers, rather than the
‘separation’ of a human body with respect to the background of which it is a part.
Many scenarios appear to have links and connections with our work, which is
included in the context of multimedia content analysis. In particular the work can
take great advantage from a cross-disciplinary approach and it can highly benefit of
cross-fertilization among scientific and technical knowledge on the one side, and art
and humanities on the other side.
This need of cross-fertilization opens novel frontiers to research in both fields: if
from one hand scientific and technological research can benefit of models and
theories borrowed from psychology (e.g., Krumhansl’s studies (1997)), social
science, art (music and dance, in particular) and humanities, on the other hand these
disciplines can take advantage of the tools technology is able to provide for their own
research, i.e., for investigating the hidden subtleties of human behaviour at a depth
that has never been reached before. (Camurri, Mazzarino, Menocci, Rocca, Vallone
and Volpe, 2004)
Nowadays the relevance of movement and gesture as a main channel of non-verbal
communication is becoming evident, and a growing number of researches are
developing in this direction (e.g., see the Gesture Workshop series of conferences
started in 1996).
From a cross-disciplinary perspective, research on expressive gesture descriptors can
be built on several bases, ranging from biomechanics, to psychology, to theories
coming from performing arts.
For example, in our work we have considered theories from dance and choreography
such as Rudolf Laban’s Theory of Effort (Laban, 1947, 1963), theories from music
and composition (after all, movements in dance performances are orchestrated just
like notes in music), works by psychologists on non-verbal communication in
general (e.g., Argyle, 1980), on expressive cues in human full-body movement (e.g.,
Boone and Cunningham, 1998; Wallbott, 1980; Krumhansl, 2003), biomechanical
works on human body motion, etc.
Introduction
A special focus, anyway, has been devoted on expressive gesture in dance.
In particular, we have studied and integrated two relevant domains: that of dance and
choreography together with that of computer vision and motion analysis techniques.
In fact, humanistic theories from dance and choreography, such as the theory of
Effort by Rudolf Laban
1
, explain – from the artist’s perspective – the non-verbal
gesture language within human movement.
In order to be effective, the approaches have to start from a quite constrained
framework where expressiveness can be exploited to its maximum extent. One such
scenario has been found in dance and it is also a good example for carrying out a task
like that of segmentation.
Dancers and actors are, in most cases, aware of techniques that can be used to
emphasize some movements or gestures and they are also able to evoke, with their
actions, particular emotions or moods, using them at will to convey expressive
contents to the audience. Expressive content concerns aspects related to feeling,
affect and intensity of emotional experience. For example, the same action can be
performed in several ways, by stressing different qualities of movement.
In this manner it is possible to recognize a person from the way he/she walks, but it
is also possible to get information about the emotional state of a person just by
looking at his/her gait, e.g., if he/she is angry, sad, happy (as Pollick stressed in his
papers). (Pollick, June 2001, August 2001)
In cases of gait analysis, we can therefore distinguish among several objectives and
layers of analysis: a first one aiming at describing the physical features of movement,
for example in order to classify it, a second one aiming at extracting the expressive
content that gait coveys
2
, e.g., in terms of information about the emotional state the
walker communicates through his/her way of walking. (Boone and Cunningham,
1998; Pollick, 2001)
For what concerns human motion analysis in its most general aspects, this field is
nowadays receiving increasing attention (from an engineering point of view) by
computer vision researchers. The interest is motivated by a wide spectrum of
applications, such as athletic performance analysis (a), surveillance (b), content-
based image storage and retrieval (c), video conferencing (d), etc.
Aggarwal and Cai (1997) give an overview of the various tasks involved in motion
analysis of the human body, as specified below.
(a) Segmenting
3
the parts of the human body in an image, tracking the movement of
joints over an image sequence, and recovering the underlying 3D body structure is
1
Details are contained in the paragraph titled “The dance field”.
2
Once some meaningful cues are identified, they need to be measured (possibly in real-time) on
the expressive gestures the user performs.
3
In this case, “segmentation” is used with a different meaning from which that will emerge
throughout this dissertation.
Introduction
particularly useful for analysis of athletic performances as well as medical
diagnostics.
(b) The capability to automatically monitor human activities using computers in
security-sensitive areas such as airports, borders, and building lobbies is of great
interest to the police and military.
(c) With the development of digital libraries, the ability to interpret video sequences
in an automatic way will save tremendous human effort in sorting and retrieving
images or sequences of them using content-based queries.
(d) Another kind of multimedia application includes video conferencing, whose pros
and cons, in case of segmentation task, will be underlined later on.
In our study the methodologies for automatic segmentation of human movements
have been analysed with a main focus on multimedia content analysis as a central
application field: in the following pages we report our recent approaches and the
various techniques employed to accomplish the task starting from sequences of
images about dancers.
In order to find the most applicable and pertinent techniques we have first carried out
a research about the existing methods, in order to have a clear idea of the state-of-
the-art in the area probed by our work.
The review has been conducted with the aim to compare and judge the various
studies existing in literature, highlighting (i) why we have taken into account some
methodologies while having rejected others and, for those which resulted close to
ours, (ii) the cases they can be applied to, together with (iii) their main advantages
and drawbacks.
Figure 2.1: the three big areas of human motion analysis addressed by the Aggarwal and Cai
(1997).
Introduction
The figure depicted in the previous page shows the three areas on which human
motion analysis mainly concentrates, according to Aggarwal and Cai’s point of view
(1997): body structure analysis (developed mainly by biomechanics), tracking (for
example for video conferencing scenarios, but also addressed by multimedia content
analysis) and recognition (useful, for example, in the video-surveillance field and
also in case of segmentation based on recognition of single gestures).
In (Aggarwal and Cai, 1997) the two authors also point out that, talking about motion
analysis, there is always a trade-off between feature complexity and tracking
efficiency: lower level features, such as points, are easier to extract, but relatively
more difficult to track than higher-level features such as blobs and 3D volumes. This
has been confirmed by our work: we have concentrated on tracking of points
corresponding to joints or to the Centre of Mass of the body and the algorithm
employed for motion segmentation needs precise tracking of the chosen points.
This is the reason why an accurate manual tracking has sometimes been necessary: to
obtain reliable values of position. We can certainly affirm that getting consistent
values has been one of the bottlenecks of our approach to segmentation.
Another possible way to approach the issue of motion segmentation might be
through considering motion recognition: recognizing a certain movement allows its
distinction among others in a flow of different actions: therefore we may have
segmentation based on recognition of certain moves even if they are meshed with
other unrecognisable movements.
Two typical approaches to motion recognition are addressed in the publication by
Aggarwal and Cai (1997): that based on template matching some given images to
pre-stored patterns (the preferred approach by Aggarwal, who used it in (Aggarwal
and Ali, 2001)) and that based on a state-space models.
To say the truth, we have not used the first approach neither the second, since ours is
based on motion segmentation conceived as a step before or even disconnected with
motion recognition. We have tried to make motion segmentation without any
recognition of the movements.
Our approach is rather an attempt to divide streams of dance movements in phases
according to some kinematical features, with a particular focus on how observers
would execute such a task. In fact, the basic units of movement we can detect (e.g.,
in a dance) and cluster together can be considered and analysed under three
perspectives, sometimes connected one to another: that of the performer (it means
that attention is focused on the physical execution of the movement and expressive
gestures
4
represent items of acted moves), that of the observer (it is based on the
perception of movements) and finally that of the choreographer. This topic is
detailed in the paragraph “The dance field”.
4
The concept of expressive gesture is deepened in the paragraph “Experimental psychology”.
Introduction
It is to say that this distinction should not to be considered in rigid terms: that is,
people who study these mechanisms must necessary be acquainted with all the three
viewpoints, and models of gestures have actually been studied by paying deep
attention to the three perspectives.
In our work the attention has been principally focused on gestures as units of
perceived movement: we have been interested in studying algorithms able to
extrapolate the same (o similar) phases that an observer would individuate.
From the observer’s point of view, segmentation of movements often refers to major
segments of time that people usually describe by single action verbs
5
.
Human activity is a continuous flow of single human action primitives in succession,
just as dance is a continuous occurring in sequences of different “atomic” steps and
movements, sometimes modified by transitions to the neighbouring steps
6
.
When humans move from one type of movement to another, they do so smoothly: in
general transitions are not well defined, there is no clear beginning or end of a
movement; therefore the detection of shifts between them is crucial and this aspect
makes segmentation difficult.
This dissertation is introduced in a field already explored by many researchers (not
only engineers) and touches an actual matter, still open, without a universal solution:
it focuses on the development of paradigms and algorithms to create techniques able
to segment movements automatically, to divide them in elementary events, in
movement strokes.
Having to deal with motion segmentation a spontaneous question may be: how can
motion streams be divided? According to actions, to gestures, to step dances (if
dealing with dance performances) or what more?
So, before introducing the discussion about the employed techniques and the choice
of the main application scenario of our work (that of multimedia content analysis), an
interlude about the differences among action, movement and gesture has been
deemed worthwhile (that is the reason why an entire paragraph, titled “Experimental
psychology”, will be devoted to this aspect).
Here we just want to highlight that, although many experiments about expressive
gesture aim at individuating which motion cues are mostly involved in conveying the
dancer’s expressive intentions to the audience (during a dance performance) and
measuring them in order to classify dance gestures and steps in terms of basic
emotions, we have instead considered gestures by virtue of their physical
characteristics of execution (without any respect to emotional mechanisms) and we
have found that motion phases are similar to those commonly used in case of
investigation of complex and tangled sound patterns
7
.
5
Differences among actions, movements and gestures are highlighted in the paragraph titled
“Experimental psychology”.
6
Analogous considerations are valid for co-articulation in speech.
7
See the paragraph titled “Analogies between motion and music” for details.
Introduction
These phases are preparation, attack, decay, sustain, release and overshoot
(Mazzarino, 2003). Laban, talking about gesture, considers three phases: preparation,
stroke (execution) and retraction (of the arms back close to the body) (Zhao, 2001).
The temporal duration of every phase varies according to its execution: the more
difficult or more precise the execution of a certain movement, the longer its
arrangement (so the duration of the preparation phase).
Posture influences all the three phases (Bindigavanale and Badler, 1998) and an
entire paragraph is devoted to this issue, since postural attitudes influence our way to
move, act and also to be steady.
As in Camurri, Lagerlöf and Volpe’s study our work is focused on dance: in their
paper (2002) the three authors approach the genre of modern dance to find a
nonpropositional style of movements (since the basis of natural body expression can
be developed in terms of dance movements).
We dealt with videos about choreographies of contemporary dance.
A spontaneous question may be: why a dancer? Why not a simple person moving
and doing normal daily activities?
Our attention has been focused on dance, since it is an artistic expression of human
gesture and the field of dance is enriched by many facets
8
.
As already stated, dancers and actors are, in most cases, taught the techniques that
can be used to evoke an emotion and they are able to amplify some gestures, postures
or defects.
They are also able to emphasize some movements or pauses. In fact, not only motion
is important in expressive content communication: pauses also play an overriding
role. During them the body can assume a particular posture and body postures are
often considered as expressive gestures with a relevant function in conveying
expressive content to the audience (Argyle, 1980).
Moreover, we have tried to focus our work on the analysis of motion regarding the
whole body, not just a few joints: literature, instead, is full of researchers who took
into account just parts of the body (Cutler and Turk, 1998; Zobl, Wallhoff and
Rigoll, 2003; Masoud and Papanikolopoulos, 2003).
A main aim for which we have focused on segmentation is that, dealing with movies,
it may sometimes be an impossible or heavy task to extrapolate the necessary
features without dividing the long stream of moves in minor parts.
Within each motion segment, semantic and style information may, then, be extracted.
The problem is: which feature is best suited for motion segmentation? Will it allow
segmenting according to the same criteria that a human observer (who, normally,
notices the changes of actions looking, most of all, at which are the limbs moving, at
how much they are moving, etc.) would use? Should it be just one feature or more,
according to the type of video and to the performance?
8
More details should be taken into account talking about dance: refer to the paragraph titled
“The dance field” for more information.
Introduction
First of all, we have searched for features (one or more) carrying general
information, whose utilization is possible for every kind of moving person (from the
agilest dancer to any patient with motor disease). Kinematical cues accomplish such
aim.
The methods we have tried to employ aim at being independent from the properties
of the body silhouette: that is, we have only concentrated on kinematical cues
(positions, velocities, accelerations and some quantities derived from them), features
that are general and valuable for every kind of person, related just indirectly to the
size or shape of the silhouette. Obviously also kinematical cues depend on the body
mass (i.e., a heavy person might have lower values of velocity than a thinner one),
but the dependence is indirect: that is, none of the calculated cues depends on the
weight or height of the person for which we have evaluated it (this is especially true
if the changes of scale are little, because otherwise we would need a normalization).
This thesis will not be presented as a dissertation divided into parts with the state-of-
the-art as opposed to the new techniques we have introduced: we will rather report
the various and different techniques we took into account, highlighting our
approaches as opposed to “other methods”
9
, for which we will underline the cases in
which they can be applied and when it is not possible.
As often happens, our research has started from naïve hypotheses and the very first
results were quite ambiguous and unclear... probably because of the videos initially
used (Di Cicco
10
) and because of the bibliographic references we have taken into
account
11
.
For this reason, afterwards, we have constrained the scenario and considered other
videos (Forsythe’s cd-rom
12
).
As already stated, comparisons with other possible techniques, even with different
perspectives, have been really useful: some of them have constituted the starting
point of our study, some should represent a direction for future potential researches
since there is not a unique way to proceed in the field of segmentation of human
movement.
9
We have dedicated some paragraphs to the other techniques found in literature and applicable
in some circumstances and for purposes similar to ours.
10
In these videos the dancer uses, for his moves, all the stage, the foreground as the
background: this aspect makes the extraction of features difficult and unreliable for reasons
highlighted in the third chapter. Other researches, instead, worked with videos in which there is
only a lateral viewpoint of the body moving (this is a big limitation).
11
Some techniques we read about seemed to be applicable to our videos for our purposes, but we
have discovered later that our results were really different from those obtained by other
researchers, maybe because of different video resolution.
12
See Chapter 3 for details about them.