by a very large set of behaviors, ranging from speech intensity and intonation to facial expressions,
hand/arm gestures, head and torso movements, posture changes, etc. [4].
While communicating with others, our goal is to (both voluntarily and involuntarily) “transmit”
some information from our mind to the others’ minds. Our emotional and mental states (our beliefs
and goals) can be communicated through a large variety of nonverbal behaviors [2], which influence the
interaction with other people. For example, while greeting someone we can simply say “Hello” and
accompany it by raising an open hand, stretching both arms sideways, smiling slightly, etc. This
great variability in performing nonverbal behaviors is determined by various concurrent factors: our
personality traits, our emotional state, our disposition toward/relation with the other person/event/object [2],
some physical and social constraints (for example, greeting someone in a church is different from
waving at someone in a crowded place), our origins [68], our social role, our idiosyncratic, innate
“way” of behaving, etc. To make them more credible and usable, ECAs have been endowed with the
above characteristics of human nonverbal behavior [83] [68]. Many researchers have attempted to
model some of these aspects in ECAs, considering, for example, the influences on behavior induced
by the agent’s emotional state, personality traits and individualized repertoire of gestures and facial
expressions [1] [7] [31] [56] [69] [71] [79] [88]. All of these works have obtained very positive results,
although modeling all of these aspects together remains a very challenging task. For example, how can
we model an agent that is extroverted, sad, and has a general tendency to raise its eyebrows while
speaking? It is still not clear how to combine these three aspects.
1.2 Problem definition
In his book about bodily communication, Argyle [4] states that there is an underlying tendency
which is constantly present in each person’s behavior: for example, people who gaze more tend to
do so in most situations; that is, there is a certain amount of consistency with the person’s general
tendency. Gallaher [43] found consistencies in the way people behave. She conducted evaluation
studies in which subjects’ behavior style was rated by friends and by self-evaluation. In a first
study, many characteristics of behavior were evaluated: the tendency to use the body, face, head, and
gestures; qualities of movement, such as fast-slow, small-large, smooth-jerky, etc. The person’s behavior tendency
was shown to be an innate individual characteristic that the author claimed to be a personality trait.
In the second study she investigated the consistency of a person’s behavior across time and situations.
Results demonstrated this consistency: people who are quick when writing tend to be quick while
eating; if a person produces wide gestures, then she also walks with large steps. Energy of
movements is also an enduring characteristic, constant over time. Wallbott and Scherer [100] describe
a study of actors’ body movements during the expression of several emotions. A group of people
judged the actors’ behaviors and annotated them. In the study, the authors found
that the way actors portrayed emotional states also seemed to be actor dependent, that is, it depended
on the actor’s personal way of expressing those emotions. Some behavior characteristics also seemed
independent of the emotion: for example, the number of head movements and total activity. Finally,
some actors seemed capable of showing certain emotions better than others, simply because their
behavior style resembled the typical expression of those emotions. Similar results have been reported by
Gross et al. [45], who found that people’s capacity to express their emotions depends, among
other things, on their dispositional expressivity. They showed that, for example, low-expressivity
individuals tend to inhibit negative emotions, while high-expressivity individuals do not.
This behavior tendency is also maintained across long time spans.
All the above studies suggest that the behavior of a person does not depend only on what the
person is communicating, that is, their communicative intention, but also on the person’s
general behavior tendencies, their personal way of behaving. To investigate how this influence can be
replicated in ECAs, we propose the implementation of ECAs exhibiting distinctive behavior.
We say that two ECAs behave in a distinctive way if and only if:
• They behave differently, that is, given the same communicative intention (for example, the agent
is greeting the user) they exhibit at least one of the following kinds of behavior differences:
– in the types and number of signals produced; for example, while saying “Hello”, one agent
will perform a head nod while another raises a hand;
– in the quality of movement of the produced signals; for example, while saying “Hello”,
even if two agents choose to perform a head nod, they will perform it with, for example,
different speed/amplitude/acceleration;
• They maintain their behavior tendencies across time and across situations: that is, if we observe
the way in which two differently defined distinctive ECAs communicate the same communicative
intention (for example, greeting the user), then we are also able to distinguish between these two
agents in other situations (for example, while describing an object, or giving directions, etc.).
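To make this definition concrete, here is a minimal sketch in Python of how the first condition could be checked for two agent definitions. All names (Signal, Agent, the repertoire structure) are hypothetical illustrations, not part of the system developed in this thesis; the second condition (consistency across time and situations) would amount to the same check holding over many different intentions.

```python
# A minimal sketch of the distinctiveness definition above.
# All names and values here are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Signal:
    signal_type: str       # e.g. "head_nod", "hand_raise"
    speed: float = 0.5     # movement quality parameters, in [0, 1]
    amplitude: float = 0.5

@dataclass
class Agent:
    name: str
    repertoire: dict = field(default_factory=dict)  # intention -> list of Signals

def behave_distinctively(a: Agent, b: Agent, intention: str) -> bool:
    """True if, given the same communicative intention, the two agents
    differ in the types/number of signals produced, or in the quality
    of movement of a signal they both produce."""
    sa, sb = a.repertoire[intention], b.repertoire[intention]
    different_signals = [s.signal_type for s in sa] != [s.signal_type for s in sb]
    different_quality = any(
        x.signal_type == y.signal_type
        and (x.speed, x.amplitude) != (y.speed, y.amplitude)
        for x in sa for y in sb
    )
    return different_signals or different_quality

# While saying "Hello", one agent nods, the other raises a hand:
anna = Agent("Anna", {"greet": [Signal("head_nod", speed=0.8)]})
bob  = Agent("Bob",  {"greet": [Signal("hand_raise", amplitude=0.9)]})
assert behave_distinctively(anna, bob, "greet")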
1.3 Objectives
In the previous Section we proposed to implement ECAs exhibiting distinctive behavior. Let us consider
the conceptual diagram in Figure 1.1.
Figure 1.1: A conceptual diagram of the system implemented in this thesis.
In the diagram we represent the connection between the agent’s general tendencies, communicative
intention and behavior. The black box represents a system that receives the first two kinds of data
as input and combines them to determine the agent’s behavior.
In this thesis, we aim to produce an implementation of a system in which the computation of
the agent’s behavior follows the definition of distinctiveness reported in Section 1.2. The diagram of
Figure 1.1 is split in three areas, which correspond to the three main goals of this thesis:
1. Modeling the agent’s general behavior tendencies: we aim to define the agent’s general behavior
tendencies. This will allow us to model, for example, an agent having the tendency to use more
gestures, or facial expressions, etc.
2. Modeling the communicative intention influences: we seek to create a model for defining the
nonverbal behaviors (e.g., gestures, facial expressions, torso movements, etc.) that could be
used by an agent to communicate an intention, such as the intention to describe something, or
to give directions, etc. In our system, the agent’s emotional states are considered to be one of
the communicative intention classes the agent can communicate.
3. Implementing a distinctive behavior generation system: to implement distinctive behavior in
ECAs, we have to describe how the agent’s actual communicative intention may influence the
way in which the agent produces nonverbal behaviors. We have to propose a method to combine
this information with the agent’s general behavior tendencies.
Two additional goals are pursued in this thesis:
4. Evaluating our distinctive behavior generation system: we seek to evaluate the quality of our
model of distinctive ECAs by performing perception tests.
5. Obtaining an extensible system: we aim to create a system which is extensible and
flexible. Our system will allow the addition of factors influencing the agent’s behavior, for
example (as reported in Figure 1.1) the agent’s personality, culture, etc.
1.4 Contributions
The contribution of this thesis is to define, implement and test ECAs exhibiting distinctive behavior:
• We define a model for the ECA’s general behavior tendencies. We represent it with two concepts:
the agent’s preference in using each modality (e.g., an agent may prefer to use mainly its face,
or gestures); the agent’s expressivity of movement, that is, a set of parameters that influence
the amplitude, speed, fluidity, energy and repetitivity of the nonverbal signals produced by the
agent. We define an XML-based representation language that allows us to define the agent’s
behavior tendencies in a global, static way: “agent A has the global tendency to behave in the
way W”. We refer to the agent’s global behavior tendencies, defined with such a language, as
the Baseline.
• We define the influence of the agent’s communicative intention on the ECA’s behavior tendencies.
We do that by defining behavior qualifiers: these allow us to model the modulation of the agent’s
Baseline into the agent’s local behavior tendencies: “agent A, having the tendency to behave in
the way W and with the communicative intention C, has the local tendency to behave in the
way WC”. We refer to the agent’s local behavior tendencies as the Dynamicline. Again, we use
an XML-based language allowing us to add or modify these behavior qualifiers.
• We propose a representation language to describe the mapping between communicative inten-
tions and nonverbal behaviors. A certain communicative intention can be communicated through
several combinations of nonverbal behaviors. For example, to communicate the affirmation “yes”,
we can produce a head nod, raise our thumb up, or perform both behaviors at the same time.
Our language allows us to define all the possible combinations of signals representing a given
communicative intention, called behavior sets. Constraints can be defined over the produced
signals. By using an XML-based language, one can easily modify or extend the behavior sets
associated with each communicative intention.
• We propose a system that, considering an agent with its Baseline and the intention it aims
to communicate, computes the agent’s local behavior tendencies (using the behavior qualifiers
defined above) and determines the nonverbal behaviors the agent has to produce, according to
the behavior sets defined for that communicative intention (a minimal sketch of this pipeline
follows this list).
• We evaluate the quality of our model by performing evaluation studies on the output of our
system. We perform two kinds of evaluation:
– from an objective point of view: we look at the output of the system, to establish if it
reflects the global and local tendencies of the agent;
– from a subjective point of view: we ask human participants to observe and rate the agent’s
behavior.
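The following sketch, referenced in the list above, illustrates how these contributions fit together. It is a simplified Python approximation under assumed qualifier values and a simple matching heuristic; the actual system represents the Baseline, the behavior qualifiers, and the behavior sets in XML-based languages, and its selection algorithm is described in the following chapters.

```python
# A simplified sketch of the Baseline -> Dynamicline -> behavior-set
# pipeline. All concrete values are hypothetical; the real system
# expresses these concepts in XML-based languages.

# Baseline: the agent's global tendencies -- modality preferences
# and expressivity of movement.
baseline = {
    "preference":   {"face": 0.9, "gesture": 0.3, "head": 0.6},
    "expressivity": {"amplitude": 0.4, "speed": 0.7},
}

# Behavior qualifiers: how a communicative intention locally
# modulates the Baseline (hypothetical deltas).
qualifiers = {
    "affirm": {"preference": {"gesture": +0.5},
               "expressivity": {"amplitude": +0.2}},
}

# Behavior sets: the alternative signal combinations that can
# communicate each intention.
behavior_sets = {
    "affirm": [["head_nod"], ["thumb_up"], ["head_nod", "thumb_up"]],
}
signal_modality = {"head_nod": "head", "thumb_up": "gesture"}

def dynamicline(baseline, intention):
    """Compute the local tendencies (Dynamicline) by applying the
    intention's qualifiers to the Baseline, clamping to [0, 1]."""
    local = {group: dict(values) for group, values in baseline.items()}
    for group, deltas in qualifiers.get(intention, {}).items():
        for key, delta in deltas.items():
            local[group][key] = min(1.0, max(0.0, local[group][key] + delta))
    return local

def select_signals(intention, local):
    """Pick the combination from the behavior set whose modalities
    best match the agent's local modality preferences."""
    prefs = local["preference"]
    return max(behavior_sets[intention],
               key=lambda combo: sum(prefs[signal_modality[s]] for s in combo))

local = dynamicline(baseline, "affirm")
print(select_signals("affirm", local))  # ['head_nod', 'thumb_up']
```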
1.5 Thesis outline
In the next Chapter we give an overview of the background concepts we refer to in the thesis. We
propose a definition of Embodied Conversational Agents and of nonverbal communication: we distin-
guish between the communicative intention, that is, what we aim to communicate, and the nonverbal
signals, that is, the movements and gestures we produce with our body in conveying a particular
intention.
We split the rest of the thesis into two parts. The first part is devoted to the description of our
system for creating distinctive ECAs. In Chapter 3 we give an overview of the system by determining
where our work is located in the general framework of an agent interacting with humans or other
agents. Our work concerns the process of generating the agent’s behavior from its
communicative intention. We implement our system in the framework of the Greta ECA. In Chapter
4 we describe the languages we defined to model the agent’s communicative intention and nonverbal
behavior. In Chapter 5 we present our definition of the agent’s global behavior tendencies: an agent has
a certain degree of preference in using each of its communicative modalities (i.e., face, head, gestures,
gaze, torso) and a certain qualitative way of performing nonverbal behaviors (i.e., speed, amplitude,
energy, fluidity, and repetitivity of movement). We explain how we modulate these general tendencies
depending on the agent’s communicative intention, to obtain the agent’s local behavior tendencies.
We introduce a notation for both of these concepts and the process of computing local from global
tendencies. In Chapter 6 we focus on the generation of multimodal signals starting from the agent’s
local behavior tendencies. Again, we introduce a notation to represent this correspondence and we
explain how we use it to perform the computation of the agent’s behavior. In Chapter 7 we describe
our system for the generation of distinctive behaviors in ECAs by putting together the concepts and
modules presented in Chapters 5 and 6. In Chapter 8 we present a study we conducted to evaluate
our system. We have performed two kinds of evaluation: an objective evaluation, in which we check
if the output of our system is objectively the one we expected (that is, for example, if the Baseline of
the agent really influences the multimodal signals chosen by our system); and a subjective evaluation,
in which we have conducted a perceptual test by asking participants to look at animations of the
Greta agent and to evaluate them on the basis of the multimodal signals produced by the agent. We
conclude this part of the thesis with Chapter 9 in which we present an overview of other ECA systems
exhibiting variable behaviors: some of them vary the agent’s behavior by modeling its emotional state
or personality profile; others assign a specific repertoire of nonverbal signals; others allow one to vary
the performed nonverbal signals by changing their quality of execution.
In the second part of the thesis we describe two application scenarios for our system: in the first
scenario the agent’s behavior is determined by a musical performance provided as input; in the second
one the agent mimics the quality of the movements performed by a human who is being filmed with a
camera. In both cases we include a description of the system implementation specific to that scenario
and an evaluation study.
Chapter 2
Background
2.1 Nonverbal communication
Nonverbal communication is an essential element of human-human communication, together with
the verbal message. By nonverbal communication, we refer to all the communicative signals, with
the exception of spoken words, that are produced by actions such as facial expressions, hand/arm
gestures, posture shifts, and so on, and also through the voice, by variations in volume, pitch, and
speed of speech.
The general paradigm of nonverbal communication, as explained by Argyle [4], can be described
as follows. Consider two entities, A and B. A’s intention, or goal, is to communicate his state (e.g. his
emotional state) to B. Nonverbal communication consists in encoding A’s intention into nonverbal
behaviors, or signals, which may be decoded by B, not necessarily as intended by A. Following the
definition of Duncan [34], nonverbal communication consists in the transmission of a mental repre-
sentation from one person to another via bodily gestures. In Figure 2.1 we represent the sender (A)
on the left side of the diagram and the receiver (B) on the right side. In the diagram we highlight
the process of transmission of the sender’s intention to the receiver, who interprets it. As
highlighted by the arrow going back from the receiver to the sender, nonverbal communication always
needs feedback, or a reaction, from the receiver in order to ensure that the information sent by the sender
has been successfully received and/or understood.
Figure 2.1: The paradigm of nonverbal communication [4].
The relation between communicative intentions and nonverbal signals is in general a many-to-many
relation [2]:
• a given meaning can be communicated in a large variety of ways through nonverbal signals (to
say a simple “ok” I can nod, show a thumbs-up, etc.);
• the same signal can serve many intentions (for example, raising the palm of the hand may signify
“hello”, or “stop”, etc.).
As an example of the complexity of the problem, let us consider a scenario proposed by Allwood
in [2]: suppose we produce a simple verbal yes. We can accompany it in many different nonverbal
ways, for example by nodding with the head. In this case, the global communicated meaning is one
of affirmation. If, instead of nodding, we choose to raise our eyebrows, the communicated meaning
could be surprise, or maybe doubt.
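A tiny sketch in Python of this many-to-many relation follows; the entries are illustrative examples, not an inventory drawn from the literature.

```python
# Each meaning maps to several alternative signal combinations...
meaning_to_signals = {
    "affirmation": [["head_nod"], ["thumb_up"], ["head_nod", "thumb_up"]],
    "greeting":    [["palm_raise"], ["smile", "palm_raise"]],
    "stop":        [["palm_raise"]],
}

# ...and inverting the table shows that one signal can serve
# several intentions.
signal_to_meanings: dict[str, set[str]] = {}
for meaning, combos in meaning_to_signals.items():
    for combo in combos:
        for signal in combo:
            signal_to_meanings.setdefault(signal, set()).add(meaning)

print(signal_to_meanings["palm_raise"])  # {'greeting', 'stop'}
```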
The way in which we choose some signals instead of others to convey a given communicative
intention is variable, and depends on factors like: the relation/disposition between the information
sender and receiver; the sender’s and receiver’s personalities and emotional states, their culture, social
role and idiosyncratic habits; the environmental conditions.
2.2 Nonverbal signals
In this thesis we consider four categories of signals, identified by the following modalities: gesture, face,
torso and head. For each modality we will now briefly describe: how the related signals are physically
(i.e., how the involved parts of the body are configured) and temporally performed; and which
communicative intentions can be communicated through that modality, and how (we will talk in this case of
the communicative function of a certain class of signals, e.g. gestures).
2.2.1 Gestures
In this Section we provide an overview of communicative gestures. These are the nonverbal signals
produced with the hands/arms accompanying speech. The following description mainly refers to the
works of McNeill [67] and Kendon [54]. Gestures are produced continuously while we speak. We use
gestures to represent objects as well as abstract concepts, to indicate concrete places, to give emphasis
to the most important part of the discourse, and so on. While we describe objects for example, we
gesticulate for approximately three quarters of the total speaking time. In this thesis we will focus on
those gestures which are produced along with speech with the goal of communicating some intention
in the speaker’s mind. Of course there are categories of gestures, other than the communicative ones,
which are used in other situations: pantomimes, for example, in which gestures almost completely
replace speech, and sign languages, used by deaf people to communicate. Following McNeill’s theory,
there are four major types of gestures occurring with speech:
• Iconics: these are related to the semantic content of the speech. The shape of the hands/arms
and/or their movement trajectories are used to visualize some concrete characteristic of the
uttered speech. For example, while referring to a box which was on the table, we can depict
a square shape with our hands or fingers (by moving the index fingers along a square path)
while we pronounce the sentence “did you see the cards box?”.
• Metaphorics: they are similar to iconics, but they are used to describe some abstract concept
by a movement or a particular hand/arm configuration. For example, if we refer to a person
to say that he/she is the owner of a shop, we can underline the concept of owning something
with the following gesture: arms along the body, with a flexion of ninety degrees at the elbows
(forearms toward the listener), palms up, hands open; we bend the fingers while pronouncing
the word owner. In this example, we want to give the idea of owning a shop by grasping an
invisible object (the shop) with our hands; at the same time, our hands are ideally placed
below that object, to suggest the idea that we support/own/take care of it.
• Beats: they do not have a form related to the semantic content of speech; they are usually short
and quick movements repeated rhythmically along with the speech accents. Typical beats are short
movements of the hand or fingers up and down. These gestures are important because they
help the speaker to mark some characteristics of the spoken sentences. For example, beats are
produced when discussing new or important themes, or to emphasize the uttered speech.
• Deictics: they are pointing gestures. They can point to concrete objects or highlight positions in
the world, but very often they refer to abstract concepts and parts of the discourse. For example,
pointing down is usually used to refer to the topic of the discourse. Deictics can also be differentiated
by the hand shape used to show objects: an extended finger is used to point to a single object;
an open hand is used to indicate multiple objects; a hand shape in which the middle two fingers
are flexed with the index and little finger extended can be used to give directions or show a
route.
The timing of gesture
The execution of a gesture can be divided into temporal segments or phases which are temporally
linked to the uttered speech. In Figure 2.2 we represent the wrist position over time during the execution
of a gesture in which the hand is lifted up and moved down.
Figure 2.2: Temporal segmentation of the gesture execution.
Let us use this example to illustrate what happens while we perform a gesture. The hand moves
from its initial rest position to reach the position in space in which the gesture will be produced
(preparation phase); movement stops for a short while (pre-stroke hold phase); the gesture movement
is performed (stroke phase); movement stops for a short while (post-stroke hold phase); the hands
go back to their initial rest position, or to another rest position (retraction phase). The stroke
phase represents the moment in which the “expression” of the gesture is accomplished. Preparation
and retraction are optional phases, but the stroke must always be present during the execution of a
gesture. The two hold phases (pre-stroke and post-stroke hold) have different roles in the execution
of a gesture. The pre-stroke hold is used to synchronize the stroke with speech. During preparation,
the gesture is set up: the hand arrives at the position in space preceding the stroke, and the hold phase
allows one to perform the stroke exactly at the right time (see below for discussion on the stroke
synchronization). The post-stroke hold allows one to extend the meaning conveyed by the stroke for
the duration of the hold.
The way in which the stroke is synchronized with speech depends on the type of gesture being
performed and on the meaning to be conveyed [67]. As a general rule, the stroke of the gesture always
precedes or ends with the peak syllable of the speech. The preparation phase, if present, precedes the
stroke and could also start during the sentences preceding the one associated with the gesture. Sometimes
this occurs so that the arms/hands are ready to perform the stroke at the right time. In other cases,
anticipating the preparation can mean, for example, that we are already formulating the following sentence
(the one carrying the most stressed syllable of speech) while still uttering the previous ones. In any case,
the stroke must occur no later than the accented syllable of speech. Iconic or metaphoric gestures,
instead, could present a variation of this rule: the stroke or multiple repetitions of the stroke happen
while the speech containing the idea depicted by the iconic gesture is uttered (semantic synchrony
rule [67]).
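As a worked illustration of this phase structure and of the synchronization rule, the sketch below schedules a gesture so that its stroke ends exactly at the stressed syllable. The durations and the scheduling heuristic are hypothetical example values; real gesture timing depends on the gesture type and the meaning conveyed, as discussed above.

```python
# A sketch of gesture phase scheduling under the rules above.
# Durations are hypothetical example values (in seconds).
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    start: float
    end: float

def schedule_gesture(stressed_syllable_time: float,
                     prep: float = 0.4, pre_hold: float = 0.1,
                     stroke: float = 0.3, post_hold: float = 0.2,
                     retract: float = 0.4) -> list[Phase]:
    """Place the phases so that the stroke ends at the stressed
    syllable, respecting the rule that the stroke must occur no
    later than the accented syllable of speech."""
    s_end = stressed_syllable_time
    s_start = s_end - stroke
    return [
        Phase("preparation",      s_start - pre_hold - prep, s_start - pre_hold),
        Phase("pre-stroke hold",  s_start - pre_hold, s_start),
        Phase("stroke",           s_start, s_end),            # obligatory phase
        Phase("post-stroke hold", s_end, s_end + post_hold),
        Phase("retraction",       s_end + post_hold, s_end + post_hold + retract),
    ]

for p in schedule_gesture(stressed_syllable_time=1.5):
    print(f"{p.name:16} {p.start:.2f}-{p.end:.2f} s")
```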
2.2.2 Facial expressions
Facial expressions are movements performed by contracting the muscles of the face. Facial signals
have many functions in nonverbal communication. They are used to show our emotional state, our
beliefs and goals, our attitudes or opinions towards people (or objects, places, situations, etc.), to
regulate the flow of conversation, and so on.
Many researchers agree on the theory that facial expressions are the main means of communicating
our emotional states [38] [4] [2] [58]. When showing our emotional state, there is a set of facial
expressions that do not vary across many cultures in the world [38]. These expressions are associated
with the emotional states of: happiness, surprise, fear, sadness, anger and disgust. When we are sad,
for example, we raise the inner part of the eyebrows, while the outer part is lowered; at the same time
we press down the outer corners of our mouth. When showing anger we pull down and draw together
the eyebrows; we raise our upper eyelids and we square and tighten our lips. While we speak, we
always have a goal [5] [80]: to inform the interlocutor about something, warn him, suggest something,
ask him to perform some action, etc. Facial expressions can be used to aid the communication of
these types of information: for example, raising the eyebrows when suggesting something, or making
a small frown when warning or thinking [80]. Facial expressions are also used in regulating the flow of
conversation. For example, we can open our mouth to try to interrupt other people speaking and to
show that we want to speak, or we can use a smile to encourage others to engage in a conversation.
Some researchers have found that specific facial displays are associated with conveying our attitude,
evaluation or opinion about people, objects, events, etc. These displays are called personal reactions
by Chovil in [27]. For example, wrinkling the nose (as in the emotional emblem of “disgust” [37]) can be
used to show dislike or disapproval; the same meaning can be conveyed by raising the eyebrows,
raising one side of the upper lip, and squinting the eyes. Some facial displays are used in coincidence
with the syntactic punctuation of the text we are pronouncing (commas, exclamation or question
marks, pauses, etc.). For example, eyebrow raising is usually used to underline exclamations, while
eyebrow lowering is placed at juncture pauses. Finally, the face is used by the listener to give feedback
to the speaker. The listener smiles to show approval of what the speaker is saying; or he can furrow his
eyebrows to signal that he is not understanding, and so on. In other cases, we use facial expressions
to complement or substitute speech. For example, an ironic face (for example, wrinkling the nose
as when showing “disgust”) while saying “you look nice today” allows us to communicate that we think
exactly the opposite [80]. Sometimes, instead, facial expressions emphasize what is going to be
said. Raising or lowering the eyebrows is the most common emphasis signal conveyed through facial
expression [36] [27]. Less frequently, emphasis is accompanied by eyes widening or tightening. These
emphasizing signals can be compared to the beats of the gesture modality as they occur while the
corresponding stressed syllable is pronounced (see Section 2.2.1). When facial expression completely
replaces speech, we talk about facial emblems. These expressions have direct verbal translations [38]
that are common to the individuals of a certain cultural group. In some cultures, such as the U.S.,
they are mainly produced with the eyebrows. An example of a facial emblem is the eyebrow flash (a
quick repeated eyebrow-raising movement), which is part of a greeting signal in cultures of New Guinea.
Sometimes, parts of the facial displays associated with emotional states are used as emblems. For
example, smiling is an emblem used as a greeting in many cultures, and the open mouth of the surprise
expression can be used for the same purpose as the verbal “wow!”.
Temporally speaking, the execution of a facial expression is usually split into three phases: the
temporal interval in which the muscles contract is called the onset; after that, the facial expression is shown
on the face for a certain time interval, called the apex; finally, the facial expression disappears from
the face during the offset phase. Figure 2.3 shows the trapezoid representing the level of activation of
a facial expression over time.
Figure 2.3: Temporal execution of a facial expression.
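The trapezoidal profile of Figure 2.3 can be written as a simple piecewise-linear function. The sketch below is an illustration with hypothetical phase durations, not the exact parameterization used in this thesis.

```python
# A minimal sketch of the trapezoidal activation profile in
# Figure 2.3; phase durations are hypothetical example values.
def activation(t: float, onset: float = 0.3, apex: float = 1.0,
               offset: float = 0.5) -> float:
    """Activation level in [0, 1] of a facial expression at time t:
    linear rise during onset, full intensity during apex, linear
    decay during offset, zero elsewhere."""
    if t < 0:
        return 0.0
    if t < onset:                      # muscles contract
        return t / onset
    if t < onset + apex:               # expression held on the face
        return 1.0
    if t < onset + apex + offset:      # expression disappears
        return 1.0 - (t - onset - apex) / offset
    return 0.0

# Sample the profile every 0.2 s:
print([round(activation(i * 0.2), 2) for i in range(11)])
```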
2.2.3 Torso movements
Torso movements in nonverbal communication have not been widely investigated. Until now, re-
searchers have mainly focused on posture (how we change posture during conversation) and the body
in general (e.g., jumping if we are happy). With the word torso we refer to the upper part of the
body, which we may also call the trunk, including the shoulders (in the sense of shoulder alignment and
rotation). For example, “rotating the shoulders away from the listener” can be considered a torso
movement, and “bowing with the trunk to greet someone” is again a torso signal. Therefore, we refer to
other works in which posture shifts have been investigated, even if a posture change involves not
only the trunk but the whole body configuration. Cassell et al. [22] have observed some monologues
and transcribed posture shifts which occurred at discourse turn and segment boundaries. They did
not code the shifts occurring as part of a whole body gesture, for example when changing posture for