model and segment these subunits, and then learn the informative combinations
of subunits/features using a boosting framework. Our results achieved a recognition
rate above 90% using very few training samples.
Chapter 1
Introduction
1.1 Introduction
In daily life, human beings communicate with each other and interact with computers
using gestures. As a kind of gesture, sign language (SL) is the primary communi-
cation medium for deaf people. Every day, millions of deaf people all over the world
use SL to obtain useful information and exchange ideas. Therefore, in recent
years, SL recognition has gained a lot of attention and a variety of solutions have
been proposed. A sign gesture can be treated as a composition of hand shape, mo-
tion, position, and facial expression; thus, SL recognition requires knowledge of all
of these. Generally, an SL recognition system should contain three major modules:
skin segmentation and tracking (SST), feature extraction, and recognition. The first
module acquires and locates the hands and face across the video frames, the second
prepares useful features for classification, and the third recognizes the signs from
those features.
Fig. 1.1 shows a general architecture overview of an SLR system.
Based on the segmented hands and face, we can extract hand shape, orientation, and
facial expression features. By analyzing the tracked skin objects, we obtain the
hand motion trajectories, hand position, and lip movement. Finally, classifiers are
trained to recognize the signs.
Figure 1.1: System architecture
1.2 Device-based vs. vision-based approaches to SLR
According to the means of capturing features, SL recognition techniques can be clas-
sified into two groups: glove-based and vision-based. The former requires users to
wear data gloves or coloured gloves, which enables the system to avoid or simplify
the segmentation and tracking task. However, its disadvantages are apparent. On
the one hand, users have to wear a hardware device, which is uncomfortable, and
sometimes they cannot perform accurate gestures with the gloves on. On the other
hand, glove-based methods may lose the facial expression information, which is also
very important for SL recognition.
In comparison, vision-based methods rely on computer vision techniques without
needing any gloves, which is more natural for users. However, one difficulty is how
to accurately segment and track the hands and face. SST plays an important role in
vision-based SL recognition: only after the skin objects have been acquired can use-
ful descriptions such as hand shape, motion, and facial expression be extracted and
recognition performed. In other words, SST is the cornerstone of SL recognition. To
produce high-quality SST, two techniques must be developed: a powerful skin colour
model and a robust tracker. The skin colour model offers an effective way to detect
and segment skin pixels; it should be able to handle illumination and human skin
variations. The tracker is responsible for locating the skin objects. For SL recognition,
it should be capable of predicting the occlusions that frequently happen in real-world SL
conversations. The purpose of occlusion detection is to keep track of the status of
the occluded parts, which helps to reduce the search space in the recognition phase.
1.3 Overview of the proposed SLR system
This work aims to provide an SST framework for SL recognition; given that the
required features can then be acquired, we propose a novel solution for SLR based
on boosting SL subunits. To achieve precise skin segmentation, we introduce a novel
skin colour model that integrates SVM active learning with region segmentation.
The model consists of two stages: a training stage and a segmentation stage. In the
training stage, a generic skin colour model is first applied to the first few frames of
the given gesture video to obtain the initial skin areas. Afterwards, a binary
classifier based on SVM active learning is trained using these initial skin areas
as the training set. In the segmentation stage, the SVM classifier is combined
with region information to yield the final skin colour pixels. The contribution
that distinguishes the proposed model from existing skin colour algorithms is
twofold. First, the SVM classifier is trained on data automatically collected from
the first several video frames, so no human labour is needed to construct the
training set. More importantly, the training is performed for every video sequence,
which makes the model adaptive to different human skin colours and lighting condi-
tions; the skin colour model can also be updated with the help of tracking to deal
with illumination variation. Second, region information is adopted to reduce the
effects of noise and illumination variation. Moreover, active learning is employed to
select the most informative training subset for the SVM, which leads to fast
convergence and better performance.
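To make the two-stage pipeline concrete, the following Python sketch (using NumPy,
SciPy, and scikit-learn) bootstraps labels with a generic rule-based skin model,
grows the SVM training set by margin-based active learning, and filters the
per-pixel predictions with simple region information. The generic rule, the
sampling sizes, and the connected-component filter are illustrative stand-ins, not
the exact components developed in this thesis.

    import numpy as np
    from scipy.ndimage import label
    from sklearn.svm import SVC

    def generic_skin_mask(frame):
        # Generic rule-based skin model (the widely used Peer et al. RGB rule);
        # frame is an (H, W, 3) uint8 RGB image.
        r, g, b = (frame[..., i].astype(int) for i in range(3))
        return ((r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) &
                (np.abs(r - g) > 15) &
                (frame.max(axis=-1).astype(int) - frame.min(axis=-1) > 15))

    def train_adaptive_svm(first_frames, rounds=3, batch=200, seed=0):
        # Training stage: pixels from the first few frames, labelled by the
        # generic model, form the pool; active learning repeatedly adds the
        # pixels closest to the current SVM decision boundary.
        X = np.vstack([f.reshape(-1, 3) for f in first_frames]) / 255.0
        y = np.concatenate([generic_skin_mask(f).ravel() for f in first_frames])
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=batch, replace=False)   # seed subset
        svm = SVC(kernel="rbf", gamma="scale")
        for _ in range(rounds):
            svm.fit(X[idx], y[idx])          # assumes both classes are present
            margin = np.abs(svm.decision_function(X))
            idx = np.union1d(idx, np.argsort(margin)[:batch])
        return svm

    def segment_frame(svm, frame, min_region=50):
        # Segmentation stage: per-pixel SVM decision, then region information
        # (dropping tiny connected components) to suppress noise.
        mask = svm.predict(frame.reshape(-1, 3) / 255.0).reshape(frame.shape[:2])
        regions, _ = label(mask)
        sizes = np.bincount(regions.ravel())
        return mask & (sizes[regions] >= min_region)

In the full system, the SVM would additionally be retrained or updated as tracking
provides fresh skin samples under changing illumination.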
As for the tracker, we extend the previous work of our group in three ways. First,
the previous work used a colour glove to avoid the segmentation issue, whereas in
this work we are more interested in improving SL recognition in natural conversation:
three features, skin colour, motion, and position, are fused to perform accurate skin
object segmentation. Additionally, the previous work tracked only the two gloved
hands, while the proposed work can segment and track the two hands and the face;
the obtained face information can clearly facilitate recognition. Second, we apply a
Kalman filter (KF) to predict occlusions in the same way as the previous work, but
our KF is based on skin colour instead of a colour glove. Third, in the proposed
work, tracking and segmentation are approached as one unified problem in which
tracking helps to reduce the search space used in segmentation, and good segmenta-
tion in turn enhances tracking performance.
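A minimal sketch of the occlusion-prediction idea follows, assuming a constant-
velocity Kalman filter over each skin object's centroid; the noise covariances and
the distance-based overlap test are illustrative choices, not the tuned parameters
of the proposed tracker.

    import numpy as np

    class SkinObjectKF:
        # Constant-velocity Kalman filter on a skin object's centroid;
        # state s = [x, y, vx, vy].
        def __init__(self, x, y, dt=1.0):
            self.s = np.array([x, y, 0.0, 0.0])
            self.P = np.eye(4) * 10.0                       # state covariance
            self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                               [0, 0, 1, 0], [0, 0, 0, 1]], float)
            self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
            self.Q = np.eye(4) * 0.1                        # process noise
            self.R = np.eye(2) * 2.0                        # measurement noise

        def predict(self):
            # Project the state one frame ahead; returns the predicted centroid.
            self.s = self.F @ self.s
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.s[:2]

        def update(self, z):
            # Correct the prediction with the measured centroid z = (x, y).
            innov = np.asarray(z, float) - self.H @ self.s
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.s = self.s + K @ innov
            self.P = (np.eye(4) - K @ self.H) @ self.P

    def occlusion_expected(pred_a, pred_b, radius=30.0):
        # Flag an imminent occlusion when two predicted centroids come closer
        # than twice an (assumed) object radius, so the tracker can maintain
        # the status of the occluded parts instead of losing them.
        return np.linalg.norm(np.asarray(pred_a) - np.asarray(pred_b)) < 2 * radius

Per frame, each tracker's predict() runs first; if occlusion_expected fires for a
pair of objects (e.g. a hand and the face), the merged blob can be handled jointly,
and update() is called with the measured centroids otherwise.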
Despite the great deal of effort in SLR so far, most existing systems achieve good
performance only with small vocabularies or gesture datasets. Increasing the vocab-
ulary inevitably incurs many difficulties for training and recognition, such as the
large required training set and signer variation. To reduce these problems, some
researchers have proposed decomposing signs into subunits. In contrast with tradi-
tional systems, this idea has the following advantages. First, the number of subunits
is much smaller than the number of signs, which leads to a small sample size for
training and a small search space for recognition. Second, subunits build a bridge
between low-level hand motion and high-level semantic SL understanding. In lin-
guistics, a subunit is generally considered to be the smallest contrastive unit in a
language, and a number of researchers have provided evidence that signs can be
broken down into such elementary units. However, in the computer vision field
there is no generally accepted conclusion yet about how to model and segment
subunits.
This work investigates the detection of subunits from the viewpoint of human motion
characteristics. We model a subunit as a continuous hand action in time and space:
a motion pattern that covers a sequence of consecutive frames with interrelated
spatio-temporal features. Based on this model, we integrate hand speed and trajec-
tory to locate subunit boundaries. The contribution of our work lies in three points.
First, our algorithm is effective without needing any prior knowledge, such as the
number of subunits within one sign or the type of sign. Second, the trajectory of
hand motion is incorporated so that the algorithm does not rely on clear pauses,
as some previous related work does. Finally, because of the use of an adaptive
threshold in motion discontinuity detection and refinement by temporal clustering,
our method is more robust to noise and signer variation.
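The boundary-detection idea can be sketched as follows in Python, assuming the
trajectory is an (N, 2) array of per-frame hand centroids; the fraction used for the
adaptive speed threshold, the turning-angle threshold, and the clustering gap are
illustrative values, not the settings derived later in this thesis.

    import numpy as np

    def subunit_boundaries(traj, speed_frac=0.4, angle_thresh=np.pi / 3, min_gap=5):
        # Candidate boundaries are frames where speed drops below an adaptive
        # threshold (a fraction of the sequence's mean speed) or where the
        # motion direction turns sharply. Returned indices refer to the
        # velocity sequence (index i = motion between frames i and i + 1).
        v = np.diff(traj, axis=0)                      # per-frame displacement
        speed = np.linalg.norm(v, axis=1)
        slow = speed < speed_frac * speed.mean()       # adaptive speed threshold

        ang = np.arctan2(v[:, 1], v[:, 0])
        turn = np.abs(np.angle(np.exp(1j * np.diff(ang))))  # wrapped angle change
        sharp = np.concatenate([[False], turn > angle_thresh])

        candidates = np.flatnonzero(slow | sharp)
        # Temporal clustering: merge candidates closer than min_gap frames,
        # keeping one boundary (the slowest frame) per cluster.
        boundaries = []
        cluster = [candidates[0]] if len(candidates) else []
        for f in candidates[1:]:
            if f - cluster[-1] <= min_gap:
                cluster.append(f)
            else:
                boundaries.append(min(cluster, key=lambda i: speed[i]))
                cluster = [f]
        if cluster:
            boundaries.append(min(cluster, key=lambda i: speed[i]))
        return boundaries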
After segmenting the SL subunits, we attempt to develop an effective SLR system
using the AdaBoost algorithm, which learns the informative subunit and feature
combinations needed to achieve good classification performance. To the best of our
knowledge, very little work has been done using AdaBoost in SLR. We present two
variations for learning boosted subunits: in the first, the sign classes are trained
independently; in the second, they are trained jointly, which permits the classes to
share weak classifiers and increases the overall performance.
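A minimal sketch of the independently trained (one-vs-all) variant follows, in which
discrete AdaBoost selects from a pool of subunit-level weak classifiers. Here each
weak classifier is simply a function mapping a sign sample to +1/-1 (in the thesis
the weak learners are subunit classifiers, e.g. HMM-based), and the round count is
arbitrary; the joint-training variant, which lets several sign classes share weak
classifiers, is omitted for brevity.

    import numpy as np

    def adaboost(weak_pool, X, y, rounds=50):
        # X: list of sign samples; y: array of +1/-1 labels for one sign class.
        y = np.asarray(y)
        preds = np.array([[h(x) for x in X] for h in weak_pool])  # pool x samples
        n = len(X)
        w = np.full(n, 1.0 / n)                     # sample weights
        ensemble = []
        for _ in range(rounds):
            errs = (w * (preds != y)).sum(axis=1)   # weighted error per learner
            best = int(np.argmin(errs))
            err = errs[best]
            if err >= 0.5:                          # no remaining learner helps
                break
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # learner weight
            w = w * np.exp(-alpha * y * preds[best])           # emphasize mistakes
            w = w / w.sum()
            ensemble.append((alpha, weak_pool[best]))
        return ensemble

    def classify(ensemble, x):
        # Sign of the weighted vote of the selected subunit classifiers.
        return np.sign(sum(alpha * h(x) for alpha, h in ensemble))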
The presented work enables us to efficiently recognize SL with a large vocabulary
using a small training dataset. One important advantage of our algorithm is that
it is inspired by human signing behaviour and recognition ability, so it works in
a manner analogous to human recognition. Experiments on real-world signing videos
and a comparison with classical HMM-based weak classifiers demonstrate the
superiority of the proposed work.
In this thesis, we aim to provide new techniques that can be applied in SLR
applications. Our goal is to contribute to research in skin segmentation, hand and
face tracking, and efficient sign modelling and recognition based on informative
subunits of the signs, inspired by how humans perform and recognize signs.
1.4 Overview of the Thesis
The next chapter reviews the literature on the different SLR systems proposed by
various research groups.
Chapter 3 gives a review of current skin segmentation techniques and discusses
our proposed skin segmentation algorithm with various evaluation results.
Chapter 4 introduces our proposed SST system and provides some experimental
results for skin segmentation and tracking.
Chapter 5 introduces the subunit modelling and segmentation algorithm and ends
with some evaluation experiments.
Chapter 6 introduces our SLR system based on learning boosted subunits and
presents the experimental results of the classification.
Chapter 7 concludes with a summary and gives some directions for future work.
Chapter 2
Sign Language Recognition: Literature Review
2.1 Introduction
In taxonomies of communicative hand/arm gestures, Sign Language (SL) is often
considered the most structured form of gesture, while gestures that accompany
verbal discourse are described as the least standardized. SL communication also in-
volves non-manual signals (NMS) through facial expressions, head movements, body
postures, and torso movements [Ong and Ranganath 05].
SLR therefore requires observing these features simultaneously, together with their
synchronization and information integration. As a result, SLR is a complex task,
and understanding it involves great effort in collaborative research on machine
analysis and understanding of human action and behaviour; for example, face and
facial expression recognition [Kong et al. 04, Pantic and Rothkrantz 00], tracking
and human motion analysis [Gavrila 99, Wang et al. 03], and gesture recognition
[Pavlovic et al. 97].
As non-SL gestures often consist of small, limited vocabularies, they are not a useful
benchmark for evaluating gesture recognition systems. SL, on the other hand, offers
a good benchmark for evaluating different gesture recognition systems because it
consists of large and well-defined vocabularies that can be hard for different systems
to disambiguate.
In real life, we can imagine many different useful applications for SLR, such as:

• sign-to-text/speech translation or dialogue systems for use in specific public
domains such as airports, post offices, or hospitals
[McGuire et al. 04, Akyol and Canzler 02];

• video communication between deaf people, where, instead of sending live video,
SLR can translate the video into notations that are transmitted and then
animated at the other end to save bandwidth [Kennaway 03];

• annotating sign videos [Koizumi et al. 02] for linguistic analysis, saving a lot
of the human labour of manually ground-truthing the videos.
SL gesture data is mainly acquired using cameras (vision-based) or sensor devices
(glove-based) [Sturman and Zeltzer 94]. We are interested here in the vision-based
approach, as the glove-based approach has the limitation of being an unnatural way
of performing signs; although a glove can greatly simplify the tasks of segmentation
(especially in the presence of occlusions) and tracking, it ignores the fact that facial
expression is needed as an important feature. In the next sections, we summarize
the related work done by different research groups in SLR, covering the three main
tasks of hand detection and tracking, feature extraction, and classification.
2.2 Hand detection and tracking
In almost all SLR systems, the hand(s) must be detected in the image sequence,
usually based on features like colour, motion, and/or edges. The colour cue is
exploited through skin colour detection or colour gloves, as in [Sweeney and Downton 96,
Sutherland 96, Bauer and Kraiss 02, Assan and Grobel 97, Bauer and Kraiss 01].
When skin colour is used, the user is usually required to wear long sleeves so that
the skin of the arm is not detected. Skin colour was combined with a motion cue in
[Akyol and Alvarado 01, Imagawa and Igi 98, Yang et al. 02] and with edge infor-
mation in [Terrillon et al. 02]. Different assumptions were used to distinguish the
hands from the face, such as the head being relatively static compared to the hands
[Akyol and Alvarado 01, Imagawa and Igi 98] or the head being bigger than the
hands [Yang et al. 02].
A common requirement for the motion cue is that the hand must be continuously
moving, as in [Huang and Jeng 01], where the hand was detected by logically AND-
ing difference images with edge maps and skin-colour regions (a simple version of
this combination is sketched after this paragraph). In [Cui and Weng 00,
Cui and Weng 99], a hierarchical nearest neighbour decision rule was used to map
partial views of the hand to previously learned hand contours to obtain an outline
of the hand.
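As an illustration of this cue combination, the following OpenCV sketch ANDs a
thresholded difference image, a dilated edge map, and a skin-colour mask; the
thresholds and the YCrCb skin range are common textbook values, not those of
[Huang and Jeng 01].

    import cv2
    import numpy as np

    def moving_hand_mask(prev_bgr, curr_bgr):
        # Motion cue: thresholded difference of consecutive grey-level frames.
        diff = cv2.absdiff(cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY),
                           cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY))
        _, motion = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

        # Edge cue: Canny edges, dilated so thin contours survive the AND.
        edges = cv2.dilate(cv2.Canny(curr_bgr, 100, 200),
                           np.ones((5, 5), np.uint8))

        # Colour cue: a fixed skin range in YCrCb space.
        ycrcb = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2YCrCb)
        skin = cv2.inRange(ycrcb, np.array([0, 133, 77], np.uint8),
                           np.array([255, 173, 127], np.uint8))

        # Hand candidates: pixels supported by all three cues.
        return cv2.bitwise_and(cv2.bitwise_and(motion, edges), skin)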
In [Huang and Huang 98], the hands were detected under the assumptions that the
hand is the only object moving against a stationary background and that the head
is relatively stationary. In [Ong and Bowden 04], a boosted cascade of classifiers
was used to detect hand shapes, with dark backgrounds and signers asked to wear
long-sleeved dark clothing. Other related work also tried to localize body parts such
as the torso [Bauer and Kraiss 02, Assan and Grobel 97] or the elbows and shoulders
[Hienz et al. 96], along with the hands and face, based on body geometry and colour
cues. This helps to reference the position and movement of the hands to the signer's
body.
Hand tracking can be done either in 2D or in 3D. In 2D, tracking approaches can be
classified into boundary-based [Huang and Huang 98, Cui and Weng 00], view-based
[Huang and Jeng 01], blob-based [Tanibata et al. 02, Imagawa and Igi 98], and
matching of motion regions [Yang et al. 02]. One of the hard problems in tracking
is occlusion; generally speaking, in most systems based on skin colour, occlusion
handling is poor and unsatisfactory. Some systems try to predict the hand location
from the model dynamics and previous frame positions under the assumption of
small, constant hand motion [Starner et al. 98, Imagawa and Igi 98].
In [Starner et al. 98], the face region was subtracted from the merged face/hand
blob, but unfortunately this method can only handle small overlaps. In [Imagawa 00],
a sliding observation window was applied over the merged face/hand blob, and the
likelihood of the window subimage was calculated to classify it into one of the
possible hand shape classes. The overlapping hands and face were distinguished in
[Tanibata et al. 02] by using hand and face texture templates; this method is not
robust to changes in hand shape, face orientation, or large changes in facial
expression.
Another interesting approach does not track the hands and face separately
[Zieren et al. 02, Sherrah 00], but rather applies probabilistic reasoning (such as
heuristic rules [Zieren et al. 02] or Bayesian networks [Sherrah 00]) to simultane-
ously assign labels to the possible hand/face regions, assuming that skin blobs can
only be assigned to the hands and thus not allowing for other skin regions in the
background. This allows more robust tracking that can deal with heavy overlap,
fast hand movement, and complex hand interactions. Multiple features were used,
such as motion, colour, orientation, size and shape of blobs, distance relative to
other body parts, and Kalman filter prediction.
In [Assan and Grobel 97, Bauer and Kraiss 02, Huang and Huang 98], uniform back-
grounds were used to simplify the problem. However, a few systems, such as
[Chen et al. 03], allow a complex, cluttered background that includes moving objects
and apply background subtraction to extract the foreground under the assumption
that the hand is constantly moving. In contrast to the above approaches, some
systems use 3D models [Vogler and Metaxas 97, Downton and Drouet 92] with
multiple cameras to estimate the body parts and/or avoid occlusions, but of course
at a great computational cost.
As skin segmentation is one of the main research areas that we will address in this
work, the next section reviews the major techniques used to detect skin pixels in
images or videos.
2.2.1 Skin segmentation review
In general, skin detection methods [Vezhnevets et al. 03] can be classified into two
groups. Pixel-based methods that classify each pixel as skin or non-skin indepen-