Chapter 2
Literature Review
This chapter reports the literature review. We start with the basic principles of image processing, then discuss the most important 3D imaging systems and some of the existing automatic methods for assessing the progress of patients with facial paralysis. The comparison between the existing 3D imaging systems allows us to outline their advantages and disadvantages, and thus to motivate the use of the SMILE system proposed by Frey et al., 1999. The problems of the SMILE system as currently used lead us to introduce a new cylindrical calibration device. In this regard, in the later sections, we discuss the most important calibration algorithms, the checkerboard pattern and corner detector, the camera model, and how this is used to map corresponding 2D points from multiple views to a single 3D point.
2.1 Image Processing
Computer vision is the discipline that tries to emulate human vision, not only as the acquisition of a two-dimensional image but also by enabling machines to understand and interpret the visual information obtained from 2D still images or video sequences. Computer vision is a rapidly growing field thanks to the availability of more accurate and less expensive cameras, as well as the high computational power of modern computers. Computer vision has a very wide scope: the camera is, in fact, a non-invasive tool that can operate at large distances if equipped with appropriate optics. The versatility of computer vision is due to the great amount of information carried by an image and the ability to extract only the information that is necessary for the purpose of the application, while the accuracy of the results can be strongly influenced by factors ranging from camera resolution to algorithm configuration. An artificial vision system is usually made of one or more cameras connected to a computer, which is used to automatically interpret the images of a real scene and thus obtain information useful for the application. What has been said so far can be summarized in the following workflow.
FIGURE 2.1: Image processing workflow
There are three types of operations that can be performed on an image.
Pixel operation: each pixel in the resulting image depends exclusively on the corresponding
pixel in the input image. This operation does not alter the spatial relationships between pixels.
Local operation: each pixel in the resulting image depends on the corresponding pixel and its
neighborhood in the input image.
Global operation: pixels in the resulting image depend on all pixels in the input image.
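As a concrete illustration (not taken from the original work), the three types of operation can be sketched in a few lines of Python with OpenCV; the file name is a placeholder and any grayscale image will do:

import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# Pixel operation: thresholding maps each pixel independently of its neighbours
_, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

# Local operation: a 5x5 mean filter uses the neighbourhood of each pixel
smoothed = cv2.blur(img, (5, 5))

# Global operation: histogram equalization depends on all pixels of the image
equalized = cv2.equalizeHist(img)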
2.1.1 Preprocessing
Low-level vision aims to correct degradations due, for example, to noise or to changes in lighting conditions within the image. This objective is typically achieved using linear and non-linear transformations of the RGB color space. Typically used color spaces are:
HSV is a non-linear color space that can be represented as a cone, where a color is defined by three parameters: the angle identifies the Hue (H), the distance from the axis identifies the Saturation (S), and the height of the cone identifies the Value (V).
CIELAB is a color space where L stands for lightness and a and b are two opponent color features; it is obtained from RGB through a non-linear transformation. The intention of CIELAB is to produce a color space that is more perceptually uniform than other color spaces. Perceptually uniform means that a change of the same amount in a color value should produce a change of about the same visual importance.
YCbCr is a family of color spaces used as part of the color image pipeline in video and digital photography systems. Y is the luma component, while Cb and Cr are the blue-difference and red-difference chroma components; in contrast to CIELAB, YCbCr is obtained from (gamma-corrected) RGB through a linear transformation.
Those color spaces, and more, are available in Matlab. The key characteristic of these transformations, besides being useful to deal with illumination changes over time, is that they are invertible: through a different interpretation of the bands you can view the original image and come back to the RGB representation. There are also other transformations used to normalize the color range; these are typically not invertible, and some examples are given in Gevers and Smeulders, 1997.
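As a minimal sketch of how these invertible transformations can be applied and reversed in practice (shown here with OpenCV in Python rather than Matlab; the file name is a placeholder):

import cv2

bgr = cv2.imread("input.png")                    # OpenCV loads color images in BGR order

hsv   = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)     # non-linear, cone-shaped space
lab   = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)     # perceptually oriented space
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)   # luma plus two chroma components

# The transformations are invertible (up to 8-bit rounding): the original image
# can be recovered by converting back from any of the three representations
bgr_back = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)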
Normalized RGB is obtained by dividing each of the R, G and B coordinates by their total sum, which yields the r, g and b components of the normalized color system. The transformation from RGB coordinates to this normalized color is given by
\[
r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B}
\tag{2.1}
\]
This transformation projects a color vector in the RGB cube onto a point on the unit plane. Two of the normalized values suffice to define the coordinates of the colour point in this plane. Since the normalized representation is redundant (b = 1 - r - g), the normalized color space is typically formulated as
\[
Y = k_1 R + k_2 G + k_3 B, \qquad T_1 = \frac{R}{R+G+B}, \qquad T_2 = \frac{G}{R+G+B}
\tag{2.2}
\]
where k1, k2 and k3 are chosen constants such that k1 + k2 + k3 = 1, Y is interpreted as the luminance of an image pixel, and T1 and T2 are chromatic variables. Note that the conversion between the RGB space and normalized RGB presents a singularity at the black vertex of the RGB cube: for R = G = B = 0, the quantities in Eq. 2.1 are undefined.
The color features c1, c2, c3 are defined as follows:
\[
c_1 = \arctan\frac{R}{\max(G,B)}, \qquad c_2 = \arctan\frac{G}{\max(R,B)}, \qquad c_3 = \arctan\frac{B}{\max(R,G)}
\tag{2.3}
\]
These color features are invariant for matte, dull objects and depend only on the sensors and the surface albedo (the "whiteness" of a surface: a reflection coefficient with a value of less than one). The color features l1, l2, l3 are photometric invariants (invariant to lighting conditions) for matte as well as for shiny surfaces. These color features uniquely determine the direction of the triangular color plane (the RGB unit cube represented as a triangle, as shown at i.stack.imgur.com/Qcv1f.png) spanned by the two reflection terms. The color features are defined as follows:
\[
l_1 = \frac{(R-G)^2}{K}, \qquad l_2 = \frac{(R-B)^2}{K}, \qquad l_3 = \frac{(G-B)^2}{K}
\tag{2.4}
\]
where $K = (R-G)^2 + (R-B)^2 + (G-B)^2$. As we can see, these color features are defined as a set of normalized colour differences.
However, the main problem of these normalizations is that they map the black and white colors of the original image onto other colors.
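A minimal NumPy sketch of Eqs. 2.1, 2.3 and 2.4 is given below; the function name is my own, and a small epsilon is added to sidestep the singularity at black pixels mentioned above:

import numpy as np

def normalized_color_features(rgb, eps=1e-6):
    """Per-pixel rgb (Eq. 2.1), c1c2c3 (Eq. 2.3) and l1l2l3 (Eq. 2.4) features."""
    R, G, B = (rgb[..., i].astype(np.float64) for i in range(3))

    s = R + G + B + eps                          # eps avoids division by zero at black pixels
    r, g, b = R / s, G / s, B / s                # Eq. 2.1

    c1 = np.arctan2(R, np.maximum(G, B))         # Eq. 2.3; arctan2 also handles zero denominators
    c2 = np.arctan2(G, np.maximum(R, B))
    c3 = np.arctan2(B, np.maximum(R, G))

    K = (R - G) ** 2 + (R - B) ** 2 + (G - B) ** 2 + eps
    l1, l2, l3 = (R - G) ** 2 / K, (R - B) ** 2 / K, (G - B) ** 2 / K   # Eq. 2.4

    return (r, g, b), (c1, c2, c3), (l1, l2, l3)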
Degradations due to noise can be corrected through an appropriate filtering operation chosen from those given in Survey of image denoising techniques. An example is the median filter, which can be used to remove salt-and-pepper noise; more complex approaches involve the use of wavelets or the estimation of the Modulation Transfer Function of the camera (Boreman, 2001).
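For instance, salt-and-pepper noise can be simulated and removed with a median filter in a few lines (a sketch with OpenCV and NumPy; the file name and noise level are arbitrary):

import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# Simulate salt-and-pepper noise on roughly 1% of the pixels
noisy = img.copy()
mask = np.random.rand(*img.shape)
noisy[mask < 0.005] = 0
noisy[mask > 0.995] = 255

# A 3x3 median filter removes the isolated outliers while preserving edges
denoised = cv2.medianBlur(noisy, 3)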
FIGURE 2.2: An example of segmentation
2.1.2 Features Extraction and Segmentation
Mid-level vision deals with the extraction of structural information from the image: features such as geometric primitives, shape, depth, size and object contours are extracted. A large number of operations is performed on the input image in order to highlight certain details. Image processing does not directly produce new information, but only uses what can be obtained from the input image; the purpose is to simplify the image by using segmentation techniques, which are designed to recognize uniform regions according to the given features, i.e. to discriminate between the objects of interest and the background, or anything else. An example is shown in Figure 2.2.
The segmentation process depends on what you want to achieve from the image; the process ends when the objects or regions of interest have been identified. There are two possible situations: either the boundary conditions and the structure of the image are known in advance, or nothing is known a priori, in which case specific methods are used to identify the regions of interest. There are two approaches to segmentation:
Similarity (region-based segmentation): the methods rely on similarities between regions.
Discontinuity (contour-based segmentation): the methods partition the image based on intensity changes (such as edges). The regions must be sufficiently different from each other and also from the background.
Segmentation algorithms typically take into account local information contained within the image (Zhang, Fritts, and Goldman, 2008); for this purpose we have to define the neighborhood of a pixel. In computer vision, the local characteristics of a neighborhood are given in terms of 4-connectivity, 8-connectivity and m-connectivity. Two pixels are said to be connected if they are spatially related and their color level (or grey level) meets a specific similarity criterion defined a priori. This definition is also useful to identify connected components within the image, namely regions of the image with uniform properties. Other examples of region-based segmentation algorithms are Watershed, described in Vincent and Soille, 1991, Region Growing, described in Shih and Cheng, 2005, and Split and Merge, described in Kelkar and Gupta, 2008.
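A minimal sketch of region-based segmentation via connected-component labeling with OpenCV follows; the Otsu threshold and the choice of 8-connectivity are arbitrary, and the file name is a placeholder:

import cv2

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# Binarize the image, then group pixels into connected components.
# connectivity=4 or connectivity=8 selects the neighbourhood definition.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
num_labels, labels = cv2.connectedComponents(binary, connectivity=8)
# 'labels' assigns the same integer to all pixels of a uniform connected region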
The goal of contour-based segmentation algorithms is to identify contours, where in the common perception a contour is the line of separation between the end of one object and the beginning of another. However, an algorithm may not have such knowledge, and therefore will only be able to detect transitions between homogeneous regions; in other words, the algorithm will highlight the boundaries between regions rather than between objects. With high probability the boundary between regions corresponds to that between objects, but in general this is not necessarily true: in fact, we can regard a person as a single item or as a collection of objects (head, knitwear, etc.). One of the most important edge detection algorithms is the Canny Edge Detector, described in Canny, 1986, which outputs an image where the contours are one pixel wide. The algorithm has some input parameters that can be used to improve the results of the edge detection. Contours can also be detected in the input image through a filtering operation; the most important filters are Sobel, Prewitt and the other edge detection techniques described in Bhardwaj and Mittal, 2012.
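A short sketch of both approaches with OpenCV is given below; the two Canny thresholds and the Sobel kernel size are arbitrary tuning parameters, and the file name is a placeholder:

import cv2

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# Canny: the two thresholds drive the hysteresis step and are the main tuning knobs
edges = cv2.Canny(gray, 100, 200)

# Sobel: horizontal and vertical derivative filters that highlight intensity transitions
grad_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)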
The background subtraction techniques described in Sobral and Vacavant, 2014 are typically used in video sequences and aim to highlight the changes within the scene. The algorithm outputs an estimation of the background (static elements within the scene, or in any case objects that are not of interest) and of the foreground (moving objects within the scene, or objects that are of interest). There are several approaches to estimate the background; the simplest technique is to update the background image using the following formula:
\[
BG^{(t)} = \alpha \, BG^{(t-1)} + (1-\alpha)\,\mathrm{currFrame}
\tag{2.5}
\]
By doing so we keep a history of the frames captured by the camera. If an object enters the scene and stops for a long time, it will be absorbed by the estimated background BG(t) and will no longer be detected as an object of interest. The value of α should be chosen by also taking into account the threshold used in the next step; in practice it is usually derived from the frame rate and the desired dwell time of objects in the foreground. The foreground image is then obtained by a subtraction and thresholding operation: the subtraction is performed between the estimated background BG(t) and currFrame, the image currently acquired from the scene. Despite its simplicity, this technique cannot be used in situations where there is, for example, vegetation or other constantly moving objects that are not of interest within the scene; the movement of the leaves generates a significant local variation of the current frame compared to the background. These situations are dealt with using a Mixture of Gaussians, a technique that estimates a background model where the value of each pixel is described by a statistical distribution.
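The following sketch contrasts the two approaches using OpenCV; the video path, the value of α and the foreground threshold are placeholders:

import cv2

cap = cv2.VideoCapture("video.mp4")            # placeholder path
mog = cv2.createBackgroundSubtractorMOG2()     # Mixture of Gaussians background model

alpha = 0.95                                   # weight of the background history (Eq. 2.5)
background = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype("float32")

    # Running average (Eq. 2.5), then subtraction and thresholding
    if background is None:
        background = gray.copy()
    background = alpha * background + (1 - alpha) * gray
    diff = cv2.absdiff(gray, background)
    _, foreground = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)

    # Mixture of Gaussians: each pixel is modelled by a statistical distribution
    foreground_mog = mog.apply(frame)

cap.release()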
There are also several feature extraction algorithms. A typical example is the Local Binary Pattern (LBP), which has been found to be a powerful feature for texture classification; a full survey of the different versions of LBP can be found in Bouwmans et al., 2016. There is also SURF, an algorithm that can be used for tasks such as object recognition, image registration, classification or 3D reconstruction; SURF was first presented by Bay et al., 2008 and is partly inspired by the scale-invariant feature transform (SIFT) introduced by Lowe, 2004. Another example is the Histogram of Oriented Gradients (HOG) algorithm: the basic idea is that the shape of an object within the image can be described by the intensity distribution of the gradients and the direction of the boundaries. A histogram of oriented gradients is built, where each bin represents an orientation of the gradient and the occurrences in each bin are weighted according to the intensity of the gradient. However, the main limitation of these techniques lies in the fact that they assume that color (RGB) images are first converted to grayscale images, as stated in Bhattacharyya, 2011.
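A brief sketch of extracting LBP and HOG features with scikit-image (assuming it is available; the file name and parameter values are placeholders):

import cv2
from skimage.feature import hog, local_binary_pattern

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# Local Binary Pattern: each pixel is encoded by comparing it with its 8 neighbours
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")

# Histogram of Oriented Gradients: gradient orientations accumulated into 9 bins per cell
hog_vector = hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

Note that both functions operate on the grayscale image, in line with the limitation mentioned above.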
2.1.3 Model Matching
High-level vision aims to build a semantic model of the scene in order to produce a classification as output, by using a classifier. Classifiers operate in two modes: training and classification. During the training phase, which can be done in different ways, the classifier is trained to recognize a given pattern. The training phase can be done in two ways:
Supervised learning: also known as learning by examples, it is performed by providing labeled samples (the training set). Each output of the classifier is compared with the corresponding label, and the classification error is computed according to a given learning rule. The classification error at the current step is used to improve the classification at the next step. These operations are repeated over the whole training set until the classification error falls below a given threshold.
Unsupervised learning: in this case we do not have a labelled training set. The training is performed by using the "similarity" between the samples, and the output of the training is a clustering of the input samples.
The dataset is thus divided into two subsets, the training set and the test set. The first is used during the learning phase, while the second is used during the evaluation phase, in order to assess the performance of the classifier. The evaluation is carried out by counting how many samples are correctly classified over the total. In any case, the samples in the training set must be described by means of features.
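A minimal scikit-learn sketch of this split-and-evaluate procedure, using a nearest-neighbour classifier and a built-in dataset as a stand-in for the extracted features:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Any feature matrix X and label vector y could be used; Iris is just a stand-in
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Evaluation: fraction of test samples that are correctly classified
accuracy = accuracy_score(y_test, clf.predict(X_test))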
In Machine Learning and Pattern Recognition, a feature is a measurable property of an observed phenomenon. Choosing features that carry a lot of information, that are independent of one another, and that are discriminating is very important for computer vision algorithms, especially for classification (the training phase of the classifier). In fact, if a feature can be obtained as a linear combination of the other features, it is not useful for classification. In addition, the initial set of features can be too redundant and too large to be handled. So, after feature extraction, a feature selection or dimensionality reduction step is usually performed (e.g. Principal Component Analysis or Linear Discriminant Analysis) in order to extract a significant subset of features; this usually makes the learning easier and also improves the performance of the classifier. The features generate a multidimensional space, called the feature space, whose dimensionality is given by the number of features; it is a vector space, to which a metric can be applied. The Euclidean or Hamming distance can be used as an index of similarity between samples within this space; obviously this assumption must hold in real and practical applications. It would be desirable that very similar samples (or objects) also have a very small Euclidean (or Hamming) distance, while very dissimilar samples (or objects) have a large one. However, there is no guarantee of that property, since it depends on the choice of features. Even if it looks simple, the association of a sample to a class is a very difficult operation: the computer must calculate the similarity distance between the unknown sample and all other samples in the feature space, and then locate the nearest one. The computational complexity can be reduced by performing some additional operations on the feature space in order to create partitions, which can be open or closed.
A partition is said to be open if there is at least one sample at infinity that is still part of the partition. In this case the algorithm has a high generalization capability and provides a response to any classification problem. However, a classification approach with open partitions involves some issues: even if the distance between the unknown sample and all other samples in the feature space is very large, the unknown sample will still be assigned to a class, and in this case the distance value suggests that there is a huge dissimilarity between the real entities.
A partition is said to be closed when only the samples that are within a given distance of the other samples in the feature space are classified (as belonging to one class); otherwise the sample is rejected.
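A minimal sketch of a closed partition implemented as nearest-neighbour classification with a rejection threshold (the function name and the threshold are my own choices):

import numpy as np

def classify_with_rejection(sample, train_X, train_y, max_distance):
    """Assign the label of the nearest training sample, or reject the sample
    if it is farther than max_distance from every training sample."""
    distances = np.linalg.norm(train_X - sample, axis=1)   # Euclidean distances
    nearest = int(np.argmin(distances))
    if distances[nearest] > max_distance:
        return None                                        # rejected: outside the closed partition
    return train_y[nearest]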
Another peculiarity of classifiers is the number of degrees of freedom, i.e. the number of parameters that the classifier can use to optimize a given function. This affects the results and the complexity of the training phase. If the number of degrees of freedom of the classifier is high enough, the classifier is able to make more complex partitions of the feature space. This allows us to have no error on the training set and a high specialization of the algorithm; however, in this way the classifier is able to correctly recognize only samples from the training set and fails to classify all other cases.
So, we understand that classification is not only the ability to distinguish objects and classify them, but also the ability to identify situations in which the class of an object cannot be specified. In summary, learning means delimiting regions that are closed, and not open, because open regions are usually dangerous. However, a problem arises: knowing that a wider region gives us greater generalization and that a smaller region corresponds to more specialization and rejection, how can we determine the trade-off between generalization and rejection-specialization? Usually, this depends on the particular application: the level of generalization is selected according to the penalty introduced as a result of an error. So, if we want to design a money validator, high rejection-specialization is needed; if instead we want to design a mail system, it should have a high generalization capability, because in this case the penalty introduced as a result of an error is smaller.
FIGURE 2.3: Training cycles of a classifier
Figure 2.3 shows the learning curve (error on the training set) and the error curve on the test set over the training cycles. The trend of the two curves is the same up to a certain number of cycles; then the error on the training set goes to zero, while the error on the test set begins to grow. This is due to the fact that the classifier has specialized on the training set, so it fails on the test set and the performance decreases. The stopping condition of the training algorithm is then determined by minimizing the error on the test set. This holds as long as we do not change the training set and the test set, because changing even one of the two will produce completely different curves. Once a training has been carried out, no one can say whether it is good or not, because there might be another dataset for which you would get different results, but you must first find it out.
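In practice this stopping rule is usually implemented as early stopping: training is halted when the error on the held-out set has not improved for a number of cycles. A generic sketch (the two callables stand for whatever training and evaluation routines the chosen classifier provides):

def train_with_early_stopping(train_one_cycle, evaluate, max_cycles=1000, patience=10):
    """train_one_cycle() runs one pass over the training set; evaluate() returns
    the current error on the held-out set. Both are supplied by the caller."""
    best_error, cycles_without_improvement = float("inf"), 0
    for _ in range(max_cycles):
        train_one_cycle()
        error = evaluate()
        if error < best_error:
            best_error, cycles_without_improvement = error, 0
        else:
            cycles_without_improvement += 1
        if cycles_without_improvement >= patience:   # held-out error stopped improving
            break
    return best_error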
We understand, then, that better results are not achieved simply by changing the classification algorithm; the burden shifts to the remaining phases of the scheme introduced previously in Figure 2.1. The basic idea is that it does not matter much which classification algorithm we use: the dataset must be well built, sufficiently varied and large enough, and the features used must be well representative of the observed phenomenon, otherwise the results will never be acceptable. This approach to data classification, described by Goodfellow, Bengio, and Courville, 2016, is known as Deep Learning.
Algorithms typically used for the purposes of this section are, for example, neural networks, support vector machines, Viola-Jones, learning vector quantization, all the variants of nearest-neighbour classification, and so on. These algorithms can also be used in segmentation tasks by classifying the pixels within the image.
2.2 Treatment and evaluation of facial paralysis
The treatment and evaluation of facial palsy require techniques and tools that are specifically dedicated to the purpose. Today, there are many tools that can be used to treat facial paralysis. These instruments are usually used in combination with methods that try to grade the progress of the facial palsy. The scale used to characterize the degree of facial paralysis is the House-Brackmann scale, described in House and Brackmann, 1985; grade I is assigned to normal function, while grade VI represents complete paralysis.
TABLE 2.1: House-Brackmann grading scale (House and Brackmann, 1985)
Grade Description
I Normal symmetrical function in all areas
II Slight weakness noticeable only on close inspection
III Obvious weakness, but not disfiguring
IV Obvious disfiguring weakness
V Motion barely perceptible
VI No movement, loss of tone, no contracture or spasm
In the following sections, we will describe the most important tools, as well as some of the most important and typically used methods for the automatic evaluation of the palsy. At the end of this section we will give a description of the SMILE system proposed by Frey et al., 1999. This system is currently used at the Vienna Medical University, so we will also describe how the system is used.
2.2.1 Imaging systems for 3D
The term "three-dimensional (3D) imaging" has referred to techniques which calculate 3D volumetric
pixels (or voxels) from 2D data acquired from a measured target. As it is stated in Tzou et al., 2014
and in Lane and Harrell, 2017 there are two different type of optical systems used for imaging process
of measuring and analyzing surfaces (x, y and z coordinates), one uses structured light and another
uses stereo photogrammetry.
Structured-light technology uses one projector and a calibrated camera. The camera captures an image of the object overlaid by a projected pattern. With knowledge of the design and geometry of the projected pattern and by observing how it is deformed by the 3D surface of the object, it is possible to estimate the 3D surface of the object and generate a 3D surface image.
There are three different strategies for stereo photogrammetry:
Active stereo photogrammetry is based on structured light. It projects a pattern onto the surface
of an object and uses two or more cameras to capture the deformation of the pattern. A 3D surface