Chapter 1
Introduction
With the increasing power of computer technology, companies and institutions can nowadays store large amounts of data at reduced cost. The amount of available data is increasing exponentially, and cheap disk storage makes it easy to keep data that was previously thrown away. A huge amount of potentially important information is locked up in databases but has not yet been explored. The growing size and complexity of these databases make it hard to analyze the data manually, so automated systems are needed to support the process. Hence there is a need for computational tools able to process these large amounts of data and extract valuable information.
In this context, Data Mining provides automated systems capable of processing large amounts of data that are already present in databases. Data Mining is used to automatically extract important patterns and trends from databases, seeking regularities or patterns that can reveal the structure of the data and answer business problems. Data Mining includes learning techniques that fall into the field of Machine Learning. The growth of databases in recent years has brought data mining to the forefront of new business technologies [WF05].
To apply and develop their new research ideas, data scientists need large quantities of data. Most of the time, however, business-valuable data is not freely available, so it is not always possible for a data expert to have access to real data. Competitions are usually the occasion for data miners to access real business data and compete with other people to find the best technique to apply to the data. The Kaggle website (http://www.kaggle.com/) is a web platform where companies have the opportunity to post their data and have it scrutinized by data scientists. In this way data experts have the opportunity to access real datasets and solve problems, with the chance to win a prize offered by the company.
The competition used for this thesis was the Claim Prediction Challenge (http://www.kaggle.com/c/ClaimPredictionChallenge) proposed by Allstate Insurance. The goal of the competition was to predict the amount of money the insurance company has to pay to its clients. This amount is an important figure for determining the premium charged to customers. Many factors contribute to the frequency and severity of car accidents, including how, where and under what conditions people drive, as well as the type of vehicle they use. Bodily Injury Liability Insurance covers other people's bodily injury or death for which the insured is responsible. The company provided data from 2005 to 2007 to contestants, who analyzed correlations between vehicle characteristics and bodily injury claims payments to predict claims payment amounts in 2008. Using the data provided, modelers evaluated how factors such as horsepower, length and number of cylinders affect the likelihood of an insured being held responsible for injuring someone in a car accident.
Figure 1.1: Claim Prediction competition
The contribution of this thesis is to analyze and discuss the advantages and disadvantages of the standard techniques used in data mining and machine learning. The thesis first presents a general overview of data mining in Chapter 2 and provides a state of the art of different learning algorithms. In Chapter 3, we detail the data and the problem addressed by this competition, and present the methodology that will be used. Chapter 4 presents and discusses the experimental results which were obtained. Chapter 5 finally concludes the thesis with a summary of the results and the methodology, together with future work which could be investigated in order to improve the work presented in this thesis.
Chapter 2
Data Mining algorithms: overview
2.1 Data Mining definition and notations
Data mining is a field of computer science that involves methods from statistics, artificial intelligence, machine learning and database management. The main goal of data mining is to find hidden patterns in large data sets. This means performing automatic analysis of the data in order to find clusters, outliers, association rules and prediction models that can explain the data. The information discovered by data mining can be learnt by machine learning algorithms and applied to new data. What is important is that the patterns found by data mining are useful to explain the data and/or make predictions from it.
Tom Mitchell [Mit97] gives a nice definition of what "learning for a computer" means:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
In a learning program, three things must be defined:

• the task
• the performance measure
• the source of experience (training experience)
Machine learning is the subfield of computer science that deals with the design of algorithms and techniques that allow computers to learn [LB09]. The goal of Machine Learning is to find algorithms that can learn automatically from the past without the need for human assistance. The algorithm is trained to learn from data and come up with a model that can be applied to new data.
These algorithms are data driven, so they are not guided by the impressions of a human expert. To see whether an algorithm has effectively learnt something useful, it is usually tested on a new data set whose outcomes are known, so that its predictions can be evaluated against the real values.
Let us introduce some data mining concepts that will be used in this thesis.
• Dataset: a collection of data about the same phenomenon, given in tabular form. The columns represent the attributes or variables; the rows are the instances/examples belonging to the dataset. Each instance has k values, one for each of the k attributes in the dataset.

• Attribute / Feature: one of the variables available in the dataset. Each column of the dataset contains the values taken by one attribute.

• Dimension: the number of attributes present in the dataset. If the dataset is composed of only two variables, its dimension is two.

• Training and test sets: different collections of data about the same phenomenon, used in supervised learning. The first is used to train an algorithm, the second to evaluate what has been learnt.

• Distribution: a function that describes the frequencies of the values within the phenomenon.

• Outlier: an observation that appears to deviate markedly from other members of the sample in which it occurs [Gru69].
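To make these notions concrete, the following minimal sketch builds a small invented dataset in Python using the pandas library; the attribute names are hypothetical and only loosely inspired by the competition data.

import pandas as pd

# Each row is an instance (example); each column is an attribute (feature).
data = pd.DataFrame({
    "horsepower":   [110, 95, 140, 80, 120, 150],
    "n_cylinders":  [4, 4, 6, 4, 6, 8],
    "claim_amount": [0.0, 250.0, 0.0, 0.0, 3100.0, 480.0],
})

n_instances, dimension = data.shape  # 6 instances, dimension 3
print(n_instances, dimension)

# Training and test sets: two disjoint collections of the same phenomenon.
# Here the first four instances are used for training, the last two for testing.
train_set = data.iloc[:4]
test_set = data.iloc[4:]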
Data mining and machine learning algorithms can be classified into supervised or unsupervised learning depending on the goal of the algorithm. Supervised methods are used when there is a variable whose value has to be predicted. Such a variable is referred to as the response or output variable. In an unsupervised method, the data are not labeled and there is no value to predict or classify.

Supervised learning algorithms generate a function that is able to map input values to output values. In these algorithms the data provide examples of the kind of relationship between the input and output variables that has to be learnt. In unsupervised learning there is no output value, just a collection of input values.
Supervised learning algorithms can be further divided into classification and regression algorithms. In classification, the goal is to predict qualitative or categorical outputs which take values in a finite set of classes (e.g. True/False, or Green, Red, Blue) without an explicit order. Categorical variables are also called factors. In regression, the output to predict is a real-valued number.
2.2 Supervised learning
In supervised learning, the training set consists of several input variables and an output variable, which is a categorical variable in the case of a classification problem and a real-valued number in the case of regression. The goal of supervised algorithms is to uncover the relationship between the input and output variables and to generalize it so that it can be applied to new data. The output variables are in general assumed to depend on the inputs. In this thesis the input variables will be called X and the output Y.
Classification and regression learning algorithms are supervised methods because they work under the supervision of an output variable. This output variable is called the class or response value. The success of supervised learning is evaluated by applying the model to a new set of data, called test data, for which the values of the output are known but not used during the learning stage. The prediction accuracy on this test set can be used as a measure of the model's ability to predict the response value.
Supervised learning therefore requires a measure of how well or how badly the model predicts the response variable. This is expressed with a loss function, which is used to assess the adequacy and effectiveness of different methods. Before applying a model to a test set, the model accuracy is usually estimated using a cross-validation approach on the training data.
The elements of a supervised learning algorithm are:

• A target operator y(x) that describes the relationship between the input and output variables. This is a stochastic process that represents the distribution of y(x) given x, i.e. $P(y \mid x)$.

• A training set that contains the data from which the relationship between the input and output variables can be learnt.

• A prediction model that estimates the input/output relationship by means of a function and returns a predicted value for each example.

• A loss function $L(y, \hat{y})$ that is used to measure the goodness of the model by means of the differences between the predicted output $\hat{y}$ and the actual output $y$.

• A learning procedure that tests different model parameters in order to minimize the loss function and find the best model for the target function.
In regression settings the standard choice is the quadratic loss, $L(y, \hat{y}) = (y - \hat{y})^2$, which quantifies the Euclidean distance between the predicted and the desired output. The popularity of this loss function mainly stems from the fact that it is differentiable, which considerably simplifies the optimization procedures related to the parametric identification of a prediction model (see Sections 2.2.2 and 2.2.3).
In classification problems, the notion of Euclidean distance is not appropriate for assessing the loss, since the output space is a finite set of classes that are not necessarily ordered. The most popular loss function, called the misclassification loss, is defined as
$$L_{0/1} : \mathcal{Y} \times \mathcal{Y} \to \{0, 1\}, \qquad L_{0/1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases} \tag{2.2}$$
and quantifies the match between the predicted and the desired output by a binary indicator.
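As an illustration, here is a minimal sketch of this 0/1 loss in Python; the class labels are invented.

# 0/1 (misclassification) loss: 0 if the prediction matches the true class, 1 otherwise.
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

# Averaging the loss over a labelled set gives the misclassification rate.
y_true = ["Red", "Green", "Blue", "Red"]
y_pred = ["Red", "Blue",  "Blue", "Red"]
error_rate = sum(zero_one_loss(y, p) for y, p in zip(y_true, y_pred)) / len(y_true)
print(error_rate)  # 0.25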
More precisely, the learning procedure consists, on the basis of a training set, in producing a model $h_\theta(x) = h(x, \theta)$ with $\theta \in \Lambda$. The goal is to minimize the loss between the prediction model outputs $\hat{y}$ and the actual outputs $y$. Since the parameters $\theta$ completely define the prediction model, the parameter space $\Lambda$ is also referred to as the model space. The learning procedure is an essential part of a learning task and is the subject of Section 2.2.2.
An overview of the supervised learning setting is given in Figure 2.1: a training set made of input and output observations of the target operator is used by a learning procedure to identify a prediction model of the target operator. The goal is to minimize the error between the true output $y$ and the predicted output $\hat{y}$, quantified by the loss function $L(y, \hat{y})$.
Example: let us assume that the true relationship between x and y is defined by the stochastic process $y(x) = x^2 + \nu$, where $\nu \sim N(0, \sigma_\nu^2)$ is a Gaussian random noise, and that a training set of N = 100 examples has been collected. A linear model of the form $\hat{y} = \theta_1 + \theta_2 x$, although not optimal, could be used as a first try to represent the relationship. The parameter space is $\mathbb{R}^2$ and the vector of parameters is $\theta = (\theta_1, \theta_2)$. The least squares technique, based on the quadratic loss function, can be used as the learning algorithm to estimate the parameters $\theta$. This technique will be detailed in Section 2.2.3.
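The following sketch reproduces the example numerically in Python; the range of x and the noise standard deviation are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-2, 2, size=N)           # inputs (assumed range)
nu = rng.normal(0.0, 0.5, size=N)        # Gaussian noise (assumed sigma = 0.5)
y = x ** 2 + nu                          # the true, unknown target operator

# Least squares estimate of theta = (theta1, theta2), i.e. the parameters
# minimizing the quadratic loss over the training set.
X = np.column_stack([np.ones(N), x])     # design matrix for y_hat = theta1 + theta2 * x
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta

print("theta =", theta)
print("mean quadratic loss =", np.mean((y - y_hat) ** 2))

Because the linear model cannot capture the quadratic shape of the target, the residual loss stays large no matter how the parameters are tuned, which anticipates the discussion of underfitting below.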
Figure 2.1: Supervised learning actors and procedure [LB09].
The goal of the learning procedure is to obtain a model which is able to find the dependency between the output and input variables. The model should be general enough to be applied to new data and return a small error. In general, complex models tend to overfit, i.e., they give good performance when applied to the training data but poor results when applied to new data. On the contrary, simple models tend to underfit the data, since they cannot represent important information about the phenomenon. Hence a model has to be general enough to give good performance on different datasets without losing discrimination power. The assessment of this generalization ability is called validation, and it is done using techniques such as cross-validation.
A k-fold cross-validation consists in partitioning the training data into k smaller subsets of equal size. Then k − 1 subsets are used as a training set and the remaining subset as a test set. This makes it possible to fit the model on a new training set and assess its generalization error on the subset used as the test set. This partition of the training set is repeated k times, and each time the subset used for testing is changed. At the end, the validation procedure provides an estimate of the generalization error, which can be used to select between different model parameters.
In the case of a 10-fold cross-validation, the training data is divided into 10 different parts of equal size. Then one tenth of the instances present in the training set is used for testing and the remaining nine tenths for training (Figure 2.2). Once the first round of validation is completed, another subset of equal size is used for testing, and the remaining 90% of the instances are used for training as before (Figure 2.3). The process is iterated 10 times to ensure that all instances become part of both the training and the test set.
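A minimal sketch of this 10-fold procedure in Python, using scikit-learn; the dataset and the chosen model (a decision tree) are placeholders, not the methods used later in this thesis.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((500, 5))                 # 500 instances, 5 attributes (invented data)
y = rng.random(500)                      # real-valued output (regression setting)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeRegressor().fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[test_idx])
    fold_errors.append(mean_squared_error(y[test_idx], y_hat))

# The average error over the 10 folds estimates the generalization error and can
# be used to compare different models or parameter settings.
print(np.mean(fold_errors))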