Chapter 1
Introduction
With the increasing power of computer technology, companies and institutions can nowadays store large amounts of data at reduced cost. The amount of available data is increasing exponentially, and cheap disk storage makes it easy to keep data that was previously thrown away. A huge amount of potentially important information is locked up in databases but has not yet been explored. The growing size and complexity of these databases make it hard to analyze the data manually, so automated systems are needed to support the process. Hence there is a need for computational tools able to process these large amounts of data and extract valuable information.
In this context, Data Mining provides automated systems capable of processing large amounts of data that are already present in databases. Data Mining is used to automatically extract important patterns and trends from databases, seeking regularities or patterns that can reveal the structure of the data and answer business problems. Data Mining includes learning techniques that fall into the field of Machine Learning. The growth of databases in recent years has brought data mining to the forefront of new business technologies [WF05].
To apply and develop their new research ideas, data scientists need large quantities of data. Most of the time, however, business-valuable data is not freely available, so it is not always possible for a data expert to have access to real data. Competitions are usually the occasion for data miners to access real business data and compete with other people to find the best technique to apply to the data. The Kaggle website (http://www.kaggle.com/) is a web platform where companies have the opportunity to post their data and have it scrutinized by data scientists. In this way data experts have the opportunity to access real datasets and solve problems, with the chance to win a prize offered by the company.
The competition used for this thesis was the Claim Prediction Challenge (http://www.kaggle.com/c/ClaimPredictionChallenge) proposed by Allstate Insurance. The goal of the competition was to predict the amount of money the insurance company has to pay to its clients. This amount is an important figure for determining the premium charged to customers. Many factors contribute to the frequency and severity of car accidents, including how, where and under what conditions people drive, as well as the type of vehicle they use. Bodily Injury Liability Insurance covers other people's bodily injury or death for which the insured is responsible. The company provided data from 2005 to 2007 to contestants, who analyzed correlations between vehicle characteristics and bodily injury claims payments to predict claims payment amounts in 2008. Using the data provided, modelers evaluated how factors such as horsepower, length and number of cylinders affect the likelihood of an insured being held responsible for injuring someone in a car accident.
Figure 1.1: Claim Prediction competition
The contribution of this thesis is to analyze and discuss the advantages and disadvantages of the standard techniques used in data mining and machine learning. The thesis first presents a general overview of data mining in Chapter 2 and provides a state of the art of different learning algorithms. In Chapter 3, we detail the data and the problem addressed by this competition, and present the methodology that will be used. Chapter 4 presents and discusses the experimental results which were obtained. Chapter 5 finally concludes the thesis with a summary of the results and the methodology, together with future work which could be investigated in order to improve the work presented in this thesis.
Chapter 2
Data Mining algorithms: overview
2.1 Data Mining definition and notations
Data mining is a field of computer science that involves methods from statistics, artificial intelligence, machine learning and database management. The main goal of data mining is to find hidden patterns in large data sets. This means performing automatic analysis of the data in order to find clusters, outliers, association rules and prediction models that can explain the data. The information discovered by data mining can be learnt by machine learning algorithms and applied to new data. What is important is that the patterns found by data mining are useful to explain the data and/or make predictions from it.
Tom Mitchell [Mit97] gives a nice definition of what "learning for a computer" means:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
In a learning program, three things must be defined:

• the task
• the performance measure
• the source of experience (training experience)
Machine learning is the subfield of computer science that deals with the design of algorithms and techniques that allow computers to learn [LB09]. The goal of Machine Learning is to find algorithms that can learn automatically from the past without the need for human assistance. The algorithm is trained to learn from data and come up with a model that can be applied to new data.
These algorithms are data driven, so they are not guided by the impressions of a human expert. To see whether an algorithm has effectively learnt something useful, it is usually tested on a new data set whose outcomes are known, so that its predictions can be evaluated against the real values.
Let us introduce some data mining concepts that will be used in this thesis.
• Dataset: a collection of data about the same phenomenon, given in tabular form. The columns represent the attributes or variables; the rows are the instances/examples belonging to the dataset. Each instance has k values, one for each of the k attributes in the dataset.

• Attribute / Feature: one of the variables available in the dataset. Each column of the dataset contains the values taken by one attribute.

• Dimension: the number of attributes present in the dataset. If the dataset is composed of only two variables, its dimension is two.

• Training and test sets: different collections of data about the same phenomenon, used in supervised learning. The first is used to train an algorithm, the second to evaluate what has been learnt.

• Distribution: a function that describes the frequencies of the values within the phenomenon.

• Outlier: an observation that appears to deviate markedly from other members of the sample in which it occurs [Gru69].
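To make these notions concrete, the following minimal sketch builds a small invented dataset in Python using the pandas library; the attribute names are hypothetical and only loosely inspired by the competition data.

import pandas as pd

# Each row is an instance (example); each column is an attribute (feature).
data = pd.DataFrame({
    "horsepower":   [110, 95, 140, 80, 120, 150],
    "n_cylinders":  [4, 4, 6, 4, 6, 8],
    "claim_amount": [0.0, 250.0, 0.0, 0.0, 3100.0, 480.0],
})

n_instances, dimension = data.shape  # 6 instances, dimension 3
print(n_instances, dimension)

# Training and test sets: two disjoint collections of the same phenomenon.
# Here the first four instances are used for training, the last two for testing.
train_set = data.iloc[:4]
test_set = data.iloc[4:]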
Data mining and machine learning algorithms can be classified into supervised or unsupervised learning depending on the goal of the algorithm. Supervised methods are used when there is a variable whose value has to be predicted. Such a variable is referred to as the response or output variable. In an unsupervised method, the data are not labeled and there is no value to predict or classify.

Supervised learning algorithms generate a function that is able to map input values to output values. In these algorithms the data provide examples of the kind of relationship between the input and output variables that has to be learnt. In unsupervised learning there is no output value, just a collection of input values.
Supervised learning algorithms can be further divided into classification and regression algorithms. In classification, the goal is to predict qualitative or categorical outputs which take values in a finite set of classes (e.g. True/False, or Green, Red, Blue) without an explicit order. Categorical variables are also called factors. In regression, the output to predict is a real-valued number.
2.2 Supervised learning
In supervised learning, the training set consists of several input variables and an output variable, which is a categorical variable in the case of a classification problem and a real-valued number in the case of regression. The goal of supervised algorithms is to uncover the relationship between the input and output variables and to generalize it so that it can be applied to new data. The output variables are in general assumed to depend on the inputs. In this thesis the input variables will be called X and the output Y.
Classification and regression learning algorithms are supervised methods because they work under the supervision of an output variable. This output variable is called the class or response value. The success of supervised learning is evaluated by applying the model to a new set of data, called test data, for which the values of the output are known but not used during the learning stage. The prediction accuracy on this test set can be used as a measure of the model's ability to predict the response value.
Supervised learning therefore requires a measure of how well or how badly the model predicts the response variable. This is expressed with a loss function, which is used to assess the adequacy and effectiveness of different methods. Before applying a model to a test set, the model accuracy is usually estimated using a cross-validation approach on the training data.
The elements of a supervised learning algorithm are:

• A target operator y(x) that describes the relationship between the input and output variables. This is a stochastic process that represents the distribution of y(x) given x, i.e. $P(y \mid x)$.

• A training set that contains the data from which the relationship between the input and output variables can be learnt.

• A prediction model that estimates the input/output relationship by means of a function and returns a predicted value for each example.

• A loss function $L(y, \hat{y})$ that is used to measure the goodness of the model by means of the differences between the predicted output $\hat{y}$ and the actual output $y$.

• A learning procedure that tests different model parameters in order to minimize the loss function and find the best model for the target function.
In regression settings the standard choice is the quadratic loss, $L(y, \hat{y}) = (y - \hat{y})^2$, which quantifies the Euclidean distance between the predicted and the desired output. The popularity of this loss function mainly stems from the fact that it is differentiable, which considerably simplifies the optimization procedures related to the parametric identification of a prediction model (see Sections 2.2.2 and 2.2.3).
In classification problems, the notion of Euclidean distance is not appropriate for assessing the loss, since the output space is a finite set of classes that are not necessarily ordered. The most popular loss function, called the misclassification loss, is defined as
$$L_{0/1} : \mathcal{Y} \times \mathcal{Y} \to \{0, 1\}, \qquad L_{0/1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases} \tag{2.2}$$
and quantifies the match between the predicted and the desired output by a binary indicator.
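As an illustration, here is a minimal sketch of this 0/1 loss in Python; the class labels are invented.

# 0/1 (misclassification) loss: 0 if the prediction matches the true class, 1 otherwise.
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

# Averaging the loss over a labelled set gives the misclassification rate.
y_true = ["Red", "Green", "Blue", "Red"]
y_pred = ["Red", "Blue",  "Blue", "Red"]
error_rate = sum(zero_one_loss(y, p) for y, p in zip(y_true, y_pred)) / len(y_true)
print(error_rate)  # 0.25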
More precisely, the learning procedure consists, on the basis of a training set, in producing a model $h_\theta(x) = h(x, \theta)$ with $\theta \in \Lambda$. The goal is to minimize the loss between the prediction model outputs $\hat{y}$ and the actual outputs $y$. Since the parameters $\theta$ completely define the prediction model, the parameter space $\Lambda$ is also referred to as the model space. The learning procedure is an essential part of a learning task and is the subject of Section 2.2.2.
An overview of the supervised learning setting is given in Figure 2.1: a training set made of input and output observations of the target operator is used by a learning procedure to identify a prediction model of the target operator. The goal is to minimize the error between the true output $y$ and the predicted output $\hat{y}$, quantified by the loss function $L(y, \hat{y})$.
Example: let us assume that the true relationship between x and y is defined by the stochastic process $y(x) = x^2 + \nu$, where $\nu \sim N(0, \sigma_\nu^2)$ is a Gaussian random noise, and that a training set of N = 100 examples has been collected. A linear model of the form $\hat{y} = \theta_1 + \theta_2 x$, although not optimal, could be used as a first try to represent the relationship. The parameter space is $\mathbb{R}^2$ and the vector of parameters is $\theta = (\theta_1, \theta_2)$. The least squares technique, based on the quadratic loss function, can be used as the learning algorithm to estimate the parameters $\theta$. This technique will be detailed in Section 2.2.3.
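The following sketch reproduces the example numerically in Python; the range of x and the noise standard deviation are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-2, 2, size=N)           # inputs (assumed range)
nu = rng.normal(0.0, 0.5, size=N)        # Gaussian noise (assumed sigma = 0.5)
y = x ** 2 + nu                          # the true, unknown target operator

# Least squares estimate of theta = (theta1, theta2), i.e. the parameters
# minimizing the quadratic loss over the training set.
X = np.column_stack([np.ones(N), x])     # design matrix for y_hat = theta1 + theta2 * x
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta

print("theta =", theta)
print("mean quadratic loss =", np.mean((y - y_hat) ** 2))

Because the linear model cannot capture the quadratic shape of the target, the residual loss stays large no matter how the parameters are tuned, which anticipates the discussion of underfitting below.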
Figure 2.1: Supervised learning actors and procedure [LB09].
The goal of the learning procedure is to obtain a model which is able to find the dependency between the output and input variables. The model should be general enough to be applied to new data and return a small error. In general, complex models tend to overfit, i.e., they give good performance when applied to the training data but poor results when applied to new data. On the contrary, simple models tend to underfit the data, since they cannot represent important information about the phenomenon. Hence a model has to be general enough to give good performance on different datasets without losing discrimination power. The assessment of this generalization ability is called validation, and it is done using techniques such as cross-validation.
A k-fold cross-validation consists in partitioning the training data into k smaller subsets of equal size. Then k − 1 subsets are used as a training set and the remaining subset as a test set. This makes it possible to fit the model on a new training set and assess its generalization error on the subset used as the test set. This partition of the training set is repeated k times, and each time the subset used for testing is changed. At the end, the validation procedure provides an estimate of the generalization error, which can be used to select between different model parameters.
In the case of a 10-fold cross-validation, the training data is divided into 10 different parts of equal size. Then one tenth of the instances present in the training set is used for testing and the remaining nine tenths for training (Figure 2.2). Once the first round of validation is completed, another subset of equal size is used for testing, and the remaining 90% of the instances are used for training as before (Figure 2.3). The process is iterated 10 times to ensure that all instances become part of both the training and the test set.
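A minimal sketch of this 10-fold procedure in Python, using scikit-learn; the dataset and the chosen model (a decision tree) are placeholders, not the methods used later in this thesis.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((500, 5))                 # 500 instances, 5 attributes (invented data)
y = rng.random(500)                      # real-valued output (regression setting)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeRegressor().fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[test_idx])
    fold_errors.append(mean_squared_error(y[test_idx], y_hat))

# The average error over the 10 folds estimates the generalization error and can
# be used to compare different models or parameter settings.
print(np.mean(fold_errors))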