1. Introduction
Football is one of the most popular sports in the world, and matches are watched every day all over the globe. Consequently, betting on the outcome of football matches attracts considerable interest from football supporters and casual betting fans alike. Every week a huge number of bookmakers offer a vast range of odds on the various results of a match. Bets can be placed on the final outcome, but also on the exact score, on the half-time and full-time results, on the margin of victory, and so on. Furthermore, thanks in part to the global diffusion of the internet, in the last few years so-called "in-running betting" has become more and more common among bettors, since people can place a bet while they are following the match in progress.
Unlike in other types of betting, such as horse-racing, the odds are given around one week before the football matches are played. This makes it possible to compare bookmakers' odds in detail and to identify who offers the most profitable odds for each match. That is why a statistical model able to accurately predict the probabilities of the outcome of football matches has the potential to form the basis of an optimal betting strategy.
Various proposals have been made for modelling the outcome of football matches. For a good betting strategy, however, probabilities must be estimated on a team-specific basis, so that the probabilities of the various match outcomes between two specific teams on a particular date can be calculated. The first complete model able to allow for the different team effects, and also for the fluctuating performance of individual teams, was developed by Dixon and Coles (1997). There, the goals scored by the home and the away teams follow different Poisson distributions, and a form of dependence between them is introduced.
Recent studies have pointed out that, instead of modelling the number of goals directly, focusing on the difference of goals within a match can lead to better results. The Skellam distribution has been indicated as an optimal approximation in this context, although the Normal distribution also seems to be a fairly good model.
Dixon and Robinson (1998) focused on in-running betting and were the first to develop a model able to give information on the score behaviour over the 90 minutes of single future matches. They essentially treated the numbers of goals scored by the two teams as interacting Poisson processes, obtaining a satisfactory fit to the data. Furthermore, by looking at the goal-times data, they investigated the common "immediate strike back" theory, according to which a team is more vulnerable just after it has scored a goal, but they found no evidence to support it.
Section 2 reviews the most important literature in this context, tracing the improvements that led to the models we consider. The data we used are described in Section 3. In Section 4 we analyse the Dixon-Coles model and apply it to our dataset, in order to see whether our results agree with those of Dixon and Coles (1997). In Section 5 we introduce the difference of goals and explain the advantages of using it. We build two basic models, both based on discrete versions of the Normal distribution, to model the difference of goals, and we compare the estimated results with those obtained by models based on the Skellam distribution. Section 6 defines the Dixon-Robinson model, whereas Section 7 investigates the clustering of goal times, using self-exciting point processes, to see whether the scoring rate within a match becomes higher after a goal is scored. In Section 8 we state our final conclusions and suggest refinements which, we believe, could lead to further improvements.
2. Literature Review
Early references to statistical modelling of football data concentrate mainly on the distribution of the number of goals scored in a match. Moroney (1956) suggested that the number of goals scored by a team was not perfectly fitted by a Poisson distribution, but that with a "modified Poisson", which allows for variability in the expectation, the fit was much better. Observing that possession of the ball in a football match can result in a goal or a non-goal, Reep and Benjamin (1968) showed that passing-move distributions closely fit a Negative Binomial distribution, confirming Moroney's (1956) findings. They came to the conclusion that "chance does dominate the game".
Considering other ball games as well, Reep, Pollard and Benjamin (1971) showed that the same Negative Binomial distribution can apply to the number of goals scored by a team, regardless of the quality of that team or of the opposition. Unconvinced by this, Hill (1974) showed that football experts were able, before the season had started, to predict with some success the final league table positions: therefore, certainly over a whole season, skill rather than chance dominates the game. Most people who watch football would probably agree: whilst in a single match chance plays a considerable role (for example, missed scoring opportunities, dubious offside decisions and shots hitting the crossbar can drastically affect the result), over several matches luck plays much less of a part. Along these lines, the first model predicting outcomes of football matches between teams with different skills was proposed by Maher (1982). Teams are not identical: each one has its own inherent quality, and when a good team plays against a weak one we should expect the good team to have a high probability of winning and of scoring several goals. Unlike
previous works, where a single Negative Binomial distribution was fitted to scores from all matches, here each match has its own fitted Poisson distribution. In Maher's (1982) model, the home and the away goals scored during the match are assumed to be two independent Poisson-distributed variables, whose parameters are defined in terms of the attacking and defensive skills of each team, allowing for the advantage of playing at home.
In order to obtain a profitable betting strategy, and to develop a statistical model able to provide better estimates of the probabilities than the subjective estimates implied by bookmakers' odds, Dixon and Coles (1997) made two important improvements to the previous model. Firstly, they observed and modelled the dependence between the home and the away goals in low-scoring games (i.e. 0-0, 1-0, 0-1 and 1-1). In addition, they pointed out a structural limitation: the parameters were static, i.e. teams were assumed to have a constant performance rate over time. For this reason, they allowed the parameters to be dynamic, incorporating the idea that a team's performance is likely to be more closely related to its performance in recent matches than in earlier ones. This was achieved through a weighting function that gives greater weight in the likelihood to recent matches and lesser weight to distant ones.
Building on the analysis of full-time scores by Dixon and Coles (1997), Dixon and Robinson (1998) developed a model able to give information on the score behaviour over the 90 minutes of single future matches. Although the Poisson distribution was found to fit the full-time results reasonably well, by looking at the goal-times data they showed a clear deviation from a homogeneous model over the 90 minutes of a match. They demonstrated that this deviation is due to a gradual increase in scoring rates during the match and to the dependence of the scoring rates on the current score.
Instead of modelling the number of goals directly, Karlis and Ntzoufras
(2008) focused on the difference of the number of goals, i.e. the margin of victory.
Using the Skellam distribution to describe it, they eliminated the correlation imposed by the fact that the two teams compete against each other, and they did not need to assume that the goals scored by each team are marginally Poisson-distributed. A model like this can be used to predict the outcome of the game as well as for betting purposes related to the margin of victory. However, it cannot predict the exact final score.
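The difference-of-goals idea can be made concrete with a short sketch. The Skellam distribution is the distribution of the difference of two independent Poisson counts; the snippet below (a minimal Python illustration with made-up scoring rates, not the authors' code) computes its pmf by truncated summation. Note that Karlis and Ntzoufras (2008) do not require the marginal goal counts to be Poisson; the independent-Poisson construction here is simply the easiest way to generate a Skellam variable.

```python
import math

def poisson_pmf(n, lam):
    """Poisson probability mass function."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def skellam_pmf(z, lam_h, lam_a, max_goals=40):
    """P(X - Y = z) for independent X ~ Poisson(lam_h), Y ~ Poisson(lam_a),
    obtained by summing the joint pmf over a truncated support."""
    total = 0.0
    for y in range(max_goals + 1):
        x = y + z
        if x >= 0:
            total += poisson_pmf(x, lam_h) * poisson_pmf(y, lam_a)
    return total

# illustrative scoring rates, not estimated from the thesis data
margin_pmf = {z: skellam_pmf(z, 1.5, 1.1) for z in range(-10, 11)}
```

The pmf over the plausible range of margins sums to one (up to truncation error), and small margins are far more likely than large ones, as in Table 1.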
3. Data
Each football match played yields a great deal of information: obviously the final score, the times of the goals and who scored, but also the number of corner kicks, the number of offsides, and so on.
An individual team's performance in each particular game could also be affected by many external factors: newly signed players or the sacking of a manager, for example. Although this information is also available, it is less easily formalized and its qualitative value is subjective. Consequently, throughout our analysis we focus only on the two most important variables that determine the outcome of a match: the number of goals scored by the home and the away team and, where available, the minutes in which the goals were scored, i.e. the goal times.
First, in order to avoid errors later on, we perform a simple preliminary check on the consistency between the final results and the goal-times data, verifying automatically that every reported score is consistent with the respective goal times (see R-code 1 in Appendix C).
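The logic of the check is straightforward; the thesis implements it in R (R-code 1, Appendix C), but a minimal sketch looks like the following. The record format used here, a (minute, side) pair per goal, is a hypothetical stand-in for the actual data layout.

```python
def consistent(home_goals, away_goals, goal_events):
    """True when the reported final score matches the goal-time records.
    Each event is a (minute, side) pair with side 'H' (home) or 'A' (away);
    this record format is an illustrative assumption, not the thesis format."""
    h = sum(1 for _, side in goal_events if side == "H")
    a = sum(1 for _, side in goal_events if side == "A")
    return (h, a) == (home_goals, away_goals)

ok = consistent(2, 1, [(12, "H"), (45, "A"), (78, "H")])  # score 2-1 matches events
bad = consistent(2, 0, [(12, "H")])                       # score says 2-0, events say 1-0
```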
Data on 8563 goal times have been collected, covering 3312 league matches over the period 1999-2009 for 36 Italian football clubs from Serie A, the professional league competition at the top of the Italian football league system.
In Table 1 we report an initial analysis of the results data: the home team goals (x), the away team goals (y), the difference of goals (z = x − y) and the total goals (tot = x + y). In Figure 1 we then show some summarizing histograms.
Tab. 1. Initial analysis of the results data: home team goals (x), away team goals (y), difference of
goals (z) and total goals (tot)
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 1.0 1.0 1.5 2.0 7.0
table(x)
0 1 2 3 4 5 6 7
710 1126 849 433 134 52 7 1
________________________________________________________
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 1.085 2.000 6.000
table(y)
0 1 2 3 4 5 6
1141 1185 657 236 79 12 2
________________________________________________________
summary(z)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.0000 0.0000 0.0000 0.4146 1.0000 7.0000
table(z)
-6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7
1 3 25 87 216 474 975 795 453 201 60 16 5 1
________________________________________________________
summary(tot)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.585 4.000 10.000
table(tot)
0 1 2 3 4 5 6 7 8 9 10
283 606 833 704 491 226 117 38 10 3 1
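Table 1 was produced with R's summary() and table(); the same frequency tables can be sketched in a few lines (Python here for illustration, and with toy scores rather than the Serie A data):

```python
from collections import Counter

def freq_and_mean(values):
    """Sorted frequency table and mean, mirroring R's table() and mean()."""
    freq = dict(sorted(Counter(values).items()))
    return freq, sum(values) / len(values)

# toy (home, away) scores, not the thesis dataset
scores = [(2, 1), (0, 0), (3, 1), (1, 1), (0, 2)]
diff_freq, diff_mean = freq_and_mean([h - a for h, a in scores])  # z = x - y
tot_freq, tot_mean = freq_and_mean([h + a for h, a in scores])    # tot = x + y
```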
Matches are played over two periods of 45 minutes each. Goal times are generally recorded to the integer part of the minute, although there are often discrepancies between sources; as a result, several goals may be recorded in the same minute. Figures 2(a), 2(b) and 2(c) are histograms of goal times for matches that resulted in a home win, a draw and an away win respectively, and Figure 2(d) shows all the goal times together. As Dixon and Robinson (1998) noted, two features are evident from Figure 2. First, a noticeably high number of goals are scored in the last part of each half (around the 45th and the 90th minute). This excess is due to injury time, usually between 0 and 5 minutes, added on by the referee: goals scored in injury time are recorded as 45th- or 90th-minute goals for the first and second halves respectively. The second feature, though not
Fig. 1. Histograms of scored goals: distributions of goals scored by home teams (a) and by away
teams (b), distributions of the difference of goals (c) and the total goals (d)
less important, is that the number of goals scored increases steadily throughout the 90 minutes. In the description of the Dixon-Robinson model (see Section 6) we will see what drives this inhomogeneity during a match, and we will treat a factor that is essential in the estimation of the scoring rate, namely the current score.
Fig. 2. Histograms of goal times: distributions of all goal times for matches that ended in a home
win (a), a draw (b), and an away win (c); distribution for all matches (d)
4. The Dixon-Coles Model
As we have seen in the literature review, Dixon and Coles (1997) were the first to develop a model able to allow for the various features that a statistical model for football matches requires.
After explaining their model and its main points, we calculate the predicted outcomes using our dataset and compare these results with those obtained by Dixon and Coles (1997). We thought it important to repeat the analysis of Dixon and Coles (1997) on our available data, since we are dealing with a completely different dataset. In addition, it is interesting to see whether there are any significant discrepancies between our results and theirs, and possibly to understand what led us to different conclusions.
Their data comprise 6629 full-time match results from the seasons 1992-93, 1993-94 and 1994-95, covering the English Premier League and divisions 1, 2 and 3 of the English Football League; they considered English Cup match results as well. Their dataset differs from ours in four important respects:
(a) the country: there are substantial differences between Italian and English football (for example, an English football match is in general more physical, whereas Italian matches are usually more technical), so it is not entirely sensible to compare a match between two Italian teams with a match between two English teams;
(b) the range of teams: we consider only teams from the top division, whereas they also considered teams from three lower divisions;
(c) the types of matches: we analyse only league matches, whereas they also have match results from the national cup;
(d) the time span: football matches from the first half of the 1990s probably have different features and quality from matches of the period we analysed.
Before starting the description of the Dixon-Coles model, one small detail should be noted. We actually use a slightly different version of the model (essentially just a reparameterization) that puts the parameters on a logarithmic scale compared with the "original" Dixon-Coles model: we made this small change because it simplifies the computational work.
4.1. The Model
In a match between teams indexed i and j, let X_{i,j} and Y_{i,j} be the number of goals scored by the home and the away sides respectively. Then:

    X_{i,j} ~ Poisson(exp(μ + τ + a_i + b_j))
    Y_{i,j} ~ Poisson(exp(μ + a_j + b_i))                                  (4.1)

where X_{i,j} and Y_{i,j} are independent, μ is a constant parameter, a_i and b_i are the attack and defence parameters of the i-th (home) team, a_j and b_j are the attack and defence parameters of the j-th (away) team, and τ is a parameter which allows for the home effect, i.e. the advantage of playing at home.
Dixon and Coles (1997) proposed an initial modification of the basic model (4.1) that allows for dependence between the home team and away team goals in low-scoring matches:
    Pr(X_{i,j} = x, Y_{i,j} = y) = dep_{λ_h, λ_a}(x, y) · (λ_h^x exp(−λ_h) / x!) · (λ_a^y exp(−λ_a) / y!)          (4.2)

where

    λ_h = exp(μ + τ + a_i + b_j),   λ_a = exp(μ + a_j + b_i)

and

    dep_{λ_h, λ_a}(x, y) =  1 − λ_h λ_a ρ   if x = y = 0
                            1 + λ_h ρ       if x = 0, y = 1
                            1 + λ_a ρ       if x = 1, y = 0
                            1 − ρ           if x = y = 1
                            1               otherwise

In this model ρ enters as a dependence parameter and has to satisfy the following constraint:

    max(−1/λ_h, −1/λ_a) ≤ ρ ≤ min(1/(λ_h λ_a), 1)

Obviously ρ = 0 corresponds to independence; otherwise the independence distribution is perturbed for events with x ≤ 1 and y ≤ 1. It is easily checked that the corresponding marginal distributions remain Poisson with means λ_h and λ_a respectively.
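A minimal sketch of (4.2) (Python for illustration; the parameter values used here are arbitrary, not estimates) makes the low-score adjustment, and the Poisson-marginal property noted above, easy to verify numerically:

```python
import math

def poisson_pmf(n, lam):
    """Poisson probability mass function."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def dep(x, y, lam_h, lam_a, rho):
    """Dixon-Coles dependence adjustment for the four low-scoring outcomes."""
    if (x, y) == (0, 0):
        return 1 - lam_h * lam_a * rho
    if (x, y) == (0, 1):
        return 1 + lam_h * rho
    if (x, y) == (1, 0):
        return 1 + lam_a * rho
    if (x, y) == (1, 1):
        return 1 - rho
    return 1.0

def joint_pmf(x, y, lam_h, lam_a, rho):
    """Pr(X = x, Y = y) as in (4.2)."""
    return dep(x, y, lam_h, lam_a, rho) * poisson_pmf(x, lam_h) * poisson_pmf(y, lam_a)
```

Summing joint_pmf over all y for fixed x recovers the Poisson marginal with mean λ_h, confirming the remark above.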
As we have already mentioned, Dixon and Coles (1997) highlighted another significant limitation. The model (4.1) has static parameters, i.e. the attack and defence parameters of each team are regarded as constant through time. Nevertheless, a team's performance tends to be dynamic, varying from one time period to another, and this behaviour should be
incorporated in the model. In particular a team's performance is likely to be more
closely related to their performance in recent matches than in earlier matches.
For this reason Dixon and Coles (1997) assumed that the parameters of each team can vary from match to match, and that historical information is of less value than recent information. Under these assumptions, they determined parameter estimates for each time point t based on the history of match scores up to time t, building a "pseudo-likelihood" for each time point t:
    L_t(a_i, b_i, ρ, τ, μ; i = 1, …, n) = ∏_{k ∈ A_t} { dep_{λ_{h,k}, λ_{a,k}}(x_k, y_k) · exp(−λ_{h,k}) λ_{h,k}^{x_k} · exp(−λ_{a,k}) λ_{a,k}^{y_k} }^{ϕ(t − t_k)}          (4.3)

where t_k is the time at which match k was played, A_t = {k : t_k < t} and ϕ is a non-increasing function of time.
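To make the role of ϕ concrete, here is a stripped-down sketch of the log of (4.3) (Python for illustration). For brevity the dependence factor is omitted (i.e. ρ = 0), and the rates λ_{h,k}, λ_{a,k} are supplied directly rather than built from team parameters, so this is an illustrative skeleton rather than the estimation code.

```python
import math

def weighted_loglik(matches, t_now, xi):
    """Down-weighted pseudo-log-likelihood, dependence factor omitted.
    Each match is a dict with keys t, x, y, lam_h, lam_a (assumed format)."""
    ll = 0.0
    for m in matches:
        if m["t"] >= t_now:
            continue  # A_t contains only matches played before time t
        w = math.exp(-xi * (t_now - m["t"]))  # the weight phi(t - t_k)
        for goals, lam in ((m["x"], m["lam_h"]), (m["y"], m["lam_a"])):
            # weighted log Poisson density: w * log(lam^goals * exp(-lam) / goals!)
            ll += w * (goals * math.log(lam) - lam - math.lgamma(goals + 1))
    return ll
```

With ξ = 0 every weight equals 1 and the static likelihood is recovered; with ξ > 0 older matches contribute progressively less.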
In (4.3) we introduced the index k which, as stated before, denotes the k-th match of our dataset. This notation does not mean that every match is played at a different time: as we know, in a normal football season several matches can be played on the same day. Hence we can have matches that are played at the same time (concurrent matches, so that t_k = t_{k+1} = t_{k+2} = …, for example) as well as matches played at different times (so that t_k = t_{k+1} = t_{k+2} = … < t_{k+11} = t_{k+12} = …).
Maximizing equation (4.3) at time t leads to parameter estimates based only on games played up to time t. In this way, the model can reflect changes in a team's performance. Consequently the choice of ϕ is a crucial aspect: every variation of ϕ allows historical data to be downweighted in the likelihood to a greater or lesser degree.
Dixon and Coles (1997) found that their model works well with a
weighting function ϕ given by:
    ϕ(t) = exp(−ξ t)

with ξ > 0. The static model arises as the special case ξ = 0, whereas taking
increasingly large values of ξ gives relatively more weight to the most recent results. Optimizing the choice of ξ is problematic, however, since equation (4.3) defines a temporal sequence of non-independent likelihoods, whereas we require the ξ that maximizes the overall predictive capability of the model. It is therefore pragmatic to choose ξ so as to optimize the prediction of outcomes. This is also why (4.3) is called a "pseudo-likelihood": it is not a proper likelihood. For a fixed value of ξ it makes sense as an estimating equation for the team parameters, but maximizing it over ξ is not meaningful, because a normalizing factor that depends on ξ has been omitted. That is why Dixon and Coles (1997) fitted the model (4.3) for a range of different values of ξ and calculated the log-score over the prediction period for each of these values, keeping the value of ξ that maximizes the log-score.
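The selection procedure amounts to a grid search. The sketch below (Python, illustrative) shows its shape; the `predict` argument is a placeholder for the full refit-and-forecast step, which in practice means maximizing (4.3) with the given ξ on the data prior to each match.

```python
import math

def choose_xi(xi_grid, predict, outcomes):
    """Return the xi in xi_grid maximizing the out-of-sample log-score.
    predict(xi, k) -> (pH, pA, pD) for match k; outcomes[k] in {0, 1, 2}
    indexes the realised result (home win, away win, draw)."""
    def log_score(xi):
        return sum(math.log(predict(xi, k)[outcomes[k]])
                   for k in range(len(outcomes)))
    return max(xi_grid, key=log_score)

# toy stand-in for the refit: one xi value yields sharper (and correct) forecasts
outcomes = [0, 2, 1, 0]
def predict(xi, k):
    p = 0.6 if xi == 0.005 else 0.4
    probs = [(1 - p) / 2] * 3
    probs[outcomes[k]] = p
    return probs

best = choose_xi([0.0, 0.005, 0.02], predict, outcomes)
```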
First, they determined the outcome probabilities from the maximization of the model (4.3) at t_k, the time of match k, as:

    p_k^H = Σ_{(l,m) ∈ B_H} Pr(X_k = l, Y_k = m)
    p_k^A = Σ_{(l,m) ∈ B_A} Pr(X_k = l, Y_k = m)
    p_k^D = Σ_{(l,m) ∈ B_D} Pr(X_k = l, Y_k = m)                            (4.4)
where p_k^H, p_k^A and p_k^D are respectively the probabilities of a home win, an away win and a draw in match k, and B_H = {(l, m) : l > m}, B_A = {(l, m) : l < m}, B_D = {(l, m) : l = m}. Now let ind.H_k be an indicator variable such that ind.H_k = 1 if the home team won the k-th match and ind.H_k = 0 otherwise (ind.A_k and ind.D_k are defined analogously for an away win and a draw). Then the overall log-score over all matches is given by:

    log L = Σ_{k=1}^{N} [ ind.H_k log p_k^H + ind.A_k log p_k^A + ind.D_k log p_k^D ]          (4.5)
Dixon and Coles (1997) found that the function (4.5) is maximized at ξ = 0.0065
and then, using this value, they maximized equation (4.3) in order to find a
complete set of parameter estimates at each time point t. In this way they gave a
profile of each team's changing performance in terms of defence and attack
abilities.
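Equation (4.5) itself is simple to evaluate; the sketch below (Python, with made-up forecast probabilities) computes it from outcome probabilities and observed results:

```python
import math

def log_score(probs, results):
    """Overall log-score (4.5). probs[k] = (pH, pA, pD); results[k] is
    'H', 'A' or 'D', playing the role of the indicators ind.H_k, ind.A_k, ind.D_k."""
    idx = {"H": 0, "A": 1, "D": 2}
    return sum(math.log(p[idx[r]]) for p, r in zip(probs, results))

# two matches with illustrative forecast probabilities
value = log_score([(0.5, 0.3, 0.2), (0.2, 0.5, 0.3)], ["H", "D"])
```

Because each term is the log of the probability assigned to the outcome that actually occurred, higher values mean sharper, better-calibrated forecasts.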
4.2. Application to our dataset
As anticipated, in this section we apply the Dixon-Coles model to our dataset. We will show that our conclusions are very close to what Dixon and Coles (1997) obtained from their analysis, despite the fact that our data have different features from theirs.
We first need to choose an appropriate value of the parameter ξ. Following the procedure illustrated in the previous section, we find that the log-score, applied to our data, is maximized at ξ = 0.0055, a value very close to the one Dixon and Coles (1997) obtained (0.0065).
Then, using ξ = 0.0055, we fit the model to almost the whole period of our