## Description of predictor

$\newcommand{\P}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\half}{\frac12} \newcommand{\norm}[1]{\lVert#1\rVert}$

Work in progress…

Each league is treated separately, apart from when the parameters for the priors are worked out. (Promotion and relegation means that there is a dependency of different leagues on each other from one season to the next.)

Sticking with one league, each team in a league is given one attack and one defence parameter, and there is an overall “home advantage” parameter and an “away disadvantage” parameter. If the league has $n$ teams, then the parameters are denoted:

• Attack: $a_1,\ldots,a_n$,
• Defence: $d_1,\ldots,d_n$,
• Home advantage: $h$,
• Away disadvantage: $a$.

Goals scored are assumed to follow a Poisson process: each team is taken to have a certain scoring rate constant over the game. If team $i$ plays team $j$ ($i$, being the first-named team, is taken to be playing at home) then the Poisson parameters are taken to be
\begin{split}
\mu_{ij}&=e^{a_j-d_i-a},\label{eq:ppmodel}
\end{split}
the scoring rates of the home and away teams, respectively.
(Digression: choice of exponential function…)

There are two redundant degrees of freedom since shifting all $a_i$ or all $d_i$ by a constant can be absorbed into $h$ and $a$, leaving $\lambda_{ij}$ and $\mu_{ij}$ unchanged. (Taking this into account is not the same as doing away with $h$ and $a$ altogether: $h-a$ is arbitrary, but $h+a$ is a genuine degree of freedom reflecting how much overall advantage lies with the home team.) Thus there are $2n$ degrees of freedom overall and the symmetry group, $G$, is generated by the two sorts of shifts:
\begin{align*}
\end{align*}This symmetry is a bit of a nuisance as it makes it harder to make a meaningful comparison between teams from different leagues, which is necessary in calculating the prior distribution on the parameters. In practice we fix the choice by imposing $\sum_i a_i=\sum_i d_i=0$, but we have to correctly divide out by the action of G first in calculating the prior (see below).

### Prior term in likelihood

For the purposes of representing the prior, the parameters are treated uniformly and given the alternative names $(x_i)_{i=1}^{2n+2}$, where

\vec x=(x_1,\ldots,x_{2n+2})=(a_1,\ldots,a_n,d_1,\ldots,d_n,h,a)\label{eq:alias}.
Means $\mu_1,\ldots,\mu_{2n+2}$ and inverse variances $i_1,\ldots,i_{2n+2}$ are calculated from the league results of previous years (see below). $\vec x$ is then assumed to have the $2n$-dimensional distribution of $\vec\xi=(\xi_1,\ldots,\xi_{2n+2})$ integrated over the action of $G$ on $\vec \xi$, where $\xi_i$ are distributed as independent normals: $\xi_i\sim N(\mu_i,i_i^{-1})$. The $2n$-dimensional density of $(x_1,\ldots,x_{2n+2})$ is then proportional to
$$\int_{-\infty}^\infty d\lambda_1 \int_{-\infty}^\infty d\lambda_2 \exp\left(-\half\sum_{i=1}^{2n+2} i_i(x_i+\lambda_1g^1_i+\lambda_2g^2_i-\mu_i)^2\right),$$
where $\vec g^1=(1,\ldots,1,0,\ldots,0,-1,1)$ and $\vec g^2=(0,\ldots,0,1,\ldots,1,1,-1)$ generate $G$. This only makes sense as a probability distribution when restricted to a $2n$-dimensional surface, our current choice being $W=\{\vec x\vert\sum_i a_i=\sum_i d_i=0\}$.

It is not necessary to actually do any integration here: instead you can use standard properties of Gaussian measures. Consider the inner product on $V=\R^{2n+2}$ given by $\vec x.\vec y=\sum_i i_i x_i y_i$, with corresponding norm $\norm{.}$ and volume form $\omega_V=(2\pi)^{-(n+1)}(i_1\ldots i_{2n+2})^{\half}dx_1\ldots dx_{2n+2}$. The unintegrated prior density as a function of $\vec\xi\in V$ is proportional to $\exp(-\half\norm{\vec\xi-\vec\mu}^2)\omega_V(\vec\xi)$, and the integrated density as a function of $\vec x\in V$ is proportional to $\exp(-\half\norm{\pi_{G^\bot}(\vec x-\vec\mu)}^2)\omega_{G^\bot}(\vec x)$.

Here, $G^\bot$ is the subspace orthogonal to $<\vec g^1,\vec g^2>$ with respect to the above inner product and $\pi_{G^\bot}:V\to G^\bot$ is the corresponding orthogonal projection. As before, we restrict to $W$ to get a probability distribution.

In practice it’s convenient to calculate $\norm{\pi_{G^\bot}(\vec v)}^2$ as $\norm{\vec v}^2-\norm{\pi_G(\vec v)}^2$, since $G$ is only two-dimensional. Concretely, this amounts to the prior component of the log-likelihood being given by
\begin{split}
L_\mathrm{pr}(\vec x; \vec\mu, \vec i)&=\half\sum_{i=1}^{2n+2}i_i(x_i-\mu_i)^2\\
&-\half
\begin{matrix}\begin{pmatrix}w_1 & w_2\end{pmatrix}\\\mbox{}\end{matrix}
\begin{pmatrix} A&B\\B&C\end{pmatrix}^{-1}
\begin{pmatrix}w_1\\w_2\end{pmatrix},
\label{eq:Lpr1}\end{split}
where
\begin{split}
A&=\left(\sum_{i=1}^n i_i\right)+i_{2n+1}+i_{2n+2}\\
B&=i_{2n+1}+i_{2n+2}\\
C&=\left(\sum_{i=n+1}^{2n}i_i\right)+i_{2n+1}+i_{2n+2}\\
w_1&=\left(\sum_{i=1}^n i_i(x_i-\mu_i)\right)-i_{2n+1}(x_{2n+1}-\mu_{2n+1})+i_{2n+2}(x_{2n+2}-\mu_{2n+2})\\
w_2&=\left(\sum_{i=n+1}^{2n}i_i(x_i-\mu_i)\right)+i_{2n+1}(x_{2n+1}-\mu_{2n+1})-i_{2n+2}(x_{2n+2}-\mu_{2n+2})
\label{eq:Lpr2}\end{split}

(Digression: why is $\omega_{G^\bot}$ the correct form? …it’s the one that normalises the set of MCMC trial functions… Expand.)

### Evidence term in likelihood

The games that have been played in the league to date can be described by a set, $P$, of quadruples $(t,t’,g,g’)$. $t$ is the home team, $g$ the number of goals scored by the home team, and $t’$, $g’$ the corresponding values for the away team.

The evidence component of the log-likelihood is then given by
$$L_\mathrm{ev}(\vec x;P)=\sum_{(t,t’,g,g’)\in P}\log\left(\frac{e^{-\lambda_{tt’}}\lambda_{tt’}^g}{g!}\frac{e^{-\mu_{tt’}}\mu_{tt’}^{g’}}{g’!}\right).\label{eq:Lev}$$

### MCMC update step

This is a standard Metropolis-Hastings update.

The total log likelihood is given by
L_\mathrm{tot}(\vec x) = L_\mathrm{tot}(\vec x; \vec\mu, \vec i, P) = L_\mathrm{pr}(\vec x; \vec\mu, \vec i) + L_\mathrm{ev}(\vec x; P)

The new trial parameters are given by
\begin{equation*}
\vec x’ = \vec x + \pi_W\vec\Delta,
\end{equation*}
where $\vec\Delta$ is a vector of independent $N(0,\sigma^2)$s and $\pi_W$ is the orthogonal projection $V\to W$, this time with respect to the standard Euclidean norm. In components, $\Delta_i$ are IID $N(0,\sigma^2)$ and
\begin{equation*}
x’_i=\begin{cases}x_i+\Delta_i-\frac{1}{n}\sum_{j=1}^n \Delta_j &\text{if }\; 1\le i\le n\\
x_i+\Delta_i-\frac{1}{n}\sum_{j=n+1}^{2n} \Delta_j &\text{if }\; n+1\le i\le 2n\\
x_i+\Delta_i&\text{if }\; i=2n+1,2n,
\end{cases}
\end{equation*}
and the new value $\vec x’$ is accepted with probability
$$\min(\exp(L_\mathrm{tot}(\vec x’)-L_\mathrm{tot}(\vec x)), 1).$$

$\sigma$ is chosen to maximise $\sigma^2\P(\text{accept})$, which is a measure of progress through the parameter space. Empirically, this is achieved at $\sigma\approx0.04$.

This update step is repeated a certain fixed number of times (default 10) per reuse of the parameters.
The number of repetitions used is not enough to achieve independence from one set of parameters to the next set, but this does not matter because the only statistics gathered are functions of the current set of parameters only. In other words we are not trying to gather correlations between one set and another. The number of reuse steps is typically in the tens of thousands, which should be enough to achieve a reasonable coverage of the parameter space.