Big Data/Analytics Zone is brought to you in partnership with:

Arthur Charpentier, ENSAE, PhD in Mathematics (KU Leuven), Fellow of the French Institute of Actuaries, professor at UQàM in Actuarial Science. Former professor-assistant at ENSAE Paritech, associate professor at Ecole Polytechnique and professor assistant in economics at Université de Rennes 1. Arthur is a DZone MVB and is not an employee of DZone and has posted 158 posts at DZone. You can read more from them at their website. View Full User Profile

Overdispersion with Different Exposures

02.05.2013
| 1569 views |
  • submit to reddit

In actuarial science, and insurance ratemaking, taking into account the exposure can be a nightmare (in datasets, some clients have been here for a few years – we call that exposure – while others have been here for a few months, or weeks). Somehow, simple results because more complicated to compute just because we have to take into account the fact that exposure is an heterogeneous variable.

The exposure in insurance ratemaking can be seen as a problem of censored data (in my dataset, the exposure is always smaller than 1 since observations are contracts, not policyholders),

  • the number of claims http://latex.codecogs.com/gif.latex?N_i on the period http://latex.codecogs.com/gif.latex?[0,1]
  • the number of claims http://latex.codecogs.com/gif.latex?Y_i on http://latex.codecogs.com/gif.latex?[0,E_i])

And as always, the variable of interest is the unobserved one, because we have to price insurance contract with a cover period of one (full) year. So we have to model the yearly frequency of insurance claims.

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-01-a%CC%80-09.30.00.png

In our dataset, we have http://latex.codecogs.com/gif.latex?(Y_i,E_i)‘s – or more generally also some additional covariates http://latex.codecogs.com/gif.latex?(Y_i,E_i,\boldsymbol{X}_i)‘s. For ratemaking, we need to estimate http://latex.codecogs.com/gif.latex?\mathbb{E}(N\vert\boldsymbol{X}=\boldsymbol{x}) and perhaps also http://latex.codecogs.com/gif.latex?\text{Var}(N|\boldsymbol{X}=\boldsymbol{x}) (for instance to test if the Poisson assumption is valid, or not). To estimate the expected value, a natural estimate for http://latex.codecogs.com/gif.latex?\mathbb{E}(N) (forget about covariates as a start) is
http://latex.codecogs.com/gif.latex?m_N=\frac{\sum_{i=1}^n%20Y_i}{\sum_{i=1}^n%20E_i}
which is also the weight average of annualized individual counts
http://latex.codecogs.com/gif.latex?m_N=\sum_{i=1}^n%20\frac{%20E_i}{\sum_{i=1}^n%20E_i}%20\cdot%20\frac{Y_i}{E_i}
We consider the ratio of the total number of claims to the total exposure-to-
risk. This estimate appears for instance if we consider a Poisson process, so that http://latex.codecogs.com/gif.latex?N\sim\mathcal{P}(\lambda) while http://latex.codecogs.com/gif.latex?Y\sim\mathcal{P}(\lambda%20\cdot%20E). Then, the likelihood is

http://latex.codecogs.com/gif.latex?\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})=\prod_{i=1}^n%20\frac{e^{-\lambda%20E_i}%20[\lambda%20E_i]^{Y_i}}{Y_i!}

i.e.

http://latex.codecogs.com/gif.latex?\log%20\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})%20=%20-\lambda%20\sum_{i=1}^n%20E_i%20+\sum_{i=1}^n%20Y_i%20\log[\lambda%20E_i]%20-%20\log\left(\prod_{i=1}^n%20Y_i!\right)

The first order condition is here

http://latex.codecogs.com/gif.latex?\frac{\partial}{\partial%20\lambda}\log%20\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})%20=%20%20-%20\sum_{i=1}^n%20E_i%20+\frac{1}{\lambda}\sum_{i=1}^n%20Y_i%20=0

which is satisfied if

http://latex.codecogs.com/gif.latex?\widehat{\lambda}=\frac{\sum_{i=1}^n%20Y_i}{\sum_{i=1}^n%20E_i}

So, we do have an estimator for the expected value, and a natural estimator for http://latex.codecogs.com/gif.latex?\mathbb{E}(N\vert\boldsymbol{X}=\boldsymbol{x}) is then (if we consider categorical covariates)
http://latex.codecogs.com/gif.latex?m_{N|\boldsymbol{x}}%20=\frac{\sum_{i,\boldsymbol{X}_i=\boldsymbol{x}}%20Y_i}{\sum_%20{i,\boldsymbol{X}_i=\boldsymbol{x}}%20E_i}

Now, we need an estimate for the variance, or more precisely the conditional variable. Assume (as a starting point) that all have the same exposure http://latex.codecogs.com/gif.latex?E. For instance, if http://latex.codecogs.com/gif.latex?E is one half, insured were observed only the first six months. Then http://latex.codecogs.com/gif.latex?N=Y+Y%27 with http://latex.codecogs.com/gif.latex?Y\overset{\mathcal%20L}{=}Y%27 (http://latex.codecogs.com/gif.latex?Y is the number of claims on the first six months, while http://latex.codecogs.com/gif.latex?Y%27 are the number of claims on the last six months), i.e. http://latex.codecogs.com/gif.latex?\text{Var}(N)=\text{Var}(Y)+%20\text{Var}(Y%27) if we assume independent increments. I.e.
http://latex.codecogs.com/gif.latex?\text{Var}(N)=2\text{Var}(Y), or conversely http://latex.codecogs.com/gif.latex?E%20\cdot\text{Var}(N)=\text{Var}(Y). More generally, it is reasonable to assume that

http://latex.codecogs.com/gif.latex?\text{Var}(Y)=E\cdot%20\text{Var}(N)
for all values of http://latex.codecogs.com/gif.latex?E. And then
http://latex.codecogs.com/gif.latex?\text{Var}\left(\frac{Y}{E}\right)=\frac{1}{E}\cdot%20\text{Var}(N)
Thus, it seems legitimate to assume that the empirical variance of http://latex.codecogs.com/gif.latex?N can be written
http://latex.codecogs.com/gif.latex?S_N^2=E\cdot%20S_{Y/E}^2
Since the average of http://latex.codecogs.com/gif.latex?Y_i/E is http://latex.codecogs.com/gif.latex?\overline{N}=m_N, then
http://latex.codecogs.com/gif.latex?S_N^2=E\cdot%20\frac{1}{n}\sum_{i=1}^n%20\left[\frac{Y_i}{E}-\overline{N}\right]^2}%20=%20\frac{1}{n}\sum_{i=1}^n%20E\left[\frac{Y_i}{E}-\overline{N}\right]^2}or equivalently
http://latex.codecogs.com/gif.latex?S_N^2=\frac{1}{n}\sum_{i=1}^n%20\frac{E}{E^2}\left[Y_i-\overline{N}\cdot%20E\right]^2}%20=\frac{1}{n}\sum_{i=1}^n%20\frac{1}{E}[Y_i-\overline{N}\cdot%20E]^2http://latex.codecogs.com/gif.latex?S_N^2=\frac{\sum_{i=1}^n%20[Y_i-\overline{N}\cdot%20E]^2%20}{nE}Thus, with different http://latex.codecogs.com/gif.latex?E_i‘s, it would be legitimate (I guess) to consider
http://latex.codecogs.com/gif.latex?S_N^2=\frac{\sum_{i=1}^n%20[Y_i-\overline{N}\cdot%20E_i]^2%20}{\sum_{i=1}^n%20E_i}Thus, an estimator forhttp://latex.codecogs.com/gif.latex?\text{Var}(N|\boldsymbol{X}=\boldsymbol{x})is
http://latex.codecogs.com/gif.latex?S_{N|\boldsymbol{x}}^2=\frac{\sum_{i,\boldsymbol{X}_i=\boldsymbol{x}}%20[Y_i-\overline{N}\cdot%20E_i]^2}{\sum_{i,\boldsymbol{X}_i=\boldsymbol{x}%20}%20E_i}

This can be used to test is the Poisson assumption is valid to model frequency. Consider the following dataset,

>  sinistre=read.table("http://freakonometrics.free.fr/sinistreACT2040.txt",
+  header=TRUE,sep=";")
>  sinistres=sinistre[sinistre$garantie=="1RC",]
>  sinistres=sinistres[sinistres$cout>0,]
>  contrat=read.table("http://freakonometrics.free.fr/contractACT2040.txt",
+  header=TRUE,sep=";")
>  T=table(sinistres$nocontrat)
>  T1=as.numeric(names(T))
>  T2=as.numeric(T)
>  nombre1 = data.frame(nocontrat=T1,nbre=T2)
>  I = contrat$nocontrat%in%T1
>  T1= contrat$nocontrat[I==FALSE]
>  nombre2 = data.frame(nocontrat=T1,nbre=0)
>  nombre=rbind(nombre1,nombre2)
>  baseFREQ = merge(contrat,nombre)

Here, we do have our two variables of interest, the exposure, per contract,

>  E <- baseFREQ$exposition

and the (observed) number of claims (during that time frame)

>  Y <- baseFREQ$nbre

It is possible to compute without covariates, the average (yearly) number of claims, per contract, and the associated variance

> (mean=weighted.mean(Y/E,E))
[1] 0.07279295
> (variance=sum((Y-mean*E)^2)/sum(E)) 
[1] 0.08778567

It looks like the variance is (slightly) larger than the average (we’ll see in a few weeks how to test it, more formally). It is possible to add covariates, for instance the density of population, in the area where the policyholder lives,

>  X=as.factor(baseFREQ$densite)
>  for(i in 1:length(levels(X))){
+ 	   Ei=E[X==levels(X)[i]]
+ 	   Yi=Y[X==levels(X)[i]]
+  (meani=weighted.mean(Yi/Ei,Ei))    # moyenne 
+  (variancei=sum((Yi-meani*Ei)^2)/sum(Ei))    # variance
+ cat("Density, zone",levels(X)[i],"average =",meani," variance =",variancei,"\n")
+ }
Density, zone 11 average = 0.07962411  variance = 0.08711477 
Density, zone 21 average = 0.05294927  variance = 0.07378567 
Density, zone 22 average = 0.09330982  variance = 0.09582698 
Density, zone 23 average = 0.06918033  variance = 0.07641805 
Density, zone 24 average = 0.06004009  variance = 0.06293811 
Density, zone 25 average = 0.06577788  variance = 0.06726093 
Density, zone 26 average = 0.0688496   variance = 0.07126078 
Density, zone 31 average = 0.07725273  variance = 0.09067 
Density, zone 41 average = 0.03649222  variance = 0.03914317 
Density, zone 42 average = 0.08333333  variance = 0.1004027 
Density, zone 43 average = 0.07304602  variance = 0.07209618 
Density, zone 52 average = 0.06893741  variance = 0.07178091 
Density, zone 53 average = 0.07725661  variance = 0.07811935 
Density, zone 54 average = 0.07816105  variance = 0.08947993 
Density, zone 72 average = 0.08579731  variance = 0.09693305 
Density, zone 73 average = 0.04943033  variance = 0.04835521 
Density, zone 74 average = 0.1188611   variance = 0.1221675 
Density, zone 82 average = 0.09345635  variance = 0.09917425 
Density, zone 83 average = 0.04299708  variance = 0.05259835 
Density, zone 91 average = 0.07468126  variance = 0.3045718 
Density, zone 93 average = 0.08197912  variance = 0.09350102 
Density, zone 94 average = 0.03140971  variance = 0.04672329

Perhaps graphs would be a nice tool to play with, to visualize that information

> plot(meani,variancei,cex=sqrt(Ei),col="grey",pch=19,
+ xlab="Empirical average",ylab="Empirical variance")
> points(meani,variancei,cex=sqrt(Ei))

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-01-a%CC%80-10.51.26.png

The size of the circles is related to the size of the group (the area is proportional to the total exposure within the group). The first diagonal corresponds to the Poisson model, i.e. the variance should be equal to the mean. It is also possible to consider other covariates, like the gas type

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-01-a%CC%80-10.52.02.png

or the car brand,

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-01-a%CC%80-10.50.49.png

It is also possible to consider the age of the driver as a categorical variate

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-01-a%CC%80-10.51.40.png

Actually, the age is interesting: we can observe on that dataset a feature that Jean-Philippe Boucher observed also on his own datasets. Let us look more carefully where are the different ages,

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-01-a%CC%80-10.55.17.png

On the right, we can observe young (unexperienced) drivers. That was expected. But some classes are belowthe first diagonal: the expected frequency is large, but not the variance. I.e. we know for sure that young drivers have more car accidents. It is not an heterogeneous class, on the contrary: young drivers can be seen as a relatively homogeneous class, with a high frequency of car accidents.

With the original dataset (here, I use only a subset with 50,000 clients), we do obtain the following graph:

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-01-a%CC%80-11.27.04.png

If we do not observe underdispersion for young drivers, observe that those are incredibly homogeneous classes. With a clear impact of experience, since circles are moving downward from age 18 to 25.

Another disturbing story (this was – one more time – suggestion from Jean-Philippe) that it might be possible to consider the exposure as a standard variable, and see if the coefficient is actually equal to 1. Without any covariate,

>  reg=glm(Y~log(E),family=poisson("log"))
>  summary(reg)

Call:
glm(formula = Y ~ log(E), family = poisson("log"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3988  -0.3388  -0.2786  -0.1981  12.9036  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.83045    0.02822 -100.31   <2e-16 ***
log(E)       0.53950    0.02905   18.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12931  on 49999  degrees of freedom
Residual deviance: 12475  on 49998  degrees of freedom
AIC: 16150

Number of Fisher Scoring iterations: 6

i.e. the parameter is clearly strictly smaller than 1. And it is neither related to significance,

> library(car)
> linearHypothesis(reg,"log(E)",1)
Linear hypothesis test

Hypothesis:
log(E) = 1

Model 1: restricted model
Model 2: Y ~ log(E)

  Res.Df Df  Chisq Pr(>Chisq)    
1  49999                         
2  49998  1 251.19  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

nor to the fact that I did not take into account covariates,

> reg=glm(nbre~log(exposition)+carburant+as.factor(ageconducteur)+as.factor(densite),family=poisson("log"),data=baseFREQ)
>  summary(reg)

Call:
glm(formula = nbre ~ log(exposition) + carburant + as.factor(ageconducteur) + 
    as.factor(densite), family = poisson("log"), data = baseFREQ)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7114  -0.3200  -0.2637  -0.1896  12.7104  

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -14.07321  181.04892  -0.078 0.938042    
log(exposition)                0.56781    0.03029  18.744  < 2e-16 ***
carburantE                    -0.17979    0.04630  -3.883 0.000103 ***
as.factor(ageconducteur)19    12.18354  181.04915   0.067 0.946348    
as.factor(ageconducteur)20    12.48752  181.04902   0.069 0.945011

(etc). So it might be a too strong assumption to assume that the exposure is an exogenous variate here. But that’s another story !

Published at DZone with permission of Arthur Charpentier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)