Big Data/Analytics Zone is brought to you in partnership with:

Arthur Charpentier, ENSAE, PhD in Mathematics (KU Leuven), Fellow of the French Institute of Actuaries, professor at UQàM in Actuarial Science. Former professor-assistant at ENSAE Paritech, associate professor at Ecole Polytechnique and professor assistant in economics at Université de Rennes 1. Arthur is a DZone MVB and is not an employee of DZone and has posted 158 posts at DZone. You can read more from them at their website. View Full User Profile

Modeling Individual Losses with Mixtures

02.23.2013
| 1606 views |
  • submit to reddit

Usually, the sentence that I keep saying in my regression classes is “please, look at your data“. In our previous post, we’ve been playing like most econometricians: we did not look at the data. Actually, if we look at the distribution of individual losses, in the dataset, we see the following,

> n=nrow(couts)
> plot(sort(couts$cout),(1:n)/(n+1),xlim=c(0,10000),type="s",lwd=2,col="green")

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.10.26.png

It looks like there are fixed costs claims in our database. How do we deal with it in the standard case (e.g. in Loss Models textbook) ? We can use a mixture of – at least – three distributions here,

with

  • a distribution for small claims, http://latex.codecogs.com/gif.latex?{\color{Blue}%20f_1(}\cdot{\color{Blue}%20)}, e.g. an exponential distribution
  • a Dirac mass in http://latex.codecogs.com/gif.latex?{\color{Magenta}%20\kappa}, i.e. http://latex.codecogs.com/gif.latex?{\color{Magenta}%20\delta_{\kappa}(}\cdot{\color{Magenta}%20)}
  • a distribution for larger claims, http://latex.codecogs.com/gif.latex?{\color{Red}%20f_3(}\cdot{\color{Red}%20)}, e.g. a Gamma, or a lognormal, distribution
>  I1=which(couts$cout<1120)
>  I2=which((couts$cout>=1120)&(couts$cout<1220))
>  I3=which(couts$cout>=1220)
>  (p1=length(I1)/nrow(couts))
[1] 0.3284823
>  (p2=length(I2)/nrow(couts))
[1] 0.4152807
>  (p3=length(I3)/nrow(couts))
[1] 0.256237
>  X=couts$cout
>  (kappa=mean(X[I2]))
[1] 1171.998
>  X0=X[I3]-kappa
>  u=seq(0,10000,by=20)
>  F1=pexp(u,1/mean(X[I1]))
>  F2= (u>kappa)
>  F3=plnorm(u-kappa,mean(log(X0)),sd(log(X0))) * (u>kappa)
>  F=F1*p1+F2*p2+F3*p3
>  lines(u,F)

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.13.43.png

In our previous post, we’ve discussed the idea that all parameters might be related to some covariates, i.e.

http://latex.codecogs.com/gif.latex?f(y|\boldsymbol{X})%20=%20p_1(\boldsymbol{X})%20{\color{Blue}%20f_1(}y|\boldsymbol{X}{\color{Blue}%20)}%20+%20p_2(\boldsymbol{X})%20{\color{Magenta}%20\delta_{\kappa}(}y{\color{Magenta}%20)}%20+%20p_3(\boldsymbol{X})%20{\color{Red}%20f_3(}y|\boldsymbol{X}{\color{Red}%20)}

which yield the following premium model,

http://latex.codecogs.com/gif.latex?\mathbb{E}(Y|\boldsymbol{X})%20=%20{\color{Blue}%20{\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq%20s_1)}_{A}%20\cdot%20{\underbrace{\mathbb{P}(Y\leq%20s_1|\boldsymbol{X})}_{D}}}}\\+{\color{Purple}%20{{\underbrace{\mathbb{E}(Y|Y\in(%20s_1,s_2],%20\boldsymbol{X})%20}_{B}}\cdot%20{\underbrace{\mathbb{P}(Y\in(%20s_1,s_2]|%20\boldsymbol{X})}_{D}}}}\\+{\color{Red}%20{{\underbrace{\mathbb{E}(Y|Y%3E%20s_2,%20\boldsymbol{X})%20}_{C}}\cdot%20{\underbrace{\mathbb{P}(Y%3E%20s_2|%20\boldsymbol{X})}_{D}}}}

For the http://latex.codecogs.com/gif.latex?{\color{Blue}%20A}http://latex.codecogs.com/gif.latex?{\color{Magenta}%20B} and http://latex.codecogs.com/gif.latex?{\color{Red}%20C} terms, that’s easy, we can use standard models we’ve seen in the course. For the probability, we should use a multinomial model. Recall that for the logistic regression model, if http://latex.codecogs.com/gif.latex?(\pi,1-\pi)=(\pi_1,\pi_2), then

http://latex.codecogs.com/gif.latex?\log%20\frac{\pi}{1-\pi}=\log%20\frac{\pi_1}{\pi_2}%20=\boldsymbol{X}%27\boldsymbol{\beta}

i.e.

http://latex.codecogs.com/gif.latex?\pi_1%20=%20\frac{\exp(\boldsymbol{X}%27\boldsymbol{\beta})}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta})}

and

http://latex.codecogs.com/gif.latex?\pi_2%20=%20\frac{1}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta})}

To derive a multivariate extension, write

http://latex.codecogs.com/gif.latex?\pi_1%20=%20\frac{\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}

http://latex.codecogs.com/gif.latex?\pi_2%20=%20\frac{\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}

and

http://latex.codecogs.com/gif.latex?\pi_3%20=%20\frac{1}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}

Again, maximum likelihood techniques can be used, since

http://latex.codecogs.com/gif.latex?\mathcal{L}(\boldsymbol{\pi},\boldsymbol{y})\propto%20\prod_{i=1}^n%20\prod_{j=1}^3%20\pi_{i,j}^{Y_{i,j}}

where here, variable http://latex.codecogs.com/gif.latex?Y_{i}  – which take three levels – is splitted in three indicators (like any categorical explanatory variables in standard regression model). Thus,

http://latex.codecogs.com/gif.latex?\log%20\mathcal{L}(\boldsymbol{\beta},\boldsymbol{y})\propto%20\sum_{i=1}^n%20\sum_{j=1}^2%20\left(Y_{i,j}%20\boldsymbol{X}_i%27\boldsymbol{\beta}_j\right)%20-%20n_i\log\left[1+1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)\right]

and, as for the logistic regression, then use Newton Raphson’ algorithm to compute numerically the maximum likelihood. In R, first we have to define the levels, e.g.

> seuils=c(0,1120,1220,1e+12)
> couts$tranches=cut(couts$cout,breaks=seuils,
+ labels=c("small","fixed","large"))
> head(couts,5)
  nocontrat    no garantie    cout exposition zone puissance agevehicule
1      1870 17219      1RC 1692.29       0.11    C         5           0
2      1963 16336      1RC  422.05       0.10    E         9           0
3      4263 17089      1RC  549.21       0.65    C        10           7
4      5181 17801      1RC  191.15       0.57    D         5           2
5      6375 17485      1RC 2031.77       0.47    B         7           4
  ageconducteur bonus marque carburant densite region tranches
1            52    50     12         E      73     13    large
2            78    50     12         E      72     13    small
3            27    76     12         D      52      5    small
4            26   100     12         D      83      0    small
5            46    50      6         E      11     13    large

Then, we can run a multinomial regression, from

> library(nnet)

using some selected covariates

> reg=multinom(tranches~ageconducteur+agevehicule+zone+carburant,data=couts)
# weights:  30 (18 variable)
initial  value 2113.730043 
iter  10 value 2063.326526
iter  20 value 2059.206691
final  value 2059.134802 
converged

The output is here

> summary(reg)
Call:
multinom(formula = tranches ~ ageconducteur + agevehicule + zone + 
    carburant, data = couts)

Coefficients:
      (Intercept) ageconducteur agevehicule      zoneB      zoneC
fixed  -0.2779176   0.012071029  0.01768260 0.05567183 -0.2126045
large  -0.7029836   0.008581459 -0.01426202 0.07608382  0.1007513
           zoneD      zoneE      zoneF   carburantE
fixed -0.1548064 -0.2000597 -0.8441011 -0.009224715
large  0.3434686  0.1803350 -0.1969320  0.039414682

Std. Errors:
      (Intercept) ageconducteur agevehicule     zoneB     zoneC     zoneD
fixed   0.2371936   0.003738456  0.01013892 0.2259144 0.1776762 0.1838344
large   0.2753840   0.004203217  0.01189342 0.2746457 0.2122819 0.2151504
          zoneE     zoneF carburantE
fixed 0.1830139 0.3377169  0.1106009
large 0.2160268 0.3624900  0.1243560

To visualize the impact of a covariate (one, only), one can use also spline functions

> library(splines)
> reg=multinom(tranches~agevehicule,data=couts)
# weights:  9 (4 variable)
initial  value 2113.730043 
final  value 2072.462863 
converged
> reg=multinom(tranches~bs(agevehicule),data=couts)
# weights:  15 (8 variable)
initial  value 2113.730043 
iter  10 value 2070.496939
iter  20 value 2069.787720
iter  30 value 2069.659958
final  value 2069.479535 
converged

For instance, if the covariate is the age of the car, we do have the following probabilities

> predict(reg,newdata=data.frame(agevehicule=5),type="probs")
    small     fixed     large 
0.3388947 0.3869228 0.2741825

and for all ages from 0 to 20,

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.02.55.png

For instance, for new cars, the proportion of fixed costs is rather small (here in purple), and keeps increasing with the age of the car. If the covariate is the density of population in the area the driver lives, we do obtain the following probabilities

> reg=multinom(tranches~bs(densite),data=couts)
# weights:  15 (8 variable)
initial  value 2113.730043 
iter  10 value 2068.469825
final  value 2068.466349 
converged
> predict(reg,newdata=data.frame(densite=90),type="probs")
    small     fixed     large 
0.3484422 0.3473315 0.3042263

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.05.29.png

Based on those probabilities, it is then possible to derive the expected cost of a claims, given some covariates (e.g. the density). But first, define subsets of the whole dataset

> sbaseA=couts[couts$tranches=="small",]
> sbaseB=couts[couts$tranches=="fixed",]
> sbaseC=couts[couts$tranches=="large",]

with a threshold given by

> (k=mean(sousbaseB$cout))
[1] 1171.998

Then, let us run our four models,

> reg=multinom(tranches~bs(densite),data=couts)
> regA=glm(cout~bs(densite),data=sousbaseA,family=Gamma(link="log"))
> regB=glm(cout~1,data=sousbaseB,family=Gamma(link="log"))
> regC=glm((cout-k)~bs(densite),data=sousbaseC,family=Gamma(link="log"))

We can now compute predictions based on those models,

> nouveau=data.frame(densite=seq(10,100))
> proba=predict(reg,newdata=nouveau,type="probs")
> predA=predict(regA,newdata=nouveau,type="response")
> predB=predict(regB,newdata=nouveau,type="response")
> predC=predict(regC,newdata=nouveau,type="response")+k
> pred=cbind(predA,predB,predC)

To visualize the impact of each component on the premium, we can compute probabilities, are well as expected costs (given a cost in each subset),

> cbind(proba,pred)[seq(10,90,by=10),]
       small     fixed     large    predA    predB    predC
10 0.3344014 0.4241790 0.2414196 423.3746 1171.998 7135.904
20 0.3181240 0.4471869 0.2346892 428.2537 1171.998 6451.890
30 0.3076710 0.4626572 0.2296718 438.5509 1171.998 5499.030
40 0.3032872 0.4683247 0.2283881 451.4457 1171.998 4615.051
50 0.3052378 0.4620219 0.2327404 463.8545 1171.998 3961.994
60 0.3136136 0.4417057 0.2446807 472.3596 1171.998 3586.833
70 0.3279413 0.4056971 0.2663616 473.3719 1171.998 3513.601
80 0.3464842 0.3534126 0.3001032 463.5483 1171.998 3840.078
90 0.3652932 0.2868006 0.3479061 440.4925 1171.998 4912.379

Now, it is possible to plot those figures in a graph,

> barplot(t(proba*pred))
> abline(h=mean(couts$cout),lty=2)

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-15-a%CC%80-11.50.47.png

(the dotted horizontal line is the average cost of a claim, in our dataset).

Published at DZone with permission of Arthur Charpentier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Claude Lalyre replied on Mon, 2013/02/25 - 8:00am

I guess Matlab is your friend...



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.