# Regression on Categorical Variables

This morning, Stéphane asked me tricky question about extracting coefficients from a regression with categorical explanatory variates. More precisely, he asked me if it was possible to store the coefficients in a nice table, with information on the variable and the modality (those two information being in two different columns). Here is some code I did to produce the table he was looking for, but I guess that some (much) smarter techniques can be used (comments – see below – are open). Consider the following dataset

> base x sex hair 1 1 H Black 2 4 F Brown 3 6 F Black 4 6 H Black 5 10 H Brown 6 5 H Blonde

with two factors,

> levels(base$hair) [1] "Black" "Blonde" "Brown" > levels(base$sex) [1] "F" "H"

Let us run a (standard linear) regression,

> reg=lm(x~hair+sex,data=base)

which is here

> summary(reg) Call: lm(formula = x ~ hair + sex, data = base) Residuals: 1 2 3 4 5 6 -3.714e+00 -2.429e+00 2.429e+00 1.286e+00 2.429e+00 -2.220e-16 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.5714 3.4405 1.038 0.408 hairBlonde 0.2857 4.8655 0.059 0.959 hairBrown 2.8571 3.7688 0.758 0.528 sexH 1.1429 3.7688 0.303 0.790 Residual standard error: 4.071 on 2 degrees of freedom Multiple R-squared: 0.2352, Adjusted R-squared: -0.9121 F-statistic: 0.205 on 3 and 2 DF, p-value: 0.886

If we want to extract the names of the factors (assuming here that there are no *numbers *in the name of the factor), and the values of the associated modality, one can use

> VARIABLE=c("",gsub("[-^0-9]", "", names(unlist(reg$xlevels)))) > MODALITY=c("",as.character(unlist(reg$xlevels))) > names=data.frame(VARIABLE,MODALITY,NOMVAR=c( + "(Intercept)",paste(VARIABLE,MODALITE,sep="")[-1])) > regression=data.frame(NOMVAR=names(coefficients(reg)), + COEF=as.numeric(coefficients(reg))) > merge(names,regression,all.x=TRUE) NOMVAR VARIABLE MODALITE COEF 1 (Intercept) 3.5714286 2 hairBlack hair Black NA 3 hairBlonde hair Blonde 0.2857143 4 hairBrown hair Brown 2.8571429 5 sexF sex F NA 6 sexH sex H 1.1428571

or, if we want modalities exluding references,

> merge(names,regression) NOMVAR VARIABLE MODALITE COEF 1 (Intercept) 3.5714286 2 hairBlonde hair Blonde 0.2857143 3 hairBrown hair Brown 2.8571429 4 sexH sex H 1.1428571

In order to reproduce the table Stéphane sent me, let us use the following code to produce an html table,

> library(xtable) > htlmtable <- xtable(merge(names,regression)) > print(htlmtable,type="html")

NOMVAR | VARIABLE | MODALITY | COEF | |
---|---|---|---|---|

1 | (Intercept) | 3.57 | ||

2 | hairBlonde | hair | Blonde | 0.29 |

3 | hairBrown | hair | Brown | 2.86 |

4 | sexH | sex | H | 1.14 |

So yes, it is possible to build a table with the variable, modalities, and coefficients. This function can be interesting on prospective mortality, when we do have a large number of modalities per factor (years, ages and year of birth). Consider the following datasets

> DEATH=read.table( + "http://freakonometrics.free.fr/DeathsSwitzerland.txt", + header=TRUE,skip=2) > EXPOSURE=read.table( + "http://freakonometrics.free.fr/ExposuresSwitzerland.txt", + header=TRUE,skip=2) > DEATH$Age=as.numeric(as.character(DEATH$Age)) > DEATH=DEATH[-which(is.na(DEATH$Age)),] > EXPOSURE$Age=as.numeric(as.character(EXPOSURE$Age)) > EXPOSURE=EXPOSURE[-which(is.na(EXPOSURE$Age)),] > base=data.frame(y=as.factor(DEATH$Year),a=as.factor(DEATH$Age), + c=as.factor(DEATH$Year-DEATH$Age),D=DEATH$Total,E= EXPOSURE$Total) > base=base[base$E>0,]

and the following nonlinear model, based on Lee-Carter model (including a cohort effect),

can be estimated using

> library(gnm) > reg=gnm(D~a+Mult(a,y)+Mult(a,c),offset=log(E),family=poisson,data=base)

In order to extract the 671 coefficients from the regresssion,

> length(coefficients(reg)) [1] 671

(as properly as possible) we have to be careful: names of coefficients are not that simple to handle. For instance, we can see things like

> coefficients(reg)[200] Mult(., year).age98 0.04203519

In order to extract them, define

> na=length((reg$xlevels)$age) > ny=length((reg$xlevels)$year) > nc=length((reg$xlevels)$cohort) > VARIABLElong=c("",rep("age",na),rep("Mult(., year).age",na), + rep("Mult(a, .).y",ny), + rep("Mult(., cohort).age",na),rep("Mult(age, .).cohort",nc)) > VARIABLEshort=c("",rep("age",na),rep("age",na),rep("year",ny), + rep("age",na),rep("cohort",nc)) > MODALITY=c("",(reg$xlevels)$age,(reg$xlevels)$age, + (reg$xlevels)$year,(reg$xlevels)$age,(reg$xlevels)$cohort) > names=data.frame(VARIABLElong,VARIABLEshort, + MODALITY,NOMVAR=c("(Intercept)",paste(VARIABLElong,MODALITY,sep="")[-1])) > regression=data.frame(NOMVAR=names(coefficients(reg)), + COEF=as.numeric(coefficients(reg)))

Here we go, now we have the coefficients from the regression in a nice table,

> outputreg=merge(names,regression) > outputreg[1:10,] NOMVAR VARIABLElong VARIABLEshort MODALITY COEF 1 (Intercept) -8.22225458 2 age1 age age 1 -0.87495451 3 age10 age age 10 -1.67145704 4 age100 age age 100 4.91041650 5 age11 age age 11 -1.00186990 6 age12 age age 12 -1.05953497 7 age13 age age 13 -0.90952859 8 age14 age age 14 0.02880668 9 age15 age age 15 0.42830738 10 age16 age age 16 1.35961403

It is now possible to plot all the coefficients, as functions of the age, the year of observation, or the year of birth. For instance, for the *standard *average age effect (namely as a function of ), we can use

> typevariable=as.character(unique(outputreg$VARIABLElong)) > basegraph=outputreg[outputreg$VARIABLElong==typevariable[2],] > x=as.numeric(as.character(basegraph$MODALITY)) > y=basegraph$COEF > plot(x,y,type="p",col="blue",xlab="Age")

while the cohort effect ( as a function of ) is obtained using

> basegraph=outputreg[outputreg$VARIABLElong==typevariable[5],] > x=as.numeric(as.character(basegraph$MODALITY)) > y=basegraph$COEF > plot(x,y,type="p",col="blue",xlab="Cohort (year of birth)",ylim=c(0,10))

*(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)*