Big Data/BI Zone is brought to you in partnership with:

Arthur Charpentier, ENSAE, PhD in Mathematics (KU Leuven), Fellow of the French Institute of Actuaries, professor at UQàM in Actuarial Science. Former professor-assistant at ENSAE Paritech, associate professor at Ecole Polytechnique and professor assistant in economics at Université de Rennes 1. Arthur is a DZone MVB and is not an employee of DZone and has posted 145 posts at DZone. You can read more from them at their website. View Full User Profile

Regression Tree Using Gini’s Index

01.29.2013
| 1409 views |
  • submit to reddit

In order to illustrate the construction of regression tree (using the CART methodology), consider the following simulated dataset,

> set.seed(1)
> n=200
> X1=runif(n)
> X2=runif(n)
> P=.8*(X1<.3)*(X2<.5)+
+   .2*(X1<.3)*(X2>.5)+
+   .8*(X1>.3)*(X1<.85)*(X2<.3)+
+   .2*(X1>.3)*(X1<.85)*(X2>.3)+
+   .8*(X1>.85)*(X2<.7)+
+   .2*(X1>.85)*(X2>.7) 
> Y=rbinom(n,size=1,P)  
> B=data.frame(Y,X1,X2)

with one dichotomos varible (the variable of interest, ), and two continuous ones (the explanatory ones  and ).

> tail(B)
    Y        X1        X2
195 0 0.2832325 0.1548510
196 0 0.5905732 0.3483021
197 0 0.1103606 0.6598210
198 0 0.8405070 0.3117724
199 0 0.3179637 0.3515734
200 1 0.7828513 0.1478457

The theoretical partition is the following

Here, the sample can be plotted below (be careful, the first variate is on the y-axis above, and the x-axis below) with blue dots when  equals one, and red dots when  is null,

> plot(X1,X2,col="white")
> points(X1[Y=="1"],X2[Y=="1"],col="blue",pch=19)
> points(X1[Y=="0"],X2[Y=="0"],col="red",pch=19)

In order to construct the tree, we need a partition critera. The most standard one is probably Gini’s index, which can be writen, when ‘s are splited in two classes, denoted here 

//perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-04.png” ne peut être affichée car elle contient des erreurs.

or when ‘s are splited in three classes, denoted 
http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-07.png

etc. Here,  are just counts of observations that belong to partition  such that  takes value . But it is possible to consider other criteria, such as the chi-square distance,

http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-01.png

where, classically

http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-02.png when we consider two classes (one knot) or, in the case of three classes (two knots) http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-05.png

Here again, the idea is to maximize that distance: the idea is to discriminate, so we want samples as not independent as possible. To compute Gini’s index consider

> GINI=function(y,i){
+ T=table(y,i)
+ nx=apply(T,2,sum)
+ pxy=T/matrix(rep(nx,each=2),2,ncol(T))
+ vxy=pxy*(1-pxy)
+ zx=apply(vxy,2,sum)
+ n=sum(T)
+ -sum(nx/n*zx)
+ }

We simply construct the contingency table, and then, compute the quantity given above. Assume, first, that there is only one explanatory variable. We split the sample in two, with all possible spliting values , i.e.

Then, we compute Gini’s index, for all those values. The knot is the value that maximizes Gini’s index. Once we have our first knot, we keep it (call it, from now on ). And we reiterate, by seeking the best second choice: given one knot, consider the value that splits the sample in three, and give the highest Gini’s index, Thus, we consider either the following partition

or this one

I.e. we cut either below, or above the previous knot. And we iterate. The code can be something like that,

> X=X2
> u=(sort(X)[2:n]+sort(X)[1:(n-1)])/2
> knot=NULL
> for(s in 1:4){
+ vgini=rep(NA,length(u))
+ for(i in 1:length(u)){
+ kn=c(knot,u[i])
+ F=function(x){sum(x<=kn)}
+ I=Vectorize(F)(X)
+ vgini[i]=GINI(Y,I)
+ }
+ plot(u,vgini)
+ k=which.max(vgini)
+ cat("knot",k,u[k],"\n")
+ knot=c(knot,u[k])
+ u=u[-k]
+ }
knot 69 0.3025479 
knot 133 0.5846202 
knot 72 0.3148172 
knot 111 0.4811517

At the first step, the value of Gini’s index was the following,

which was maximal around 0.3. Then, this value is considered as fixed. And we try to construct a partition in three parts (spliting either below or above 0.3). We get the following plot for Gini’s index (as a function of this second knot)

 which is maximum when the split the sample around 0.6 (which becomes our second knot). Etc. Now, let us compare our code with the standard R function,

> tree(Y~X2,method="gini")
node), split, n, deviance, yval
      * denotes terminal node

 1) root 200 49.8800 0.4750  
   2) X2 < 0.302548 69 12.8100 0.7536 *
   3) X2 > 0.302548 131 28.8900 0.3282  
     6) X2 < 0.58462 65 16.1500 0.4615  
      12) X2 < 0.324591 7  0.8571 0.1429 *
      13) X2 > 0.324591 58 14.5000 0.5000 *
     7) X2 > 0.58462 66 10.4400 0.1970 *

We do obtain similar knots: the first one is 0.302 and the second one 0.584. So, constructing tree is not that difficult…

Now, what if we consider our two explanatory variables? The story remains the same, except that the partition is now a bit more complex to write. To find the first knot, we consider all values on the two components, and again, keep the one that maximizes Gini’s index,

> n=nrow(B)
> u1=(sort(X1)[2:n]+sort(X1)[1:(n-1)])/2
> u2=(sort(X2)[2:n]+sort(X2)[1:(n-1)])/2
> gini=matrix(NA,nrow(B)-1,2)
> for(i in 1:length(u1)){
+ I=(X1<u1[i])
+ gini[i,1]=GINI(Y,I)
+ I=(X2<u2[i])
+ gini[i,2]=GINI(Y,I)
+ }
> mg=max(gini)
> i=1+sum(mg==max(gini[,2]))
> par(mfrow = c(1, 2))
> plot(u1,gini[,1],ylim=range(gini),col="green",type="b",xlab="X1",ylab="Gini index")
> abline(h=mg,lty=2,col="red")
> if(i==1){points(u1[which.max(gini[,1])],mg,pch=19,col="red")
+          segments(u1[which.max(gini[,1])],mg,u1[which.max(gini[,1])],-100000)}
> plot(u2,gini[,2],ylim=range(gini),col="green",type="b",xlab="X2",ylab="Gini index")
> abline(h=mg,lty=2,col="red")
> if(i==2){points(u2[which.max(gini[,2])],mg,pch=19,col="red")
+          segments(u2[which.max(gini[,2])],mg,u2[which.max(gini[,2])],-100000)}
> u2[which.max(gini[,2])]
[1] 0.3025479

The graphs are the following: either we split on the first component (and we obtain the partition on the right, below),

or we split on the second one (and we get the following partition),

Here, it is optimal to split on the second variate, first. And actually, we get back to the one-dimensional case discussed previously: as expected, it is optimal to split around 0.3. This is confirmed with the code below,

> library(tree)
> arbre=tree(Y~X1+X2,data=B,method="gini")
> arbre$frame[1:4,]
     var   n       dev      yval splits.cutleft splits.cutright
1     X2 200 49.875000 0.4750000      <0.302548       >0.302548
2     X1  69 12.811594 0.7536232      <0.800113       >0.800113
4 <leaf>  57  8.877193 0.8070175                               
5 <leaf>  12  3.000000 0.5000000

For the second knot, four cases should be considered: spliting on the second variable (again), either above, or below the previous knot (see below on the left) or spliting on the first one. Then whe have wither a partition below or above the previous knot (see below on the right),

Etc. To visualize the tree, the code is the following

> plot(arbre)
> text(arbre)
> partition.tree(arbre)

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/01/arbre-gini-x1-x2-encore.png

Note that we can also visualize the partition. Nice, isn’t it?

To go further, the book Classification and Regression Trees by Leo Breiman (and co-authors) is awesome. Note that there are also interesting sections in the bible Elements of Statistical Learning: Data Mining,Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman (which can be downloaded from http://www.stanford.edu/~hastie/…)

Published at DZone with permission of Arthur Charpentier, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)