Big Data/Analytics Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 528 posts at DZone. You can read more from them at their website. View Full User Profile

R: Building Up a Data Frame Row by Row

02.21.2013
| 1604 views |
  • submit to reddit

Jen and I recently started working on the Kaggle Titanic problem and we thought it’d probably be useful to start with some exploratory data analysis to get a feel for the data set.

For this problem you are given a selection of different features describing the passengers on board the Titanic and you have to predict whether or not they would have survived or died based on those features.

I thought an interesting first thing to look at would be the survival rate for passengers of different socio economic status as you would imagine that people of a higher status were more likely to have survived.

The data looks like this (it looks a lot clearer when you click the "View Source" button):

> head(titanic)
  survived pclass                                                name    sex age sibsp parch           ticket    fare cabin embarked
1        0      3                             Braund, Mr. Owen Harris   male  22     1     0        A/5 21171  7.2500              S
2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0         PC 17599 71.2833   C85        C
3        1      3                              Heikkinen, Miss. Laina female  26     0     0 STON/O2. 3101282  7.9250              S
4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0           113803 53.1000  C123        S
5        0      3                            Allen, Mr. William Henry   male  35     0     0           373450  8.0500              S
6        0      3                                    Moran, Mr. James   male  NA     0     0           330877  8.4583              Q

I started by writing the following function to work out how many people in a particular class survived:

survived <- function(class) {
  peopleInThatClass <- titanic[titanic$pclass == class,]
  survived <- peopleInThatClass[peopleInThatClass$survived == 1,]  
  nrow(survived)
}

I could then work out how many people in each class survived by using lapply:

> lapply(c(1,2,3), survived)
[[1]]
[1] 136
 
[[2]]
[1] 87
 
[[3]]
[1] 119

This works but the output isn’t very nice to read and I wanted to put that data side by side with other information about eeach class for which I figured I’d need to construct a data frame.

I came across a solution which works quite well about half way down a StackOverflow question and reworked my code to make use of that:

survivalSummary <- function(classes) {
  summary<-NULL
  for(class in classes) {
    numberSurvived <- survived(class)
    summary <- rbind(summary,data.frame(class=class, survived=numberSurvived))
  }
 
  summary[with(summary, order(class)),]  
}
> survivalSummary(c(1,2,3))
  class survived
1     1      136
2     2       87
3     3      119

We make use of the rbind function which creates a new data frame with a row appended. We then have to reassign the new data frame to the summary variable.

If we want to add other information about the class onto each row such as the total number of passengers and the survival rate it’s very easy:

survivalSummary <- function(classes) {
  summary<-NULL
  for(class in classes) {
    numberSurvived <- survived(class)
    total <- passengers(class)
    percentageSurvived <- numberSurvived / total
    summary <- rbind(summary,data.frame(class=class, survived=numberSurvived, total=total, percentSurvived=percentageSurvived))
  }
 
  summary[with(summary, order(class)),]  
}
> survivalSummary(unique(titanic$pclass))
  class survived total percentSurvived
2     1      136   216       0.6296296
3     2       87   184       0.4728261
1     3      119   491       0.2423625

We’re also now using unique to get the different classes rather than hard coding them.

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)