Big Data/BI Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 492 posts at DZone. You can read more from them at their website. View Full User Profile

Kaggit Digit Recognizer: A Feature Extraction Fail in R

  • submit to reddit

I’ve written a few blog posts about our attempts at the Kaggle Digit Recogniser problem and one thing we haven’t yet tried is feature extraction.

Feature extraction is where we’d generate some other features to train a classifier with rather than relying on just the pixel values we were provided.

Every week Jen would try and persuade me that we should try it out but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:

5. The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.

I've highlighted the part of the post which describes exactly what we’ve been doing!

There were some examples of feature extraction on the Kaggle forums so I thought I’d try and create some other features using R.

I created features for the number of non zero pixels, the number of 255 pixels, the average number of pixels and the average of the middle pixels of a number.

The code reads like this:

initial <- read.csv("train.csv", header = TRUE)
initial$nonZeros <- apply(initial, 1, function(entries) length(Filter(function (x) x != 0, entries)))
initial$fullHouses <- apply(initial, 1, function(entries) length(Filter(function (x) x == 255, entries)))
initial$meanPixels <- apply(initial, 1, mean)
initial$middlePixels <- apply(initial[,200:500], 1, mean)

I then wrote those features out into a CSV file like so:

newFeatures <- subset(initial, select=c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(file="feature-extraction.txt", newFeatures, row.names=FALSE, sep=",")

I then created a 100 tree random forest using Mahout to see whether or not we could get any sort of accuracy using these features.

Unforutnately the accuracy on the cross validation set (10% of the training data) was only 24% which is pretty useless so it’s back to the drawing board!

Our next task is to try and work out whether we can derive some features which have a stronger correlation with the label values or combining the new features with the existing pixel values to see if that has any impact.

As you can probably tell I don’t really understand how you should go about extracting features so if anybody has ideas or papers/articles I can read to learn more please let me know in the comments!

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)