Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham.

Data Science: Don't Filter Data Prematurely

02.19.2013

Last year I wrote a post describing how I'd gone about getting data for my ThoughtWorks graph, and in retrospect one mistake in my approach was that I filtered the data too early.

My workflow looked like this:

  • Scrape internal application using web driver and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j
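
As a rough illustration, here's what that two-step workflow might have looked like in Python, using Selenium WebDriver and the Neo4j Python driver. The URL, CSS selectors, labels and relationship types are all made up for the sake of the sketch; the important thing is that the decision about what counts as useful data is baked into the scraping step itself:

```python
# A rough sketch of the original two-step workflow (illustrative only).
import json

from neo4j import GraphDatabase
from selenium import webdriver
from selenium.webdriver.common.by import By

# Step 1: scrape the internal application with WebDriver, keeping only the
# fields that seemed useful at the time, and save them to a JSON file.
driver = webdriver.Firefox()
driver.get("http://internal-app.example.com/people")  # hypothetical URL

people = []
for row in driver.find_elements(By.CSS_SELECTOR, "table.people tr.person"):
    people.append({
        "name": row.find_element(By.CSS_SELECTOR, ".name").text,
        "project": row.find_element(By.CSS_SELECTOR, ".project").text,
    })
driver.quit()

with open("people.json", "w") as f:
    json.dump(people, f)

# Step 2: parse the JSON file and load nodes/relationships into Neo4j.
with open("people.json") as f:
    people = json.load(f)

graph_db = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with graph_db.session() as session:
    for person in people:
        session.run(
            "MERGE (p:Person {name: $name}) "
            "MERGE (proj:Project {name: $project}) "
            "MERGE (p)-[:WORKED_ON]->(proj)",
            name=person["name"],
            project=person["project"],
        )
graph_db.close()
```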

The problem with the first step is that I was trying to determine up front what data was useful, and as a result I ended up running the scraping application multiple times when I realised I didn't have all the data I wanted.

Since each run took a couple of hours this was tremendously frustrating, but it took me a while to realise how flawed my approach was.

For some reason I kept tweaking the scraper just to get a little bit more data each time!

It wasn’t until Ashok and I were doing some similar work and had to extract data from an existing database that I realised the filtering didn’t need to be done so early in the process.

We weren’t sure exactly what data we needed but on this occasion we got everything around the area we were working in and looked at how we could actually use it at a later stage.

Given that it’s relatively cheap to store the data I think this approach makes sense more often than not – we can always delete the data if we realise it’s not useful to us at a later stage.

It especially makes sense if it's difficult to get more data, either because doing so is time consuming or because we need someone else to give us access to it and they are time constrained.

If I could rework that workflow, it'd now be split into three steps:

  • Scrape internal application using web driver and save pages as HTML documents
  • Parse HTML documents and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j
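
Sketched in the same hypothetical terms as above (made-up URL, selectors and labels), the key difference is that the scraping step now just dumps the raw pages to disk, and the filtering moves into a separate parsing step:

```python
# A rough sketch of the revised three-step workflow (illustrative only).
import json
import pathlib

from bs4 import BeautifulSoup
from selenium import webdriver

# Step 1: scrape the internal application and save the raw pages as HTML.
# No decisions yet about which bits of the page are "useful".
pathlib.Path("pages").mkdir(exist_ok=True)
driver = webdriver.Firefox()
driver.get("http://internal-app.example.com/people")  # hypothetical URL
pathlib.Path("pages/people.html").write_text(driver.page_source)
driver.quit()

# Step 2: parse the saved HTML and pull out whatever currently looks useful.
# If we later want more fields, we only re-run this step, not the scrape.
soup = BeautifulSoup(pathlib.Path("pages/people.html").read_text(), "html.parser")
people = [
    {
        "name": row.select_one(".name").get_text(strip=True),
        "project": row.select_one(".project").get_text(strip=True),
    }
    for row in soup.select("table.people tr.person")
]
with open("people.json", "w") as f:
    json.dump(people, f)

# Step 3: load the JSON into Neo4j, exactly as in the earlier sketch.
```

The expensive scraping step now produces raw HTML that can be parsed as many times as needed, so deciding to extract an extra field means re-running a cheap parsing step rather than the two-hour scrape.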

I think my experiences tie in reasonably closely with those I heard about at Strata Conf London, but of course I may well be wrong, so if anyone has other points of view I'd love to hear them.

Published at DZone with permission of Mark Needham, author and DZone MVB.
