
Nikita Ivanov is a founder and CEO of GridGain Systems, developer of one of the most innovative real-time big data platforms in the world. He has almost 20 years of experience in software development, a vision and pragmatic view of where development technology is going, and high quality standards in software engineering and entrepreneurship. Nikita is a DZone MVB and is not an employee of DZone.

Five Words That Give Away Rotten Analytics Strategies

09.05.2012

Over the last 12 months, I’ve had plenty of “conversations” about big data analytics and BI strategies with customers and potential users. The five words below represent tell-tale signs of decay in the field, summing up the current state of analytics/BI and demonstrating why it is, by and large, in a sorry state. (Beware: I'm going to use a measure of hyperbole to underline my point.)

“Batch”

This is probably obvious to most industry insiders, but it's worth mentioning: if you have a “batch” process in your big data analytics, you're not processing live data. You're not processing data in a real-time context. Period.

That means you're analyzing stale data, and your smarter, more agile competitors are running circles around you. They can analyze and process live (streaming) data in real-time and make appropriate operational BI decisions based on real-time analytics.

Using “batch” in your system design is like running your database off a tape drive. Would you do that when everyone around you is using disks?
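
To make the contrast concrete, here is a minimal sketch -- plain Python, nothing GridGain- or product-specific, and with made-up event fields -- of what processing live data actually looks like: the aggregate is maintained as events arrive, so the answer is always current instead of being recomputed from yesterday's dump.

    # Toy sliding-window aggregation: the 60-second total is updated on
    # every incoming event, so it is never stale. The event fields are
    # hypothetical and stand in for whatever your stream carries.
    import time
    from collections import deque

    WINDOW_SECONDS = 60
    window = deque()          # (timestamp, amount) pairs inside the window
    running_total = 0.0

    def on_event(timestamp, amount):
        """Called for every incoming event; evicts data older than the window."""
        global running_total
        window.append((timestamp, amount))
        running_total += amount
        cutoff = timestamp - WINDOW_SECONDS
        while window and window[0][0] < cutoff:
            _, old_amount = window.popleft()
            running_total -= old_amount

    # Feed a few synthetic events and read the live aggregate.
    for i in range(5):
        on_event(time.time(), 10.0 * i)
    print("last-60s total:", running_total)

A real deployment would hang this off a message bus rather than a loop, but the point stands: the data driving your decisions is never more than seconds old.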

“Data Scientist”

A bit controversial. But if you need one, your analytics/BI are probably not driving your business -- since you need a human body between your business and your data. Humans (who sadly need to eat and sleep) saddle any process with massive latency and non-real-time characteristics.

In most cases, needing a data scientist simply means:
  • The data you're collecting -- and the system collecting it -- are so messy that you need a Data Scientist (i.e. Statistician/Engineer under thirty) to clean it up
  • Your process is too slow and clunky for any real automation
  • Your analytics/BI is outdated by definition (i.e. analyzing stale data with no meaningful BI impact on daily operations)
Now, sometimes you need a domain expert to understand the data and help come up with some modeling, but I’ve yet to see a case complex enough that someone with a four-year engineering degree in CS couldn't solve it. Most of the time, bringing in a data scientist is an overreaction/over-hire resulting from a poor understanding of the problem.

“Overnight”

The little brother of “Batch.” It is essentially a built-in failure for any analytics or BI. In a world of hyper-local advertising, geolocation, and up-to-the-second updates on Twitter, Facebook, and LinkedIn, you're the proverbial grandma driving a '66 Buick on the highway, turn signal blinking, as everyone speeds past you…

There’s simply no excuse for having any type of overnight processing (except in some rare legacy financial applications). Overnight processing is not just technical laziness; it is often a built-in organizational tenet -- and that’s what makes it even more appalling.

“ETL”

The little brother of “Overnight.” ETL is what many people blame for overnight processing. “Look, we’ve got to move this Oracle into Hadoop and it takes 6 hours, and we can only do it at night when no one is online.”

Well, I can only really count two or three clients of ours where no one is online during the night. This is 2012, for God’s sake! Most businesses -- even smallish startups -- are 24/7 operations these days.
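
If the nightly dump really is an ETL constraint, the fix is usually incremental extraction rather than a bigger batch window. Here is a hypothetical sketch of a "high-water mark" pull -- the table, columns, and the sqlite3 stand-in are all assumptions for illustration, not anything from a real system:

    # Pull only rows changed since the last run, instead of dumping the
    # whole table overnight. sqlite3 stands in for the source database;
    # placeholder syntax varies by driver.
    import sqlite3

    def extract_increment(conn, last_seen):
        """Return rows modified after `last_seen`, plus the new watermark."""
        cur = conn.execute(
            "SELECT id, payload, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_seen,),
        )
        rows = cur.fetchall()
        new_watermark = rows[-1][2] if rows else last_seen
        return rows, new_watermark

    # Run this every few minutes; each pass ships only the delta
    # downstream, so there is no six-hour window to schedule at night.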

ETL is the clearest sign of significant technical debt accumulation. It is, for the most part, indicative of a defensive and lazy approach to system design. It is especially troubling to see this approach in newer, younger companies that don’t have 25 years of legacy to deal with.

And it is equally invigorating to see it being steadily removed in companies with fifty years of history in IT.

“Petabyte”

This is a bit controversial, too. But I’m getting a bit tired of hearing, “We must design to process Petabytes of data” from companies with twenty employees.

Let me break it down:
  • 99.99% of companies will NEVER need Petabyte-scale processing
  • If your business “needs” to process Petabytes of data for its operations, you're likely doing something very wrong
  • Most of the “working sets” we’ve seen, i.e. the data you really need to process, measure in the low teens of terabytes for the absolute majority of use cases
  • Given how frequently data is changing (in its structure, content, usefulness, freshness, etc.), I don’t expect that “working set” size will grow nearly as quickly (if at all) -- overall data volume will grow, but not the actual “window” that we need to process.
Yes, some companies and government organizations will need to store petabytes or even exabytes of data -- but in all of those rare cases it’s for historical, archival, and backup reasons, and almost never for frequent processing.
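
A quick back-of-envelope check (the ingest rate here is an assumption, and a generous one) shows why Petabyte-scale planning is premature for a twenty-person company:

    # At an assumed 1 TB of genuinely new data per day -- already generous
    # for the vast majority of businesses -- reaching one petabyte takes
    # close to three years.
    ingest_tb_per_day = 1.0
    petabyte_in_tb = 1000.0
    days_to_petabyte = petabyte_in_tb / ingest_tb_per_day
    print("years to accumulate 1 PB:", round(days_to_petabyte / 365, 1))  # ~2.7
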
Published at DZone with permission of Nikita Ivanov, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)