We have a lot more storage space available these days, and a lot more
data to work with, so Big Data and Big Analytics are becoming much more
mainstream. There are conclusions and insights you can get from that
data, more or less any data, but Web data in particular brings a new
dimension when combined with more traditional, domain-specific data.
But this data is also mostly plain text, like blogs, tweets, news
articles and other Web content. This in turn means that to combine your
organized, structured sales data for 20 years with Web data, the Web
data first needs to be analyzed.
Web data also brings in a new difficulty: the data is big and it's not
organized at its core, so you cannot easily aggregate it or do something
similar to save space (and why would you want to do that?).
It's not until after you have analyzed it that you know what data is
interesting and what is not. And to be frank (but I am not, I'm Anders),
not even then can you start to aggregate data or throw away data that
isn't interesting. In my mind, this is a mistake that has been made
in all sorts of analytics, even with smaller amounts of data.
When it comes to analytics, my view is this: "If you think you have all the right answers, you haven't asked all the right questions."
This is an important point: analytics is a recurring activity, and the
more questions you get answered, the more new questions you should get.
With this in mind, how can you know what to aggregate? In particular
when it comes to Web content?
So, can we live with Web data not being aggregated, and how do we do it?
What database can support that? Oracle? MySQL? MongoDB? Vertica? The
answer is, just as with analytics itself, that you will not know when
you start analyzing, and once you have started, you will be even more
in doubt! Which technology supports all the aspects you might need to
look at? The keyword is might.
So, how can we solve this? My answer is: by using the right tool for
the job at hand, and by being prepared to combine different tools!
Postgres and Oracle are great for temporal analysis; for GIS we have
Oracle, MySQL and PostGIS. For handling large amounts of data with good
scalability while keeping the cost down, you might want a key-value
store like MongoDB or DynamoDB. To search data you might head for
Sphinx or Lucene. And so on.
As an example, I might want a key-value store for my raw Web data,
holding some key for easy lookup; an RDBMS for the attributes of this
data; and Sphinx for searching it. Sphinx and Lucene are much better
text-search tools than your average RDBMS, be it MySQL, Oracle or
whatever, and an RDBMS search is different from a text search in Web
data!
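To make that split concrete, here is a minimal sketch of the idea in Python. It only uses stand-ins: a plain dict plays the key-value store (where MongoDB or DynamoDB would go), SQLite plays the RDBMS holding the attributes, and a toy inverted index plays Sphinx/Lucene. All the names and sample documents are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# 1. Key-value store: doc_id -> raw Web text (stand-in for MongoDB/DynamoDB)
raw_store = {
    "doc1": "Big Data analytics is going mainstream",
    "doc2": "Combining sales data with Web data",
}

# 2. RDBMS: structured attributes of each document (stand-in for MySQL/Oracle)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE doc_attrs (doc_id TEXT PRIMARY KEY, source TEXT, fetched TEXT)")
db.executemany("INSERT INTO doc_attrs VALUES (?, ?, ?)",
               [("doc1", "blog", "2012-01-10"), ("doc2", "news", "2012-01-11")])

# 3. Inverted index: word -> set of doc_ids (stand-in for Sphinx/Lucene)
index = defaultdict(set)
for doc_id, text in raw_store.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(word):
    """Full-text lookup first, then join in the attributes from the
    RDBMS and fetch the raw document from the key-value store."""
    results = []
    for doc_id in index.get(word.lower(), set()):
        (source,) = db.execute(
            "SELECT source FROM doc_attrs WHERE doc_id = ?", (doc_id,)).fetchone()
        results.append((doc_id, source, raw_store[doc_id]))
    return results

print(search("data"))
```

The point of the sketch is the division of labour, not the code itself: each question (search, attribute filtering, raw retrieval) goes to the store that handles it best, and any one of the three pieces can be swapped out without touching the others.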
So the most important aspect, if you ask me, is to choose technologies
that can easily be combined, so that different aspects of the data can
be served by different technologies as appropriate. And be prepared to
add, remove and change technologies as you go along!