Can You Justify Your Big Data Calculations
Can you justify all the calculations in your system?
Imagine you went to a restaurant and were presented with a bill. However this bill just contains as single number - the total to pay. The bill is quite high but you did order the lobster... Do you trust the bill without a breakdown? Would you go back? Would you shop in a super-market that didn't give you an itemised receipt? I have to admit that I'm the kind of nerd that does check bills, but even if you aren't, it's reassuring that you can check them.
If we like to check where a result comes from them why do so many IT systems simply produce a number with no ability to show the inputs or calcuation? This feature is often referred to as a Drill Down.
You might be asking why I'm talking about this on an architectural blog - surely it's a simple feature to be added? Unfortunately, retrospectively adding the ability to drill down into a calcuation is often very, very hard.
Let's take a simple example: a running total such as 'how much have customers spent in my website today?' Running totals can be more problematic than at first glance. For example does your system:
- remove the amount for cancelled orders?
- remove orders where payment is refused?
- include pending payments in the total?
- deal with duplicate orders?
- include the total as part of the main transaction or can it be lost on system failure?
It's simpler to just run a report on demand which gives you a view at that point in time. Which, in reality, is how you'd deal with this simple example but, if the calculations or data requirements are more exacting, a running total might seem like a good idea.
If the number feels wrong then you'd want to drill into it. You might like to know:
- When the total was started (beginning of day/month/year)
- When the total ended (is this actually yesterdays number you're looking at)
- What products it covers
- What area/currency etc it covers
This can probably be shown from static data and doesn't involve a drill down into the calculation but if the user is still not satisified then they want:
- A list of all the transactions it covers.
This allows the users to actually reconcile against an external source (you know they'll want this data exported to excel!)
The simplest answer to producing this is to run a new report that gathers all the information that should be in the total. However, what if this new report has a total that differs from your live total? How do you find out where it went wrong? This becomes a very difficult debugging exercise and your user can lose trust with the system.
Once information is discarded you can't get it back so if you don't track the inputs to a calculation you can't guarantee what they were. You might be able to make a good approximation but ultimately it's a guess.
Questions to ask when designing or developing a system are:
- What data in your system is raw and what is derived?
- Does the derived data link back to it's source?
- Can you navigate from source to derived data and back if required?
- Do you keep all source/raw data?
- Do you actually need to derive the data/report in real time?
These questions are particularly important when you are dealing with huge quantities of real time data and most of what you store is derived or snapshots for a point in time. Most systems that connect to financial markets are of this type and you can't store everything. You have to determine what is relevant, keep this data and the links between them.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)