Lately, I have been thinking about the entire big data trend. Fundamentally, it makes sense to me and I believe it is useful for some enterprise-class problems, but something about it had been troubling me, so I decided to take some time and jot down my thoughts. As I thought more about it, I realized my core issue is with some of the oversimplified rhetoric that I hear about what big data can do for businesses. A lot of it is propagated by speakers and companies at big-name conferences and subsequently echoed by many blogs and articles. Here are the 3 main myths that I regularly hear:
1. More data = More insights
An argument which I have heard a lot is that with enough data, you are more likely to discover patterns and facts and insights. Moreover, with enough data, you can discover patterns and facts using simple counting that you can’t discover in small data using sophisticated statistical methods.
My take:
It is true, but as a research concept. For businesses, the key barrier is not the ability to draw insights from large volumes of data; it is asking the right questions for which they need an insight. It is never wise to generalize the usefulness of large datasets, since the ability to provide answers will depend on the question being asked and the relevance of the data to the question.
2. Insights = Actionability = Decisions
It is almost an implicit assumption that insights will be actionable and, since they are actionable, that business decisions will be made based on them.
My take:
There is a huge gap between insights and actionability. Analysts always find very interesting insights, but only a tiny fraction of them will be actionable, especially if one has not started with a very strong business hypothesis to test.
Even more dangerous is the assumption that because an insight is actionable, an executive will make the decision to implement it. Ask any analyst who has worked in a large company and he or she will tell you that the realities of business context and the failure of rational choice theory stand in the way of a lot of good actionable insights turning into decisions.
3. Storing all data forever is a good thing
This is the Gmail pitch. Enterprises do not have to decide which data they need to store and what to purge. They can and should store everything because of Myth 1: more data means more insights and competitive advantage. Moreover, storage is cheap, so why would you not store all data forever?
My take:
Remember the backlash against Gmail, which did not have a delete button when it started. The fact is that there is a lot of useless data, which increases the noise-to-signal ratio. Enterprises struggle with data quality issues, and storing everything without any thought to which data is useful for which kinds of questions does more harm than good. Business-centric approaches to data quality and data architecture have a significant payoff for downstream analytics, and we should give them their due credit when we talk about big data.
In summary,
1. There is a lot of headroom left for small data insights that enterprises fail to profit from.
2. There are indeed some very interesting use cases for big data which are useful for enterprises (even the non-web-related ones).
3. But the hype and the oversimplification of the benefits, without thoughtful consideration of issues and barriers, will eventually lead to disappointment and disillusionment in the short run.
Some interesting perspectives on the topic: James Kobielus, Rama Ramkrishnan
I would like to submit the following refinement of the big data concept.
Assume you are a business and in your daily process data is generated to support what is, at that point, the actionable data stream. Say you operate like that for two years, at which point you realize that there has been no correlation between your actionable data stream and your top three key performance metrics. Now what do you do? If you threw away data that didn’t pertain to the BKMs of that time, you might have thrown away the real actionable information that would have allowed you to discover which business processes would really help the business. That is how I interpret the “more data generates more insight” concept.
Twenty years ago, when a GByte on a filer was a big deal, we constantly fought with this problem. We would spend 6 months generating data and calibrating models, and then, once we got the round trip done and tested our predictions, we would learn that we should be tracking different metrics. Having to throw away data because you can’t store it made learning much more difficult, because you couldn’t go back and test whether your new insights would have produced better results. This led to constant churn where we had to redo experiments or simply would not be able to relate new findings to the past.
What we do now with our data sets is define the decision universe as broadly as we understand the problem. Our data generators will err on the side of creating too much information, for the simple reason that now we can store a couple hundred TBs without trouble. The actionable decision processes will grab from this data set, and data mining algorithms continue to analyze the raw data to see if there are better metrics hidden in the data. Once you really understand your decision universe and the business processes that it can affect, you can scrub the data. We still prefer to compress/dedup/archive it so that we can generate a decade or more of operational data. The ability to back test is so valuable for insight.
Totally agree with you and your use case.
In your example, you have a business problem that you are trying to solve, a defined set of data, a process to systematically test on a continuous basis, and measurable metrics you want to influence. You are certainly using more data to your advantage.
However, when I hear about big data, in most cases the starting point is the data and not the problem/metric, which I contend is not the best place to start.
Amaresh:
Totally agree with your assessment: starting with raw data is asking for trouble.
The way we have set up our analytics pipelines is that they always work towards a decision point, and then we have a feedback mechanism to test in the future whether our decision was the right one. If it turns out to have been the wrong decision, we can go back and figure out what we did wrong in our analysis. For real-time decision making it is typically a bit easier to be disciplined, as the analytics coding is all done to trigger some test. For knowledge discovery I believe you need two ingredients:
1- critical mass of statistical thinking, and
2- deep statistical skills with a solid historical perspective
Critical mass is needed so that you don’t have one person in the corner generating reports that the rest of the organization simply ignores. Deep skills and historical perspective are needed to direct the right algorithms and sentiment. I truly believe in the value of historical perspective, as it grounds the analytics in a richer context.
By focusing on a decision, it is easier to allocate the right analytical resources, although that is still a very difficult process if you don’t have infinite skills and resources.
Excellent observations, Amaresh and @deepanalytics.
An implicit assumption behind a lot of the breathless commentary in the blogosphere seems to be this: if you can unearth an insight from data and present it to the client/customer/end-user, they will immediately embrace it, act on it, get value from it, and thank you profusely for it.
In reality, the response from a customer is more like: “oh great. one more so-called insight. now i have to worry about what the heck to do with this darn thing, on top of all the stuff that’s part of my day job. thanks a lot!”
What customers need are decision recommendations with supporting evidence, not “insights”.
@deepanalytics
I like the way you talk about real-time decision making and knowledge discovery.
@Rama, really liked the way you put it.
One more point to add to this discussion is that despite all the analytical insights that may be produced out of data, someone may neglect to perform a “sanity check” on whether this is the right insight for us. We may end up creating an answer in search of a problem. Therefore, I think it is really important for businesses to ask the right questions all the way through the process of generating analytical insights.