Archive for the ‘Data’ Category

Lately, I have been thinking about the entire big data trend. Fundamentally, it makes sense to me, and I believe it is useful for some enterprise-class problems, but something about it had been troubling me, so I decided to take some time and jot down my thoughts. As I thought more about it, I realized my core issue is with some of the oversimplified rhetoric I hear about what big data can do for businesses. A lot of it is propagated by speakers and companies at big-name conferences and subsequently echoed by many blogs and articles. Here are the three main myths that I regularly hear:
1. More data = More insights
An argument I have heard a lot is that with enough data, you are more likely to discover patterns, facts and insights. Moreover, with enough data, you can discover patterns and facts using simple counting that you can’t discover in small data using sophisticated statistical methods.

My take:
It is true, but only as a research concept. For businesses, the key barrier is not the ability to draw insights from large volumes of data; it is asking the right questions for which they need an insight. It is never wise to generalize about the usefulness of large datasets, since the ability to provide answers depends on the question being asked and the relevance of the data to that question.

2. Insights = Actionability = Decisions
It is almost an implicit assumption that insights will be actionable and that, because they are actionable, business decisions will be made based on them.

My take:
There is a huge gap between insights and actionability. Analysts always find very interesting insights, but only a tiny fraction of them will be actionable, especially if one has not started with a very strong business hypothesis to test.

Even more dangerous is the assumption that because an insight is actionable, an executive will make the decision to implement it. Ask any analyst who has worked in a large company, and he or she will tell you that the realities of business context and the failure of rational choice theory stand in the way of a lot of good actionable insights turning into decisions.

3. Storing all data forever is a good thing
This is the Gmail pitch. Enterprises do not have to decide which data to store and which to purge. They can and should store everything because of Myth 1: more data means more insights and competitive advantage. Moreover, storage is cheap, so why would you not store all data forever?

My take:
Remember the backlash against Gmail, which did not have a delete button when it started. The fact is that there is a lot of useless data, which increases the noise-to-signal ratio. Enterprises struggle with data quality issues, and storing everything without any thought to which data is useful for which kinds of questions does more harm than good. Business-centric approaches to data quality and data architecture have a significant payoff for downstream analytics, and we should give them their due credit when we talk about big data.

In summary,

1. There is a lot of headroom left for small data insights that enterprises fail to profit from.
2. There are indeed some very interesting use cases for big data which are useful for enterprises (even the non-web related ones).
3. But the hype and the oversimplification of the benefits, without thoughtful consideration of the issues and barriers, will lead to disappointment and disillusionment in the short run.

Some interesting perspectives on the topic: James Kobielus, Rama Ramkrishnan


Interesting stories on information & decisions that influenced my thinking and were tweeted by me:

  1. Extending the value of operational #data to serve your customers. Netflix ISP comparison http://tinyurl.com/4whjd3o
  2. Coal economics and computer chips. Demand driver for need for more #analytics http://bit.ly/f6Hme2
  3. Excellent paper for #analytics practitioners on customer lifetime value and RFM models http://bit.ly/dkdkaa
  4. #visualization of lobbying efforts http://reporting.sunlightfoundation.com/lobbying/
  5. #Analytics of dating http://blog.okcupid.com/
  6. LinkedIn’s @PeteSkomoroch on the key skills that data scientists need. http://oreil.ly/hXZTVJ


Theme 3: Integrating third-party data into predictive analysis

This is the third installment of the eight-part series on predictive analytics (see part 1, part 2).

Perhaps one of the most significant opportunities for organizations using predictive analytics is incorporating new, relevant third-party data into their analysis and decision-making process. Investing in a targeted and relevant dataset generates far greater returns than spending time developing sophisticated models without the right dataset.

Suppose a wedding gown retailer wants to pursue a geographical expansion strategy. How would they determine where to open new stores? For that matter, how should they evaluate the performance of existing stores? Should a store in the Chicago suburbs produce the same volume of business as a store in Austin?

To answer the above questions, you need a lot of data that is not within the organization’s firewalls. One will need to know where people are getting married (demand in a market), how many competitor stores sell wedding gowns in the same area (competitive intensity in a market), how far potential brides are willing to travel to buy a wedding gown (real estate costs in the city vs. the suburbs will be vastly different), and the income and spending profile of people in the market (how much customers are willing to spend).

Marriage registration data from NCHS, socio-demographic data from a company like Claritas or the US Census, business data from Dun & Bradstreet or InfoUSA, cost data for real estate, and perhaps custom survey data from potential brides should all be input variables in the store location analysis. Data about existing store sales and the customer base is important, but it tells only part of the story and does not provide the entire context needed to make the right decisions.

Using the above data, the retailer will be able to identify favorable markets with higher volumes and growth in marriages and appropriate competitive profiles. It can also use existing store performance data to rank the favorable markets using a regression or cluster analysis and then corroborate the insights using mystery shopping or a survey. Such a data-driven methodology represents a quantum improvement over how new store locations are identified and evaluated. While the datasets are unique to the problem, I find that such opportunities exist in every organization. A clear framing of the problem, creative thinking about the various internal and external data, and targeted analysis leading to significantly better solutions are what information advantage is all about.
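The ranking step can be sketched in a few lines. This is a minimal illustration only, assuming a single explanatory variable (marriages per store in the market) and a plain least-squares line fitted on existing stores. Every store, market name and number below is hypothetical, and a real analysis would bring in the other datasets listed above (income, real estate costs, survey results).

```python
from statistics import mean

# Hypothetical existing stores:
# (annual marriages in market, competitor stores, annual sales in $ thousands).
EXISTING_STORES = [
    (4000, 3, 700),
    (6000, 5, 720),
    (9000, 5, 950),
    (3000, 1, 930),
    (12000, 11, 680),
    (10000, 4, 1200),
]

# Hypothetical candidate markets: name -> (annual marriages, competitor stores).
CANDIDATE_MARKETS = {
    "Market A": (8000, 3),
    "Market B": (15000, 14),
    "Market C": (5000, 1),
}


def demand_per_store(marriages, competitors):
    """Marriages per store, counting the competitors plus our own store."""
    return marriages / (competitors + 1)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def rank_markets(stores, candidates):
    """Return (market name, predicted sales) pairs, best market first."""
    xs = [demand_per_store(m, c) for m, c, _ in stores]
    ys = [sales for _, _, sales in stores]
    slope, intercept = fit_line(xs, ys)
    predicted = {
        name: slope * demand_per_store(m, c) + intercept
        for name, (m, c) in candidates.items()
    }
    return sorted(predicted.items(), key=lambda item: item[1], reverse=True)


ranking = rank_markets(EXISTING_STORES, CANDIDATE_MARKETS)
for name, sales in ranking:
    print(f"{name}: predicted sales of about ${sales:,.0f}k")
```

The point of the sketch is the workflow rather than the model: external demand and competition data define the variable, internal sales data calibrate it, and the output is a shortlist to corroborate with mystery shopping or a survey.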

We are in the midst of an open data movement, with massive amounts of data being released by the government under the Open Government Directive. Private data exchanges are being set up by Microsoft and InfoChimps, among others. Not to mention all the new types of data now available (e.g., Twitter stream data). Companies that build capabilities to identify, acquire, cleanse and incorporate various external datasets into their analysis will be well positioned to gain the information advantage.


Earlier this week, I suggested a potential business application of the IRS’s internal migration data for a moving and relocation company.

Folks at Nielsen Claritas just found a far more interesting correlation, one that could have driven a lot of business decisions. They note:

Today’s presence of underwater mortgages, or homes with negative equity, seem to be correlated to two common regional U.S. population trends: 1) domestic immigration from the Northeastern region to the South and Southwestern regions of the U.S., and 2) migration from coastal California inland

While such retrospective analysis is interesting for reports and blogs, it is not particularly useful for businesses, except perhaps as a means to generate interesting hypotheses for the future. It would have been useful had the chart been available to the strategic planning or risk groups of the businesses signing people up for these housing loans in 2006 and 2007.

Data is valuable only when it is used to drive decisions. Most companies have a huge opportunity to do a better job in bringing together data, analytics and visualization and delivering them to the points of decision.


One subject that has not received a lot of coverage in the analytics blogging circle is the current administration’s data.gov project. While still in its infancy, data.gov is an outcome of the government’s transparency initiative called the Open Government Directive. In December, all government agencies were asked to produce and publish three new ‘high value’ public data feeds on the data.gov website.

The data.gov site still has to work through some kinks, but eventually it will become a wonderful resource for the data analytics industry. It will probably be as critical as the US Census data and its American FactFinder tool, which has spawned multiple companies and supports all kinds of interesting analysis across a wide range of industries.

The Sunlight Foundation tracks the new datasets that are being released. For example, one of the Labor Department’s datasets is the “weekly reports of fatalities, catastrophes and other events.” The data, compiled by the Occupational Safety and Health Administration, briefly describes workplace accidents and identifies the company and the date of each accident. I think a lot of insurance companies with workers’ compensation products will be interested in analyzing the data to better price their products. Or take, for instance, the IRS internal migration data by state and county, which is based on tax returns. Can it be used by moving companies to better understand the shift in demand for their services? There are thousands of such datasets available, and a lot of them will potentially be valuable to businesses. The value of a dataset to a business, like beauty, is in the eye of the beholder. This makes categorization challenging, but it also makes the data interesting for businesses as a potential source of competitive advantage. If you can figure out how to interpret the IRS migration data to better align the marketing campaigns for your moving and relocation assistance business, you can get a better return on your spend than your competition.
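As a rough illustration of the moving-company idea, the sketch below nets county-to-county flows into a gain or loss per county and sorts the result. The record layout (origin, destination, number of tax returns) and all of the numbers are assumptions made for the example; they are not the actual IRS file format or real figures.

```python
from collections import defaultdict

# Hypothetical county-to-county flows: (origin, destination, tax returns that moved).
# Illustrative layout and numbers only, not the real IRS file format.
FLOWS = [
    ("Cook, IL", "Travis, TX", 1200),
    ("Travis, TX", "Cook, IL", 400),
    ("Los Angeles, CA", "Travis, TX", 900),
    ("Los Angeles, CA", "Riverside, CA", 2500),
    ("Riverside, CA", "Los Angeles, CA", 1100),
]


def net_migration(flows):
    """Net returns gained (positive) or lost (negative) per county."""
    net = defaultdict(int)
    for origin, destination, returns in flows:
        net[origin] -= returns
        net[destination] += returns
    return dict(net)


def rank_by_net_gain(flows):
    """Counties sorted from largest net gain to largest net loss."""
    return sorted(net_migration(flows).items(), key=lambda kv: kv[1], reverse=True)


for county, net in rank_by_net_gain(FLOWS):
    print(f"{county}: {net:+d} returns")
```

A moving company might weight its marketing toward the largest origin-destination pairs rather than the net figures alone, since every move in either direction is potential business.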

It is time for organizations to look outside their firewalls and build a strategy for collecting external data and incorporating it into their analytics and strategic planning efforts. Companies like Infochimps, a private clearinghouse and marketplace for third-party data, are betting on this trend. They already collect, cleanse and format the data.gov datasets so that they are analysis-ready.

Take the time to check the datasets that are available. You never know what you may find.


The theme of this blog is to understand how actionable information, in the form of decision support tools, will lead to the next wave of efficiencies and competitive advantage. However, the reverse is probably more stark. Not investing in table-stakes data aggregation and reporting capabilities can also hurt, and it can hurt big time.

The financial crisis in Greece is a case study in how easy money and uncontrolled government spending during boom times can come back to hurt in a weak economy. However, one of the confounding factors has been the Greek government’s repeated revisions of its budget deficit data. In April 2008, it reported the deficit to be 5.0% of GDP. Later that year, it revised the figure up to 7.7%. Similarly, in April 2009, the official forecast for the deficit was 3.7% of GDP, which was later revised to 12.5% of GDP. It is this last revision that started the full-blown crisis.
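To put the size of those revisions in perspective, here is a small worked calculation using only the figures quoted above. The function name is just a label chosen for this sketch.

```python
# Deficit figures (percent of GDP) as quoted in the paragraph above:
# year -> (initially reported, revised).
REVISIONS = {
    2008: (5.0, 7.7),
    2009: (3.7, 12.5),
}


def revision_size(initial, revised):
    """Return the revision in percentage points and as a relative increase in percent."""
    points = revised - initial
    relative = points / initial * 100
    return round(points, 1), round(relative, 1)


for year, (initial, revised) in REVISIONS.items():
    points, relative = revision_size(initial, revised)
    print(f"{year}: +{points} points of GDP, {relative}% above the initial figure")
```

The 2009 revision more than tripled the original forecast, which helps explain why it was the one that shook confidence.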

Digging a little deeper, it is easy to discover one of the key reasons for the revisions: the lack of a modern budgetary process and financial reporting system.

Past budgets have rested on some 14,000 separate expenditure lines. This year’s has brought the figure down to about 1,000. In this system, the evaluation of public spending in any particular area is almost impossible. The amount spent on education, for example, is defined as the total sum of money allocated to the Ministry of Education and it is very difficult to monitor where it goes. Currently, most of Greece’s 15 ministries and dozens of other government bodies handle their own payroll accounts, making it difficult to gain a complete overview of government spending.

No wonder they could not reliably trace how much money was being spent!

Last year, the Greek government had also approached the OECD to conduct a study and recommend improvements in its budgetary processes, and one of the recommendations was around managing the deployment of the new accounting and financial information system.

Ill-defined processes and weak information management systems tend to exist in certain quarters of most organizations. The key question to ask yourself is whether this under-investment in information systems:
1) exposes you to a big risk
2) makes you inefficient or
3) prevents you from gaining some potential competitive advantage?


I believe that the next wave of productivity improvement in companies is going to come from investment in decision support tools.

In our economy, almost all workers are now knowledge workers. However, unlike in the industrial era, we still do not have the right set of tools to support them. We live in an era of information overload, where employees increasingly need to make faster and more complex decisions using large amounts of available data. Under such circumstances, making informed decisions, let alone optimal ones, is simply not humanly possible.

This creates a need for a range of new tools for employees: tools that guide the decision-making process and, wherever appropriate, automate it. To create such tools, companies will need to build expertise in four foundational areas:

1. Data: Identifying, collecting, managing and curating the data within the company and relevant third party sources
2. Analytics: Creating a scalable process to turn data to relevant insights and recommendations
3. Visualization: Presenting the insights within the appropriate context to support a decision
4. Integration: Bringing all of the pieces together to make the recommendation/insight available at the point of decision in the workflow of the employee

Companies will become better at bringing together the four foundational areas. We will also see increased activity in the vendor space: established vendors will become more active in acquiring companies along the value chain, and a range of startups will rush into the space to fill the gaps.

For now, the blog is about tracking this trend.
