Wednesday, August 6, 2014

Is all the Big Data usable?

So much is happening around Big Data in almost every industry. A general conception of Big Data (or rather, difficult data) puts a lot of onus on collecting data. Data has been generated for years, but with spreading awareness of its power, the storage and recording of data is becoming more and more crucial. The most basic input to any analysis is the pool of data, and companies realizing that are making efforts to store at least their internal data, if not external data (related to their clients). For an aviation company, a single half-hour flight can generate 30 terabytes of data (related just to technical flight operations). Imagine the volume of data they will have to look at.

But is all the Big Data usable? The answer should be NO. Data gets influenced (polluted), and it is a must to have clean, well-defined data for analysis; otherwise it can generate ambiguous outputs. Although research techniques and decision methodologies account for this, I believe a systematic approach is always needed to choose and filter data (or at least the data source).

Choose a source relevant to your goal: It’s important to choose a source that is relevant to your research. This source should be one of the factors on which your goal or analysis directly relies, not indirectly. Suppose we want to know how sales at a children’s shoe company are doing. On observing another data set, say sales of children’s clothing, we might see a correlation. However, it’s very important to understand that sales of shoes are not directly driven by sales of clothing; both may depend on, say, the number of children entering school, or their age.

Influenced (polluted) data: Data nowadays is available from many sources; however, not all of it can be used for analysis. For example, some quotes on blogs can be influenced and biased, and using such data is always a major threat to the authenticity and accuracy of your analysis. Many advertisements on TV and in newspapers also quote figures (such as "the internet runs x% faster on our network"). Even these figures should be scrutinized. One needs to ask: faster than which network? Is there any proof for these claims? If yes, what conditions apply? And so on.

Channels for data collection: Often the channel for data collection is not reliable. You must have observed people approaching you in a mall or a public place to fill out a short survey. Many times, if these channels are not trusted, they can manipulate data or simply extrapolate from a small set, so as to save on their effort. Choosing a trusted vendor is crucial here. If someone is trying a vendor for the first time, they must spot check and validate the survey work. This can be done with the help of an external or third-party agent or agency.
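A spot check like the one above usually means re-verifying a random slice of the vendor's responses. Here is a minimal sketch of that idea; the helper name, the response IDs, and the 10% fraction are all assumptions for illustration, not a standard procedure.

```python
import random

def spot_check_sample(responses, fraction=0.1, seed=42):
    """Pick a random subset of survey responses for independent
    re-verification (e.g. by a third-party agency).

    A fixed seed makes the draw reproducible, so the vendor and
    the auditor can agree on which records were checked.
    """
    rng = random.Random(seed)
    k = max(1, int(len(responses) * fraction))
    return rng.sample(responses, k)

# Usage: 200 hypothetical survey response IDs, re-verify 10% of them
responses = [f"resp-{i:03d}" for i in range(200)]
sample = spot_check_sample(responses)
print(len(sample))  # 20
```

If the re-verified answers disagree with the vendor's recorded ones beyond some tolerance, that is a signal the channel may be manipulating or extrapolating data.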


Time window and data punctuality: Often data is valid only for a small period. The data collection schedule should take this factor into account. A dedicated timeline for collection of data, and strict adherence to it, can help here. One more point to note is that the data source must also provide current data, not obsolete data. This factor may not apply to every analysis, but for many it can be really crucial.
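Enforcing such a validity window can be as simple as filtering out records older than a cutoff date before analysis. The record shape and the 30-day window below are assumptions chosen for the sketch.

```python
from datetime import date, timedelta

def fresh_records(records, as_of, max_age_days=30):
    """Keep only records collected within the validity window.

    `records` is assumed to be a list of (collected_on, value)
    pairs; anything older than `max_age_days` is dropped.
    """
    cutoff = as_of - timedelta(days=max_age_days)
    return [(d, v) for d, v in records if d >= cutoff]

records = [
    (date(2014, 8, 1), "recent reading"),
    (date(2014, 6, 1), "stale reading"),
]
print(fresh_records(records, as_of=date(2014, 8, 6)))
# [(datetime.date(2014, 8, 1), 'recent reading')]
```

The point is simply that punctuality is checkable: if the filter drops most of a source's records, that source is not keeping pace with your collection schedule.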


Data collection is the most fundamental step in any analysis, and doing it well can significantly improve the accuracy of your outputs (and hence your decision insight).