So much is happening around Big Data in almost every industry. The general conception of Big Data (or, rather, difficult data) puts a lot of onus on collecting data. Data has been generated for years, but with spreading awareness of its power, storing and recording it has become more and more crucial. The most basic prerequisite for any analysis is a pool of data, and companies that realize this are making efforts to store at least their internal data, if not external data (related to their clients). For an aviation company, a single half-hour flight can generate 30 terabytes of data from technical flight operations alone. Imagine the volume of data such a company can, and will, have to look at.
But is all this Big Data usable? The answer should be no. Data can be influenced (polluted), and clean, well-defined data is a must for analysis; otherwise the results can be ambiguous. Although research techniques and decision methodologies try to account for this, I believe a systematic approach is always needed to choose and filter data (or at least the data source).
Choose a source relevant to your goal: It's important to choose a source that is relevant to your research, one on which your goal or analysis depends directly rather than indirectly. Suppose we want to know how sales at a children's shoe company are doing. Observing sales data for children's clothing, we might see a correlation. However, it's important to understand that shoe sales are not directly driven by clothing sales; both may depend on a common factor, such as the number of children entering school or their ages.
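To make this concrete, here is a minimal sketch in Python, using entirely synthetic numbers (nothing here comes from real sales data), of how two series that share a common driver look strongly correlated even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical common driver: children entering school each month.
children_entering_school = rng.normal(1000, 100, size=24)

# Shoe and clothing sales both follow the driver, plus independent noise.
shoe_sales = 2.0 * children_entering_school + rng.normal(0, 50, size=24)
clothing_sales = 3.5 * children_entering_school + rng.normal(0, 80, size=24)

# The two series correlate strongly even though neither causes the other.
r = np.corrcoef(shoe_sales, clothing_sales)[0, 1]
print(f"correlation(shoes, clothing) = {r:.2f}")  # high, but spurious
```

Modelling shoe sales from clothing sales here would give a confident but misleading answer; the school-entry driver is the variable the analysis actually relies on.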
Influenced (polluted) data: Data nowadays is available from many sources, but not all of it can be used for analysis. For example, quotes on blogs can be influenced and biased, and using such data is a serious threat to the authenticity and accuracy of your analysis. Many advertisements on TV and in newspapers also cite figures (such as "the internet runs x% faster on our network"). Even these figures should be scrutinized. One needs to ask: faster than which network? Is there any proof for the claim? If so, what conditions apply?
Channels for data collection: Often the channel for data collection itself is flawed. You have probably seen people approaching you in a mall or other public place to fill out a short survey. If these channels are not trusted, they can manipulate the data or simply extrapolate from a small sample to save themselves effort. Choosing a trusted vendor is therefore crucial. When trying a vendor for the first time, spot-check and validate the survey work; this can be done with the help of an external or third-party agent or agency.
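A cheap programmatic spot check, alongside third-party validation, is to look for exact duplicate responses, a crude sign that a small sample was padded out. A minimal sketch follows; the field names and data are hypothetical:

```python
from collections import Counter

def flag_duplicate_responses(responses, fields):
    """Crude spot check: exact duplicates across all answer fields can
    indicate that a small sample was copy-pasted to pad the dataset."""
    keys = [tuple(r[f] for f in fields) for r in responses]
    counts = Counter(keys)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical survey rows; field names are made up for illustration.
responses = [
    {"age": 34, "brand": "A", "rating": 4},
    {"age": 34, "brand": "A", "rating": 4},
    {"age": 34, "brand": "A", "rating": 4},
    {"age": 27, "brand": "B", "rating": 2},
]
print(flag_duplicate_responses(responses, ["age", "brand", "rating"]))
# {(34, 'A', 4): 3} -> three identical rows: worth a manual follow-up
```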
Time window and data punctuality: Often data is valid only for a short period, and the data collection schedule should take this into account. A dedicated timeline for collecting the data, and strict adherence to it, helps curb the problem. Note also that the data source must provide current data, not obsolete data. This factor may not apply to every analysis, but for many it is really crucial.
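As a minimal sketch of enforcing such a validity window programmatically (the `collected_at` field name and the 30-day cutoff are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def within_validity_window(records, max_age=timedelta(days=30), now=None):
    """Keep only records whose timestamp falls inside the validity
    window; everything older is treated as obsolete."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r["collected_at"] <= max_age]

# Hypothetical records; timestamps are relative to 'now' for the demo.
now = datetime.now(timezone.utc)
records = [
    {"value": 10, "collected_at": now - timedelta(days=5)},
    {"value": 12, "collected_at": now - timedelta(days=90)},  # stale
]
print(within_validity_window(records))  # only the 5-day-old record remains
```

The right cutoff depends entirely on the analysis; the point is that staleness should be an explicit, checked rule rather than an assumption.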