BIG DATA
We have heard the term “flood” used with money, people and new technologies, and it has now given rise to a term for the flooding of data. In 1965, Moore observed that the number of transistors on a dense integrated circuit doubles approximately every two years. This prediction has held, giving us devices that fit in our hands yet serve as personal computers. We are surrounded by electronic machines; without realising it, we are monitored at many moments of the day by devices such as mobile phones, CCTV cameras, weighing machines and computers, and the list never ends. All of these generate data, and its sheer volume has led us to define the term Big Data.
Big Data is a vast repository of data whose size is beyond the ability of conventional database software to handle. The size of such a repository cannot be fixed; it grows every second. For instance, the social networking site Facebook collects 500+ terabytes of data every day. But this data is just a collection of facts and events from day-to-day life; data by itself neither lies nor tells the truth. We need to understand what the data can tell us, and that extracted idea is called information: information tells us what the data means. Information, in turn, is useless unless we can act on it, so we need to provide insights into how it can be used to achieve our goals.
Data has swept into every industry and business sector. The McKinsey Global Institute (MGI) estimated that in 2010 enterprises globally stored 7 exabytes of data while consumers stored 6 exabytes. MGI also estimated that if US health care used big data effectively, the potential value from data in this sector could exceed $300 billion every year. European organisations hold approximately 11 exabytes of data, and making efficient use of it could generate nearly $149 billion in operational efficiency improvements. In the near term there is also huge potential to leverage big data in developing countries.
Organisations can leverage big data to improve their design and functionality. Big data can create value in several ways. It can create transparency: simply making data available to stakeholders in a timely manner can create tremendous value, and making data readily available to all departments can sharply reduce search and processing time. It can also support experimentation to discover needs: as most data is stored in digital form, one can discover whether a product needs to be changed. Big data enables us to segment populations according to their needs and to deliver customised actions. It can support decision making with automated algorithms, minimising risk and digging up valuable insight; for instance, tax agencies can use automated risk engines. Manufacturers are using
data obtained from current products to improve the development of new ones. Big data has created entirely new categories of companies, such as those that aggregate and analyse industry data and provide useful information and insight to manufacturing or financial companies. The value of big data can be measured by estimating the total value created from taking particular actions with its help. But to capture the full potential of big data, several issues have to be addressed. Organisations must consider the legal aspects of handling and analysing data; there is no room for an information breach, which can have serious consequences. An organisation that handles a nation’s data needs to watch whether any information is exposed, as this could cause the nation very high losses. Similar care is needed when analysing health care data, as a wrong prescription might cost someone’s life. Companies also have to hire new personnel who understand big data.
An abundant variety of technologies has been developed that can be applied to big data to get useful insight from it, and researchers continue to develop new techniques for analysing it. The set of techniques used to extract patterns from large databases by combining methods from statistics, machine learning and database management is called Data Mining. The technique used to discover interesting relationships among different variables in large databases is Association rule learning, which can help determine which products are frequently bought together. To study the buying behaviour of customers, or to determine the products they consume most, a technique called Classification is used: it categorises existing data so that new data can be predicted from the already classified data, and it is a form of Supervised Learning. Cluster Analysis is another statistical method, used to group similar objects whose characteristics of similarity are not known in advance. A technique for collecting large amounts of data from a crowd through open calls is known as Crowdsourcing. Analysing data from a single source might not be of great use, so it is more effective to take multiple sources into consideration, which is called data fusion and data integration. Natural language processing can be used to analyse data from social media websites such as Twitter and Facebook. The idea from natural evolution of “survival of the fittest” can be used to optimise the parameters of business or manufacturing models in what is called a Genetic Algorithm. Finally, a technique called spatial analysis is used to analyse geographic properties, which helps, for example, in selecting manufacturing sites.
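To make association rule learning concrete, it can be reduced to counting co-occurrences in transaction data. The sketch below uses hypothetical basket data and computes the support and confidence of simple one-item rules; it is a minimal stand-in for full algorithms such as Apriori, not a production implementation.

```python
from itertools import combinations

# Hypothetical transaction data: each row is one customer's basket.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "diapers"},
    {"bread", "milk", "butter", "diapers"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the rule holds among transactions where it applies."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Report single-item rules A -> B above a minimum support threshold.
items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    rule_sup = support({a, b}, transactions)
    if rule_sup >= 0.4:
        conf = confidence({a}, {b}, transactions)
        print(f"{a} -> {b}: support={rule_sup:.2f}, confidence={conf:.2f}")
```

A rule like “bread → butter” with high support and confidence is exactly the “frequently bought together” insight the text describes.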
Many technologies have been developed to support these data-mining techniques. Google developed BigTable to store data in compressed form on the Google File System. Business Intelligence (BI) refers to application software used to report, analyse and present data; it reads data previously stored in a data warehouse and enables operations on it. The computing paradigm that provides highly scalable computing resources through the network is called cloud computing. An open-source software framework for processing huge datasets on distributed systems, maintained by the Apache Software Foundation, is named Hadoop. Data can be structured, that is, residing in fixed fields as in a spreadsheet; in contrast, unstructured data includes free-form text, untagged audio, and image and video data. All the analysis and information drawn from the data will be in vain if we are unable to present it to people in an easily consumable form, so proper visualisation is a key challenge that must be met if the results of the analyses are to drive action.
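Hadoop’s processing model is MapReduce: a map step emits key–value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. The toy word count below simulates those three phases in plain Python on in-memory strings; it illustrates the programming model only, and none of it is real Hadoop API.

```python
from collections import defaultdict

# Toy documents standing in for blocks of a distributed file.
documents = [
    "big data needs big tools",
    "hadoop processes big data",
]

def map_phase(doc):
    """Map step: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

all_pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(all_pairs))
print(word_counts)
```

In a real cluster the map and reduce functions run in parallel on many machines, which is what lets the same pattern scale to huge datasets.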
Before the internet revolution, the methods used to mine data were restricted to small data sets with little variability in data types. In the information age, accumulating data has become easier and cheaper; it is estimated that the amount of information stored doubles every twenty months, so making effective use of this data is a new challenge we must face. The automatic, exploratory analysis and modelling of large data repositories is known as Knowledge Discovery in Databases (KDD). It is an approach to identifying understandable patterns in large data sets. The KDD process starts with determining the goal of a particular project and ends with the implementation of the discovered knowledge, and it comprises nine steps, beginning with a managerial one. First, an understanding of the application domain is developed; this sets the scene for deciding what should be done with the available decision variables. The people involved in this step need to understand the requirements of the end-user and the environment in which the knowledge discovery will take place. Next, the data sets must be pre-processed. We determine which data are available and which will be used for the particular application; for the process to succeed we should consider all the relevant data available, because if some attributes are missed the whole process may fail. We should then clean the data, that is, handle missing values and remove noise and outliers. This step enhances data reliability. For example, if one suspects that a certain less important attribute is unreliable or has many missing values, ignoring that attribute is the smart choice; but if that attribute is dominant for the application, we can make it the target of a supervised data-mining algorithm and predict the missing values. Certain attributes may not affect the goal of the application at all, yet a person may not spot them. After cleaning comes the data transformation step, which includes methods such as data reduction (record sampling and feature selection) and attribute transformation (discretisation of numerical attributes and functional transformations). This step is crucial to the success of the entire KDD process but is usually very project-specific. Having completed these four steps on the data, we focus on the algorithmic aspects of the project, the data-mining part. First we choose the appropriate data-mining task: for example, is it regression, classification or clustering? There are two major goals in data mining, prediction and description: prediction is considered supervised data mining, while description includes unsupervised learning and visualisation. Most data-mining techniques are based on inductive learning, where a model is constructed by generalising from a sufficient number of training samples. Next we choose the data-mining algorithm, the specific method used to search for patterns. For example, precision is better with a neural-network approach, while decision trees are the better choice for understanding the attributes. Each algorithm has parameters and learning tactics, such as cross-validation or a division into training and testing sets. Then the chosen algorithm is employed; we may have to run it several times to obtain satisfying results, for instance by tuning its control parameters. The final step of the data-mining part is the evaluation and interpretation of the mined patterns with respect to the goal defined in the first step. Here we also evaluate the pre-processing steps with respect to their effect on the results of the data-mining algorithm, and the discovered knowledge is documented for further use. Lastly, we make proper use of the discovered knowledge; the success of this step determines the effectiveness of the entire KDD process. Since the results will now be used in real life, many challenges must be considered, such as losing the laboratory conditions under which we have operated: the knowledge was discovered from certain snapshots (samples) of the data, but the live data are dynamic.
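The cleaning step described above, predicting a missing value by making the unreliable attribute the target of a supervised algorithm, can be sketched very simply. The example below uses hypothetical (age, income) records and a tiny nearest-neighbour predictor; a real KDD project would use a proper model, but the idea is the same.

```python
# Hypothetical records of (age, income); some incomes are missing (None).
records = [
    (25, 30000), (30, 40000), (35, 50000),
    (40, None), (45, 70000), (50, 80000),
]

def impute_income(records, k=2):
    """Fill each missing income from the k records closest in age,
    treating the incomplete attribute as the target of a small
    supervised (nearest-neighbour) prediction."""
    known = [(age, inc) for age, inc in records if inc is not None]
    filled = []
    for age, inc in records:
        if inc is None:
            # Average the incomes of the k nearest known ages.
            neighbours = sorted(known, key=lambda r: abs(r[0] - age))[:k]
            inc = sum(n[1] for n in neighbours) / k
        filled.append((age, inc))
    return filled

clean = impute_income(records)
```

This keeps a dominant attribute in play instead of discarding it, at the cost of introducing model-based estimates into the data.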
Data mining is classified into two subgroups: verification and discovery. Discovery methods are those that automatically identify hidden patterns in the data, and they branch into prediction and description. Description methods are oriented to data interpretation, focusing on understanding how the data relate to each other (for example through visualisation). Prediction-oriented methods aim to automatically build a behavioural model that can predict the values of one or more variables for new, unseen samples, and this can also deepen our understanding of the data. Verification methods deal with evaluating a hypothesis proposed by an external source, such as an expert. They are drawn from traditional statistics and include tests of hypotheses (e.g., the t-test of means), goodness-of-fit tests and analysis of variance (ANOVA). These methods are less central to data mining, as most data-mining tasks are concerned with discovering a hypothesis (out of a very large set of hypotheses) rather than testing one that is already known. Among the discovery-based methods, prediction is also called supervised learning, as opposed to unsupervised learning. Unsupervised learning generally maps high-dimensional data to a reduced dimension and groups data without prespecified, dependent attributes. It covers a portion of the description methods: for instance, it covers clustering methods (such as K-means, K-medoids and Adaptive Resonance Theory (ART) 2) but not visualisation methods. Supervised methods try to discover the relationship between input attributes and a target attribute. It is useful to distinguish between two kinds of supervised models: classification models and regression models. Regression maps the input space into a real-valued domain; for example, a regressor can predict the demand for a certain product given its characteristics. A classifier, on the other hand, maps the input space into predefined classes.
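The distinction between the two supervised models can be shown on one invented dataset: the same input (hours studied) is mapped to a real-valued target by a regressor and to a predefined class by a classifier. The least-squares line and 1-nearest-neighbour rule below are deliberately minimal illustrations, not recommended production models.

```python
# Tiny invented training set: hours studied per student.
hours  = [1, 2, 3, 4, 5, 6]
scores = [35, 45, 50, 60, 70, 80]   # real-valued target (regression)
passed = [0, 0, 0, 1, 1, 1]         # predefined classes (classification)

def fit_line(xs, ys):
    """Least-squares regressor: maps input space to a real-valued domain."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def classify(x, xs, labels):
    """1-nearest-neighbour classifier: maps input to a predefined class."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return labels[nearest]

slope, intercept = fit_line(hours, scores)
predicted_score = slope * 4.5 + intercept       # regression: a real number
predicted_class = classify(4.5, hours, passed)  # classification: 0 or 1
```

The regressor answers “what score?”, a point on a continuum; the classifier answers “pass or fail?”, one of a fixed set of labels.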
So we can say that data mining is a new science consisting of techniques and methods drawn from statistics, artificial intelligence, machine learning and database systems.
Reforming the US health care service, in order to reduce the rate at which its costs have been increasing and to sustain its current strength, is critical to the United States both as a society and as an economy. The challenges it faces can be addressed by emulating and implementing best practices in health care, which may require analysing large datasets. MGI has identified different levers through which the US health care sector can generate value and bring down its spending. One of them is developing personalised medicine, which will produce value in the R&D arena. The goal of this application is to examine the relationships among genetic variation, predisposition for specific diseases and specific drug responses, and then to account for the genetic variability of individuals in the drug development process. Personalised medicine holds the promise of improving health care in three main ways: earlier detection and diagnosis; more effective therapies, because patients with the same disease often do not respond in the same way to the same therapy; and the adjustment of drug dosages according to a patient’s molecular profile to minimise side effects and maximise response. To thoroughly understand the structure of any disease, however, one needs to consider all the available clinical data, which is massive in volume, so this must be done through proper modelling and the efficient application of data-mining methods. This lever has already proved successful in the early detection of breast cancer.
Governments in many parts of the world are under increasing pressure to raise their productivity, and big data can offer them a powerful arsenal of strategies and techniques for boosting productivity and achieving higher levels of effectiveness. The public sector poses particular challenges because it is very diverse in its functions and budgets. MGI focused on administration in two types of government agency, tax and labour; these agencies collect data on a large scale from different sectors but can face significant performance challenges. For instance, Europe’s public sector accounts for almost half of its GDP, and this high share of economic output puts considerable long-term strain on Europe’s budgets. It has been estimated that by 2025 over 30 percent of the population in mature economies across the globe will be aged 60 or over, so social security, health care and pensions will face increasing demand.
As big data and its levers become increasingly valuable assets, their use will become a key basis of competition across sectors, so it is important for organisational leaders to incorporate big data into their business plans. Along with sufficient skills in back-office analytics, leaders need to manage a transition towards the right managerial talent on the front line. They should understand the data assets they hold or could gain access to: an organisation should keep an inventory of its own data and systematically catalogue other data it could access, such as government data and internet data. There may also be third parties who have not considered sharing their data, so the organisation needs to thoughtfully present a compelling value proposition to gain access to it. Leaders should consider adopting a process of purposeful experimentation, which can be a powerful path to leveraging big data, rather than specifying a complete plan before any implementation: one can start with a few high-potential areas in which to experiment with big data and then scale to larger domains. A sophisticated leader will first apply techniques such as “scrubbing” to the data, which generates, structures and organises it and so improves its quality. Next, the data should be made easily accessible to all departments of the organisation through networks. Then basic, simple analytics can be applied, that is, techniques that do not require customised analyses designed by people with deep analytical skills. The fourth and highest level is applying advanced, complex analytics, such as automated algorithms and real-time data analysis, which can create new business models. Leaders should build a team with deep analytics capability to supply the company with new information and new insight for further business growth, and they will themselves need a baseline understanding of these analytic techniques to become effective users of such analyses. A lack of a customer-centric view can limit an organisation’s ability to use any big data lever to create new value, so organisations may need to invest in IT hardware, software and services to capture, store, organise and analyse large datasets. Data privacy and security will become paramount as data travels across boundaries for various purposes: privacy not only requires compliance with laws and regulations but is also fundamental to an organisation’s trust relationship with its customers and partners. Organisational leaders will also have to wrestle with legal issues relating to their stance on intellectual property for data.
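The “scrubbing” step mentioned above can be as simple as trimming whitespace, normalising case and removing duplicate records. The sketch below, on hypothetical customer records with an invented schema, shows one minimal pass of this kind; real scrubbing pipelines handle far more cases.

```python
# Hypothetical raw customer records collected by several departments.
raw = [
    {"name": " Alice Smith ", "email": "ALICE@EXAMPLE.COM"},
    {"name": "alice smith",   "email": "alice@example.com"},
    {"name": "Bob Jones",     "email": "bob@example.com "},
]

def scrub(records):
    """Minimal scrubbing pass: trim whitespace, normalise case,
    and drop duplicates keyed on the cleaned e-mail address."""
    seen, clean = set(), []
    for r in records:
        email = r["email"].strip().lower()
        if email in seen:
            continue  # duplicate of a record we already kept
        seen.add(email)
        clean.append({"name": r["name"].strip().title(), "email": email})
    return clean

customers = scrub(raw)
```

Making scrubbed data like this available across departments is what enables the simpler analytics levels that follow.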
A significant constraint on realising the value of big data will be a shortage of talent, particularly people with expertise in statistics and machine learning. MGI has estimated that by 2018 the demand for people with deep analytical talent in the US could exceed its projected supply by 50 to 60 percent. USA Today has described it as the sexiest job of the 21st century. Current trends indicate that 4,000 new positions are being created annually, perhaps significantly more. This has brought a new wave to the market, as most sectors want to gain more from big data. Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”