1. Introduction
The world is changing, and this is the digital era. Almost everything around us is digitized, and information flows in huge volumes from a variety of sources: mobile phones, smart devices, surveillance cameras, astronomical sensors, weather-forecasting sensors, medical equipment, customer transactions on the internet, user behavior on the internet, and so on. This creates enormous amounts of data, with sizes from terabytes to petabytes, accumulating in daily or weekly transactions. This data is called “Big Data”, and it has provoked new research in the areas of information analysis, structuring, and visualization. One of the most successful methods for gaining insight from Big Data is Hadoop, which the pioneers of database management systems are adopting, with some added tools, to extract valuable information from this data, understand it better, and consequently take proper actions and decisions based on that understanding. In the following essay we are going to dig further into the definition of Big Data, its types and benefits, the challenges surrounding it, and the techniques used so far to address those challenges. Big Data is now the talk not of the town but of the IT market and of scientists; it needs not a few pages but a PhD research project to find better solutions to a problem that grows day by day in the digitized era, as more digitized devices infiltrate our daily life.
2. Types of big data
Big Data has varying definitions: some define it as the greater volume of today’s data, the new types of data and analysis, or the emerging requirements for more real-time information analysis [2]. Others argue that “Big” is not a fixed amount of data that could be predicted; what is big today may become small tomorrow. In the end we can say, according to a majority of researchers, that amounts of data between terabytes and petabytes are considered Big Data, although this threshold may grow over time. Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques [1]. Big Data comes in a variety of types, classified as structured data, unstructured data, text, and multimedia [2]. Sources of big data include social media, web and software logs, camera pictures and other log information, information-sensing mobile devices, aerial sensory technologies, genomics, and medical records [6].
3. Benefits of big data
Many companies look at Big Data as a source for better understanding and predicting customer behavior, and thus for improving the customer experience. Social media, transactions from banks and other sources, syndicated data from sources such as loyalty cards, and other customer-related information give companies valuable material for predicting customers’ preferences and needs; in other words, for building long-term businesses and customer services for decades. With this understanding, organizations of all types are finding new ways to connect with existing and potential customers. This approach applies to small businesses and to enterprises in telecommunications, healthcare, government, and banking, and to business-to-business interactions among partners and suppliers.
The benefits of Big Data include customer-centric objectives and many functional objectives that are being addressed through early applications of big data. Operational optimization, risk and financial management, employee collaboration, and the enabling of new business models are some of the benefits for both the customer and the producer.
A report released in May 2011 by McKinsey says that leading companies are using big data analytics to gain competitive advantage, and forecasts a 60% margin increase for retail companies that are able to harvest the power of big data [6]. Those companies perceived the importance of these huge amounts of data and realized that now is the time to take advantage of them [6].
4. Challenges in big data
Doug Laney was the first to develop the model of Big Data described by the “3 Vs”: volume, velocity, and variety. IBM later added a fourth V, veracity. The inclusion of veracity as the fourth big data attribute emphasizes the importance of addressing and managing the uncertainty inherent in some types of data, according to IBM [2]. I should add here that some researchers call the three “Vs” the “3 Ss”, where the first S is the source (the variety), the second is the speed (the velocity), and the last is the size (the volume). In the following sections we describe these challenges of Big Data.
1. Volume
The huge amount of data, ranging between terabytes and petabytes, is the main challenge facing Big Data. Volume refers to the mass quantities of data that organizations are trying to exploit to improve decision-making across the enterprise. Data volumes continue to increase at an unprecedented rate [2]. Traditional hardware and relational database processing are incapable of handling many tasks required by Big Data, including modeling the Earth’s climate, predicting the weather, receiving and analyzing the huge amounts of patient data collected in hospitals, diagnosing diseases, gathering information from the galaxy, and so on.
2. Velocity
The amount of data flowing every day into any enterprise is increasing exponentially, beyond what traditional systems can store and process. The speed of data creation, processing, and analysis also continues to rise, so data is always in motion from its creation through its processing phase to storage and retrieval [2].
Data streaming is becoming essential to almost every user’s internet activity nowadays, even on mobile devices such as phones and tablets. Data is continuously generated at a pace that is impossible for traditional systems to capture, store, and analyze: online video, location tracking using GPS, and augmented reality, among many other applications, depend on large amounts of fast-streaming data [1]. These services have become a challenge for many organizations, which need new methods of delivering them where the conventional methods are not suitable.
For time-sensitive processes such as real-time fraud detection or multi-channel “instant” marketing, certain types of data must be analyzed in real time to inform the business decisions that let the business improve. Where there is velocity, we should also talk about latency: the time from when the data is created until it is accessed and analyzed [2].
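The idea of analyzing fast-moving data within a bounded latency can be sketched with a sliding time window that keeps only recent events. The fraud scenario, the 60-second span, and the 1000 threshold below are illustrative assumptions, not taken from any particular product:

```python
from collections import deque
from time import monotonic

class SlidingWindow:
    """Keep only events from the last `span` seconds, so decisions are
    taken on fresh data and the latency between creation and analysis
    stays small."""
    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()          # (timestamp, amount) pairs

    def add(self, amount, now=None):
        now = monotonic() if now is None else now
        self.events.append((now, amount))
        while self.events and now - self.events[0][0] > self.span:
            self.events.popleft()      # drop stale data outside the window

    def total(self):
        return sum(amount for _, amount in self.events)

# Hypothetical fraud check: flag a card that spends > 1000 within 60 s.
window = SlidingWindow(60)
for t, amount in [(0, 400), (20, 500), (45, 300)]:
    window.add(amount, now=t)
print(window.total() > 1000)  # True: 1200 spent inside one minute
```

A real streaming engine distributes such windows across many machines, but the latency trade-off it manages is the same one described above.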
3. Variety
Variety simply refers to the different types of data and data sources. The data stored and processed every day comes in many forms. In the past, the data to be processed consisted of personal documents, financial transactions, stock records, and so on. Today we have audio, video, graphics, 3D models, location data, and many other complex data types that need to be stored, delivered, or processed. This unstructured Big Data is therefore not easy to categorize with traditional methods for handling huge amounts of data, and in reality it is messy and needs cleansing before any analysis can be applied [1].
We can simply say that variety is about managing the complexity of multiple data types: structured, semi-structured, and unstructured. Organizations need to integrate and analyze data from both traditional and non-traditional information sources, from within and outside the enterprise. With the expanding use of sensors, smart phones, and social collaboration technologies, data is generated in a variety of forms, including text, web data, tweets, sensor data, audio, video, click streams, log files, and more, as discussed in the report from IBM [2].
4. Veracity
Veracity refers to data uncertainty and the level of reliability associated with certain types of data. One of the critical requirements of Big Data is quality; on the other hand, the available tools cannot always purify data of its inherent unpredictability, for example in weather forecasting, finance, or customers’ buying attitudes [2]. Many organizations sit on huge piles of data, and in many cases the managers themselves cannot trust the analysis of that data. It is important that managers understand this uncertainty in Big Data, so that they can take proper decisions in a continually changing environment. Opportunities to use big data technology and analytics to improve decision-making and performance exist in every industry, and managers should be aware of these capabilities. Take, as an example of Big Data uncertainty, generating energy from natural resources: the amount of data generated about the wind is huge, but we still cannot predict the full picture precisely, because we cannot predict the behavior of the weather, the winds, and the clouds. Despite that, large amounts of this data can still be valuable and useful as a basis for decisions about future power production. So how do you plan when all these uncertainties are in place? Analysts say through data fusion, in which multiple less reliable sources are combined to create a more useful data point; an example would be social comments appended to geospatial location information. Another way to manage uncertainty is to use advanced mathematics such as fuzzy logic and robust optimization techniques.
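The data fusion idea mentioned above can be sketched very simply: combine several unreliable readings into one more useful point by weighting each source by its reliability. The wind-speed values and the weights below are hypothetical, chosen only to illustrate the technique:

```python
def fuse(readings):
    """Data fusion sketch: combine several less reliable sources into
    one more useful data point via a reliability-weighted average.
    `readings` is a list of (value, reliability_weight) pairs."""
    total_weight = sum(weight for _, weight in readings)
    return sum(value * weight for value, weight in readings) / total_weight

# Hypothetical wind-speed estimates (m/s) with reliability weights in [0, 1].
sources = [(12.0, 0.9),   # calibrated weather station, high reliability
           (15.0, 0.4),   # crowd-sourced report, lower reliability
           (11.0, 0.7)]   # older sensor, moderate reliability
estimate = fuse(sources)
print(round(estimate, 2))  # 12.25: pulled toward the more reliable sources
```

Real fusion systems use far richer models (Kalman filters, Bayesian updating), but the principle is the same: less reliable inputs contribute less to the final data point.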
5. Technique/approach to overcome the challenges
The three “Vs” of volume, velocity, and variety are the main challenges of Big Data, and overcoming them requires new technology beyond the traditional methods of the relational database systems used today. One of the approaches used to overcome the issues of Big Data is the Hadoop project, an open source project from Apache, developed as a set of software libraries that provide reliability, scalability, and distributed computing. This technology is able to handle Big Data processing and analytics. It is worth mentioning that Hadoop is widely used at large scale by most Big Data pioneers, such as LinkedIn, which generates over 100 billion personalized recommendations every week as mentioned in the source [1], and others such as Twitter as well.
To dig further into the mechanism used by Hadoop, I am going to use a simple explanation. The large data set is fragmented into smaller sets, which are then scattered across a cluster of servers that do the computation using a simple programming method. The number of servers may range from a few hundred to around two thousand, or maybe more. What is new in this computation method is that Hadoop detects and compensates for any hardware failure at the application level, whereas the traditional method depends on expensive servers. This guarantees continuity of the services delivered if any server in the cluster fails. In this way the computing over the mass of data is distributed among servers in a low-cost and effective way [1].
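The fragment-and-distribute idea, including compensating for a failed server at the application level, can be sketched in a few lines. The chunk size, server names, and round-robin placement below are illustrative assumptions, not Hadoop’s actual scheduler:

```python
def split_into_chunks(records, chunk_size):
    """Fragment the large data set into smaller sets, as described above."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]

def plan_cluster(chunks, servers, failed=()):
    """Assign each chunk to a server round-robin; any chunk that would
    land on a failed server is reassigned to a healthy one, so no work
    is lost when hardware fails."""
    healthy = [s for s in servers if s not in failed]
    assignment = {}
    for idx, chunk in enumerate(chunks):
        server = servers[idx % len(servers)]
        if server in failed:                     # compensate for the failure
            server = healthy[idx % len(healthy)]
        assignment.setdefault(server, []).append(chunk)
    return assignment

chunks = split_into_chunks(list(range(10)), 2)   # 5 chunks of 2 records
plan = plan_cluster(chunks, ["s1", "s2", "s3"], failed=("s2",))
# Every record is still covered even though server s2 is down.
assert sum(len(c) for cs in plan.values() for c in cs) == 10
```

Hadoop additionally replicates each block on several nodes so the reassigned work can read a local copy, which is what makes this recovery cheap on commodity hardware.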
The two key elements of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce. The first provides the high-bandwidth, cluster-based storage needed for Big Data processing. The second is the data processing framework. MapReduce is based on Google’s search technology and maps large data sets across the cluster servers. The overall data set is processed in parts, with each server doing its share and producing a summary; all the summaries are then aggregated in the “Reduce” stage. In this way the data is pre-processed before traditional data analysis tools are applied [3].
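The map-then-summarize-then-aggregate flow described above can be illustrated with the classic word-count example, here simulated on a single machine (each string stands in for a fragment that a separate server would process):

```python
from collections import defaultdict

def map_phase(fragment):
    """Map step: each server emits a (word, 1) pair for every word
    in its fragment of the data set."""
    return [(word.lower(), 1) for word in fragment.split()]

def shuffle(mapped_pairs):
    """Group the intermediate pairs by key, as Hadoop does between
    the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate the partial summaries into final totals."""
    return {key: sum(values) for key, values in groups.items()}

fragments = ["big data is big", "data is moving fast"]
mapped = [pair for frag in fragments for pair in map_phase(frag)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'moving': 1, 'fast': 1}
```

In real Hadoop the map tasks run on the nodes that already hold the data blocks, and the shuffle moves only the compact intermediate pairs, which is why the approach scales to very large inputs.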
Let us walk through the technical side a little. In the following illustration, Hadoop consists of two parts, HDFS and MapReduce. The lower layer contains the name node, which stores the metadata, that is, the information about the smaller pieces of actual data that are processed in the data nodes. In the higher layer there is the job tracker, which decides what piece of data will run and where. The final part is the task tracker, which runs the code [4].
Figure 1
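The division of roles in Figure 1 can be sketched as a toy model (these dictionaries and the `job_tracker` function are illustrative, not real Hadoop APIs): the name node holds only metadata about which node stores which block, while the job tracker prefers to run each task on the node that already holds the data.

```python
# Name node: metadata only -- which blocks make up each file.
name_node = {"weblog.txt": ["block-1", "block-2"]}

# Data nodes: the actual block contents live here.
data_nodes = {"node-A": {"block-1": "GET /home ..."},
              "node-B": {"block-2": "GET /cart ..."}}

def job_tracker(filename):
    """Decide where each block of the file should be processed,
    choosing the data node that already stores it (data locality);
    the task tracker on that node would then run the code."""
    plan = []
    for block in name_node[filename]:
        for node, blocks in data_nodes.items():
            if block in blocks:
                plan.append((node, block))
    return plan

print(job_tracker("weblog.txt"))
# [('node-A', 'block-1'), ('node-B', 'block-2')]
```

Keeping the metadata separate from the blocks is what lets HDFS answer “where is my data?” quickly while the bulky content stays distributed across the cluster.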
Let us see the differences between the conventional way of processing data and the way Hadoop’s MapReduce does it. The following table shows the differences in terms of access, updates, structure, integrity, and scaling. We notice that in MapReduce the data is always moving and dynamic, writing is discouraged, and the data can scale to higher volumes.
Figure 2
Microsoft has adopted Hadoop with some modifications to give it an easy, user-friendly interface, and added some connectors to make it a Microsoft-like product. Some of the tools that Microsoft uses to deal with Big Data are Power View, PowerPivot in Excel, and SharePoint [5].
These are some of the tools usually used with BI to gain insight from structured data. In addition, Microsoft is implementing Hadoop on Windows Azure and Windows Server as well. It created JavaScript libraries and a framework for Hadoop and entered a partnership with Hortonworks. Moreover, Microsoft provided ODBC drivers and a Hive add-in for Excel to deal with Big Data; the ODBC drivers enable third-party applications to integrate with Hadoop on Windows systems [4]. Microsoft’s solution for Big Data is shown in the following illustration.
Figure 3
As shown in the diagram above, the data may be structured (ERP, CRM, LOB, apps) or unstructured from different sources (sensors, devices, bots, crawlers). If it is structured, it is stored in the Enterprise Data Warehouse; otherwise it is moved to the upper layer to be processed with Hadoop on a Windows platform, Windows Server or Azure. It is then processed using SQL Server Analysis Services or SQL Server Reporting Services on the Business Intelligence platform, to be analyzed and to gain insight from all this mixed huge data. At the end, the output is visualized with Excel PowerPivot, Power View, predictive analytic tools, or embedded BI tools, all of them Microsoft tools that the user is familiar with [5].
Oracle is also among the pioneers developing methods to solve the Big Data issue. They have developed Oracle Big Data Connectors, Oracle Loader for Hadoop, and Oracle Data Integrator [6]. In addition, statistical and analysis capabilities such as the open source project R and Oracle R Enterprise have been developed to take advantage of Hadoop’s capabilities. Oracle looks at traditional data and has created tools to facilitate understanding it and gaining insight. The following figure shows traditional data from Oracle’s perspective.
Figure 4
Oracle added a new mechanism, using Hadoop technology together with its proprietary analytics and BI tools, to deal with Big Data. See the following figure.
Figure 5
Many Big Data pioneers deploy the old and new approaches in parallel, that is, using Hadoop alongside the traditional way. It is also expected that Hadoop will replace other data processing methods and become the dominant solution for Big Data.
Big Data will progress as artificial intelligence advances and as new types of computer processing power become available, such as quantum computing, which uses quantum mechanical states and is theoretically expected to excel at the parallel processing of unstructured data [3]. Other technologies used in big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, and cloud-based infrastructure. Almost none of these technologies is new, but there are enhancements in how they are used with Big Data.
Big Data requires high-speed transactions, analysis, and retrieval of data, so it needs high-capacity hard drives such as SATA drives and/or high-speed storage such as solid-state disks (SSDs), which are memory-based hard disks. These storage systems sit inside the parallel processing nodes used with Big Data.
6. Conclusion
In the past decade, information became a dominant factor in our daily life. Everything surrounding us is digitized, and the data keeps growing and moving all the time. These huge amounts of continuously moving and changing data became unpredictable and hard to understand, because they are not organized in a way that lets us benefit from them. Companies’ main interest in the past was to capture whatever information about customer behavior they could, or as much data as the medical equipment could take, or as much information from the galaxy as the sensors could take; but then we come to the question, “what are we going to do with all these piles of data?” We have now reached an era in which companies need to take advantage of all these data in a way that extracts their insight and value. Hadoop, which grew out of the algorithm Google began building in 2004 and which is open source, is now the major player. The pioneers of database processing, such as Microsoft, Oracle, and IBM, are now in a fast race to adopt it and to integrate it with their proprietary analytical and BI tools. The race continues, and at the core of all these developments is parallel processing using Hadoop.
We do not yet know what the future hides for dealing with Big Data: will it be solved with new processing algorithms, or with new hardware adopting the latest technologies? Is it going to be the issue that jumps to the front of the queue, ahead of cloud computing? Will artificial intelligence play a role in developing a dynamic algorithm that can cope with fast-moving dynamic data? The question is wide open, and the future is expected to bring us more. What is important is that the key information architecture principles stay the same, while the tactics for applying them differ from one company to another. We should look at Big Data as an asset that will bring a better future for us if we truly gain the insight of it.
7. References
[1] Explaining Computers. “Big Data.” Retrieved June 2, 2013, from http://www.explainingcomputers.com/big_data.html
[2] IBM Executive Report (2013). Retrieved June 4, 2013, from http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03519usen/GBE03519USEN.PDF
[3] Wikipedia. “Big data” (2013). Retrieved May 29, 2013, from http://en.wikipedia.org/wiki/Big_data
[4] YouTube. Retrieved May 29, 2013, from https://www.youtube.com/watch?v=HM0YX7mpplk
[5] Microsoft. Big Data Solution Brief. http://download.microsoft.com/download/F/A/1/FA126D6D-841B-4565-BB26-D2ADD4A28F24/Microsoft_Big_Data_Solution_Brief.pdf
[6] Oracle. Retrieved June 4, 2013, from http://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf