BIG DATA
Table of Contents
1. Introduction
2. What is Big Data?
3. Origins of the Concept
4. Big Data: Basic Concept
5. What is Big Data? The 3 Vs of Big Data
6. Big Data versus Small Data
7. Why Big Data is Important
8. Big Data at the Edge
9. Big Data: Opportunities and Challenges
10. Big Data Storage
11. Big Data Processing
12. Advantages and Disadvantages of Big Data
13. Applications of Big Data
14. Conclusion
15. References
Abstract
Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process them using traditional data processing
applications. Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process data within a
tolerable elapsed time. Big data size is a constantly moving target, currently
ranging from a few dozen terabytes to many petabytes of data. Big data is also a set
of techniques and technologies that require new forms of integration to uncover large
hidden values in datasets that are diverse, complex, and of massive scale. Big data
has likewise been defined as "a large volume of unstructured data which cannot be
handled by standard database management systems such as DBMS, RDBMS or ORDBMS". This
report focuses on the management and processing of Big Data, combining business
requirements with a services platform to analyze the dataset.
Introduction:
People, devices and networks are constantly generating data. When users stream
videos, play the latest game with friends, or make in-app purchases, their activity
generates data about their needs and preferences, as well as their quality of
experience (QoE). Even when users put their devices in their pockets, the network
generates location and other data that keeps services running and ready to use.
As a result, the rate of mobile network data traffic growth is increasing rapidly. It is
estimated that by 2020, the number of smartphone subscriptions will have increased
from today’s 2.7 billion to 6.1 billion, and the total amount of mobile traffic generated by
smartphones will be five times that of today.
The big-data-driven telecom analytics market alone is expected to have a compound
annual growth rate of nearly 50 percent – with annual revenues expected to reach USD
5.4 billion at the end of 2019.
Communication service providers (CSPs) can make use of this big data to drive a wide
range of important decisions and activities. These include: designing more competitive
offers and packages; recommending the most attractive offers to subscribers during the
shopping and ordering process; communicating with subscribers about their usage,
spending and purchase options; configuring the network to deliver more reliable
services; and monitoring QoE to proactively correct any potential problems. All these
activities enable improved user experience, increased customer satisfaction, smarter
networks and extended network functionality to facilitate progress into the Networked
Society.
The profound impact that increased broadband networking will have on society will also
create business opportunities in new areas for CSPs. Improved real-time connectivity
and data management enables the creation of tailored data sets, readily available for
analysis and machine learning. This enables data-driven efficiency improvements in
several business areas – for example, transport, logistics, energy, agriculture and
environmental protection. Furthermore, decision making in business and society will be
facilitated by access to insights based on more accurate and up-to-date data.[1]
What is Big Data?
Big Data is the ocean of information we swim in every day: vast sources of data
flowing from our computers, mobile devices, and machine sensors. Big Data is being
generated by everything around us at all times. Every digital process and social media
exchange produces it, while systems, sensors, and mobile devices transmit it. New
sources of data come from a variety of machines, such as website interactions, search
engine optimizations, and social business sites, by using clickstream data. These
changing business requirements demand that the right information be available at the
right time.[3]
Origins of the concept:
A decade ago, data storage scalability was one of the major technical issues data
owners faced. Since then, a new brand of efficient and scalable technology has been
incorporated, and data management and storage are no longer the problem they used to
be. In addition, data is constantly being generated, not only by use of the internet but
also by companies producing large amounts of information from sensors, computers and
automated processes. This phenomenon has recently accelerated further thanks to
the increase in connected devices (which will soon become the largest source of data)
and the worldwide success of social platforms.
Significant Internet players like Google, Amazon, Facebook and Twitter were the first
to face these increasing data volumes "at internet scale" and designed ad-hoc
solutions to cope with the situation. Those solutions have since partly migrated into
the open source software communities and have been made publicly available. This was
the starting point of the current Big Data trend, as it offered a relatively cheap
solution for businesses confronted with similar problems. Meanwhile, two parallel
breakthroughs have further accelerated the adoption of solutions for handling Big
Data:
The availability of cloud-based solutions has dramatically lowered the cost of
storage, amplified by the use of commodity hardware. Virtual file systems, either open
source or vendor specific, helped the transition from a managed infrastructure to a
service-based approach.
When dealing with large volumes of data, it is necessary to distribute data and
workload over many servers. New database designs and efficient ways to support
massively parallel processing have led to a new generation of products such as the
so-called NoSQL databases and the Hadoop MapReduce platform.
The table below summarizes the main features and problems connected to handling
different types of large data sets, and explains how Big Data technologies can help
solve them.[2]
Aspect: Volume
Characteristics: The most visible aspect of Big Data, referring to the fact that the
amount of generated data has increased tremendously in the past years. However, this
is the least challenging aspect in practice.
Challenges and technology responses: The natural expansion of the internet has created
an increase in global data production. A response to this situation has been the
virtualization of storage in data centres, amplified by a significant decrease in the
cost of ownership through the generalization of cloud-based solutions. The NoSQL
database approach is a response to storing and querying huge volumes of heavily
distributed data.

Aspect: Velocity
Characteristics: This aspect captures the growing data production rates. More and more
data are produced and must be collected in shorter time frames.
Challenges and technology responses: The daily addition of millions of connected
devices (smartphones) will increase not only volume but also velocity. Real-time data
processing platforms are now considered by global companies as a requirement for
gaining a competitive edge.

Aspect: Variety
Characteristics: With the multiplication of data sources comes an explosion of data
formats, ranging from structured information to free text.
Challenges and technology responses: The necessity to collect and analyse
non-structured or semi-structured data goes against the traditional relational data
model and query languages. This reality has been a strong incentive to create new
kinds of data stores able to support flexible data models.

Aspect: Value
Characteristics: This highly subjective aspect refers to the fact that, until
recently, large volumes of data were recorded (often for archiving or regulatory
purposes) but not exploited.
Challenges and technology responses: Big Data technologies are now seen as enablers to
create or capture value from otherwise not fully exploited data. In essence, the
challenge is to find a way to transform raw data into information that has value,
either internally or for making a business out of it.[2]
Big Data: Basic Concept:
Big Data encompasses everything from clickstream data from the web to genomic and
proteomic data from biological research and medicine. Big Data is a heterogeneous mix
of structured data (traditional datasets, in rows and columns, like DBMS tables, CSVs
and XLSs) and unstructured data such as e-mail attachments, manuals, images, PDF
documents, medical records (x-rays, ECGs and MRI images), rich media like graphics,
video and audio, contacts, forms and documents. Businesses are primarily concerned
with managing unstructured data, because over 80 percent of enterprise data is
unstructured and requires significant storage space and effort to manage. "Big data"
refers to datasets whose size is beyond the ability of typical database software tools
to capture, store, manage, and analyse.
Big data analytics is the area where advanced analytic techniques operate on big data
sets.[4]
Fig 1. Big Data
What is Big Data? The 3 Vs of Big Data
By now, it’s almost impossible not to have heard the term Big Data: a cursory glance
at Google Trends will show how the term has exploded over the past few years and
become unavoidably ubiquitous in public consciousness. But what you may have managed
to avoid is gaining a thorough understanding of what Big Data actually constitutes.
The first go-to answer is that ‘Big Data’ refers to datasets too large to be processed
on a conventional database system. In this way, the term Big Data is nebulous: whilst
size is certainly a part of it, scale alone doesn’t tell the whole story of what makes
Big Data ‘big’.
When looking for a slightly more comprehensive overview, many refer to Doug Laney’s
3 V’s:
1. Volume
100 terabytes of data are uploaded daily to Facebook; Akamai analyses 75 million
events a day to target online ads; Walmart handles 1 million customer transactions
every single hour. 90% of all data ever created was generated in the past 2 years.
Scale is certainly a part of what makes Big Data big. The internet-mobile revolution,
bringing with it a torrent of social media updates, sensor data from devices and an
explosion of e-commerce, means that every industry is swamped with data, which can be
incredibly valuable if you know how to use it.
2. Velocity
In 1999, Wal-Mart’s data warehouse stored 1,000 terabytes (1,000,000 gigabytes) of
data. In 2012, it had access to over 2.5 petabytes (2,500,000 gigabytes) of data.
Every minute of every day, we upload 100 hours of video to YouTube, send over 200
million emails and send 300,000 tweets. ‘Velocity’ refers to the increasing speed at
which this data is created, and the increasing speed at which it must be processed,
stored and analysed. The possibility of processing data in real time is an area of
particular interest: it allows companies to do things like display personalised ads on
the web pages you visit, based on your recent search, viewing and purchase history.
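The kind of real-time bookkeeping this implies can be sketched with a short,
self-contained example: keeping a running count of recent events over a sliding
window. The timestamps and the 60-second window below are invented for illustration;
real streaming platforms do this at far larger scale and in parallel.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events that occurred within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def record(self, timestamp):
        self.events.append(timestamp)

    def count(self, now):
        # Drop events that have fallen out of the window before counting.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

counter = SlidingWindowCounter(window_seconds=60)
for t in (0, 10, 30, 70):  # event times in seconds
    counter.record(t)
print(counter.count(now=75))  # only the events at t=30 and t=70 remain: 2
```

The same idea, applied per ad impression or per transaction, is what lets a real-time
platform react while the data is still fresh rather than after a nightly batch run.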
3. Variety
Gone are the days when a company’s data could be neatly slotted into a table and
analysed. 90% of data generated is ‘unstructured’, coming in all shapes and forms:
from geo-spatial data, to tweets which can be analysed for content and sentiment, to
visual data such as photos and videos.
The ‘3 V’s’ certainly give us an insight into the almost unimaginable scale of data,
and the break-neck speeds at which these vast datasets grow and multiply. But only
‘Variety’ really begins to scratch the surface of the depth, and crucially the
challenges, of Big Data. [6]
Fig 2. Characteristics Of Big Data
Big Data versus Small Data:
Parameter: Goals
Small Data: Usually designed to answer a specific question or serve a particular goal.
Big Data: Usually designed with a goal in mind, but the goal is flexible and the
questions posed are protean. A Big Data project may be designed, for example, "to
combine high-quality data from fisheries, Coast Guard, commercial shipping, and
coastal management agencies" into a growing data collection.

Parameter: Location
Small Data: Typically contained within one institution, often on one computer,
sometimes in one file.
Big Data: Typically spread throughout electronic space, typically parceled onto
multiple Internet servers, located anywhere on earth.

Parameter: Data structure and content
Small Data: Ordinarily contains highly structured data. The data domain is restricted
to a single discipline or subdiscipline. The data often comes in the form of uniform
records in an ordered spreadsheet.
Big Data: Must be capable of absorbing unstructured data (e.g., free-text documents,
images, motion pictures, sound recordings, physical objects). The subject matter of
the resource may cross multiple disciplines, and the individual data objects in the
resource may link to data contained in other, seemingly unrelated, Big Data resources.

Parameter: Data preparation
Small Data: In many cases, the data user prepares her own data, for her own purposes.
Big Data: The data comes from many diverse sources, and it is prepared by many people.
People who use the data are seldom the people who prepared it.

Parameter: Measurements
Small Data: Typically, the data is measured using one experimental protocol, and the
data can be represented using one set of standard units.
Big Data: Many different types of data are delivered in many different electronic
formats. Measurements, when present, may be obtained by many different protocols.
Verifying the quality of Big Data is one of the most difficult tasks for data
managers.[7]
Why Is Big Data Important?
The importance of big data doesn’t revolve around how much data you have, but what
you do with it. You can take data from any source and analyze it to find answers that
enable 1) cost reductions, 2) time reductions, 3) new product development and
optimized offerings, and 4) smart decision making. When you combine big data with
high-powered analytics, you can accomplish business-related tasks such as:
Determining root causes of failures, issues and defects in near-real time.
Generating coupons at the point of sale based on the customer’s buying habits.
Recalculating entire risk portfolios in minutes.
Detecting fraudulent behavior before it affects your organization.[8]
Big Data at the Edge:
Much of the current discussion about big data analytics focuses on managing and
analyzing unstructured data from business and social sources such as e-mail, videos,
tweets, Facebook posts, reviews, and Web behavior. While this type of big data
analytics promises to provide significant value to organizations, data generated at the
edge of the network from sensors and other devices represents another huge, untapped
resource with the potential to deliver insights that can transform the operations and
strategic initiatives of public and private sector organizations.
Data from intelligent systems and sensors is some of the largest-volume,
fastest-streaming, and most complex big data. The data sources are distributed across
the network, and data is collected by an enormous variety of equipment, such as
utility meters, traffic and security cameras, RFID readers, factory-line sensors,
fitness machines, and medical devices.
Ubiquitous connectivity and the growth of sensors and intelligent systems have opened
up a whole new storehouse of valuable information. Edge data can provide significant
value to both the private and public sector as a source of enormous potential for gaining
deeper, richer insight faster and more cost-effectively than in the past. In many cases,
analysis of edge data can help organizations respond to events and solve problems that
were previously out of reach.[9]
Big Data: Opportunities and Challenges:
In the distributed systems world, "Big Data" started to become a major issue in the
late 1990s due to the impact of the World Wide Web and a resulting need to index and
query its rapidly mushrooming content. Database technology (including parallel
databases) was considered for the task, but was found to be neither well-suited nor
cost-effective for those purposes. The turn of the millennium then brought further
challenges as companies began to use information such as the topology of the Web and
users' search histories in order to provide increasingly useful search results, as
well as more effectively targeted advertising to display alongside and fund those
results. Google's technical response to the challenges of Web-scale data management
and analysis was simple by database standards, but kicked off what has become the
modern "Big Data" revolution in the systems world.
To handle the challenge of Web-scale storage, the Google File System (GFS) was
created. GFS provides clients with the familiar OS-level byte-stream abstraction, but it
does so for extremely large files whose content can span hundreds of machines in
shared-nothing clusters created using inexpensive commodity hardware. To handle the
challenge of processing the data in such large files, Google pioneered its MapReduce
programming model and platform.
This model, characterized by some as "parallel programming for dummies", enabled
Google's developers to process large collections of data by writing two user-defined
functions, map and reduce, that the MapReduce framework applies to the individual
input instances (map) and to sorted groups of instances that share a common key
(reduce), similar to the sort of partitioned parallelism utilized in shared-nothing
parallel query processing.
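As a loose, single-process sketch of those two user-defined functions (word counting
is the customary illustration here, not an example taken from the text), the model
looks roughly like this, with a tiny driver standing in for the framework's
distributed shuffle:

```python
from collections import defaultdict

# User-defined map function: emit (key, value) pairs for one input record.
def map_fn(line):
    for word in line.split():
        yield word, 1

# User-defined reduce function: combine all values that share one key.
def reduce_fn(key, values):
    return key, sum(values)

def mapreduce(records):
    # Shuffle/sort: group intermediate pairs by key, as the framework would
    # do across machines before invoking reduce once per key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(mapreduce(["big data", "big clusters"]))  # {'big': 2, 'clusters': 1, 'data': 1}
```

In a real MapReduce deployment the records, the intermediate groups, and the reduce
calls are spread over many machines; only the two small functions are written by the
developer.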
Driven by very similar requirements, software developers at Yahoo!, Facebook, and
other large Web companies followed suit. Taking Google's GFS and MapReduce papers as
rough technical specifications, open-source equivalents were developed, and the Apache
Hadoop MapReduce platform and its underlying file system (HDFS, the Hadoop Distributed
File System) were born. The Hadoop system has quickly gained traction, and it is now
widely used for use cases including Web indexing, clickstream and log analysis, and
certain large-scale information extraction and machine learning tasks. Soon tired of
the low-level nature of the MapReduce programming model, the Hadoop community
developed a set of higher-level declarative languages for writing queries and data
analysis pipelines that are compiled into MapReduce jobs and then executed on the
Hadoop MapReduce platform.
Popular languages include Pig from Yahoo!, Jaql from IBM, and Hive from Facebook. Pig
is relational-algebra-like in nature, and is reportedly used for over 60% of Yahoo!'s
MapReduce use cases; Hive is SQL-inspired and reported to be used for over 90% of
Facebook's MapReduce use cases. Microsoft's technologies include a parallel runtime
system called Dryad and two higher-level programming models, DryadLINQ and the
SQL-like SCOPE language, which utilizes Dryad under the covers. Interestingly,
Microsoft has also recently announced that its future "Big Data" strategy includes
support for Hadoop.[4]
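To suggest how such a declarative, Hive- or Pig-style query relates to the underlying
map-and-reduce steps, the sketch below lowers a hypothetical group-by aggregation onto
that pattern. The log records and field names are invented for illustration; they are
not from any system cited above.

```python
from collections import defaultdict

# Records a declarative query might scan, e.g. (page, response_ms) per request.
log = [("home", 120), ("search", 340), ("home", 95), ("cart", 210), ("home", 130)]

# A query like "SELECT page, COUNT(*), AVG(ms) ... GROUP BY page" lowers to:
# map phase: emit the group-by key alongside the measured value.
grouped = defaultdict(list)
for page, ms in log:
    grouped[page].append(ms)

# reduce phase: compute one aggregate per key, independently per group,
# which is why the groups can be reduced in parallel on different machines.
report = {page: (len(ms_list), sum(ms_list) / len(ms_list))
          for page, ms_list in grouped.items()}

print(report["home"])  # (3, 115.0): three requests averaging 115 ms
```

The appeal of Pig and Hive is exactly that the developer writes only the one-line
query; the compiler produces the map and reduce jobs.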
Big Data Storage:
We live in an on-demand, on-command digital universe, with data proliferating from
institutions, individuals and machines at a very high rate. This data is categorized
as "Big Data" due to its sheer volume, variety, velocity and veracity. Most of this
data is unstructured, quasi-structured or semi-structured, and it is heterogeneous in
nature. The volume and heterogeneity of data, together with the speed at which it is
generated, make it difficult for present computing infrastructure to manage Big Data.
Traditional data management, warehousing and analysis systems fall short of the tools
needed to analyze this data.
Due to the specific nature of Big Data, it is stored in distributed file system
architectures. Apache's Hadoop and HDFS are widely used for storing and managing Big
Data. Analyzing Big Data is a challenging task, as it involves large distributed file
systems which should be fault tolerant, flexible and scalable. MapReduce is widely
used for the efficient analysis of Big Data. Traditional DBMS techniques like joins
and indexing, and other techniques like graph search, are used for classification and
clustering of Big Data, and these techniques are being adapted for use in MapReduce.
The MapReduce framework runs over the Hadoop Distributed File System (HDFS). MapReduce
makes use of file indexing together with mapping, sorting, shuffling and finally
reducing. This report examines MapReduce techniques as implemented for Big Data
analysis using HDFS.[11][12]
Big Data Processing:
Big data analytics is the area where advanced analytic techniques operate on big data
sets. It is really about two things, big data and analytics, and how the two have
teamed up to create one of the most profound trends in business intelligence (BI).
MapReduce by itself is capable of analysing large distributed data sets, but due to
the heterogeneity, velocity and volume of Big Data, such data remains a challenge for
traditional analysis and management tools. One problem is that Big Data systems often
rely on NoSQL stores, which have no Data Description Language (DDL) and limited
support for transaction processing. Also, web-scale data is not universal and is
heterogeneous.
For analysis of Big Data, database integration and cleaning are much harder than in
traditional mining approaches. Parallel processing and distributed computing, nearly
non-existent in RDBMS, are becoming standard procedure. MapReduce has the following
characteristics: it supports parallel and distributed processing; it is simple; and
its architecture is shared-nothing, built from diverse commodity hardware (big
clusters). Its functions are programmed in a high-level programming language (e.g.
Java, Python) and it is flexible.
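A minimal sketch of that shared-nothing, partitioned style, assuming an invented task
(summing squares over partitions of a list) and running the partitions sequentially
where a real cluster would run each one on a separate machine:

```python
def process_partition(partition):
    # Each worker sees only its own partition and shares no state with the
    # others: the "shared-nothing" property.
    return sum(x * x for x in partition)

def partitioned_sum_of_squares(data, num_partitions):
    # Split the input into disjoint partitions, as a cluster scheduler would
    # assign file blocks to worker nodes.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # In a real cluster these calls run in parallel on different machines;
    # here they run one after another purely for illustration.
    partial_results = [process_partition(p) for p in partitions]
    # Combining the small per-partition results is the only coordination needed.
    return sum(partial_results)

print(partitioned_sum_of_squares(list(range(10)), num_partitions=3))  # 285
```

Because no partition depends on another, adding commodity machines scales the work
almost linearly, which is the economic argument behind big clusters.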
Query processing is done through NoSQL tools integrated with HDFS, such as Hive.
Analytics helps to discover what has changed and what the possible solutions are.
Advanced analytics, in turn, is the best way to discover new business opportunities
and customer segments, identify the best suppliers, associate products of affinity,
understand sales seasonality, and so on.[5][13]
Benefits of Big Data:
Understand customer needs better.
Reduce cost.
Make processes more efficient.
Detect risks and check fraud.[14]
Drawbacks of Big Data:
o High Maintenance.
o Skill needed to access Data.
o Difficult to Handle.
o Violates the Privacy Principle.[14]
Applications of Big Data:
Government.
International development.
Manufacturing.
Cyber-physical models.
Media.
Technology.
Private sector.
Science and research.[15]
Conclusion:
Big Data analysis tools like MapReduce over Hadoop and HDFS promise to help
organizations better understand their customers and the marketplace, hopefully leading
to better business decisions and competitive advantages. The need to process enormous
quantities of data has never been greater. Not only are terabyte- and petabyte-scale
datasets rapidly becoming commonplace, but there is consensus that great value lies
buried in them, waiting to be unlocked by the right computational tools. In the
commercial sphere, business intelligence is driven by the ability to gather data from
a dizzying array of sources. For engineers building information processing tools and
applications, large and heterogeneous datasets that generate continuous flows of data
lead to more effective algorithms for a wide range of tasks, from machine translation
to spam detection. In the natural and physical sciences, the ability to analyse
massive amounts of data may provide the key to unlocking the secrets of the cosmos or
the mysteries of life. MapReduce can be exploited to solve a variety of problems
related to text processing at scales that would have been unthinkable a few years ago.
We regard Big Data as an emerging trend, and the need for Big Data is arising in all
science and engineering domains. With Big Data technologies, we will hopefully be able
to provide the most relevant and most accurate social-sensing feedback to better
understand our society in real time. We can further stimulate the participation of
public audiences in the data production circle for societal and economic events. The
development of big data extends the scope of human activities, and it demands proper
attention from academia, industry and government. The world has been cooperating and
integrating on a global scale, and people are compelled to shift from a local to a
global mode in their everyday life and work. Big data redefines the relationships
among individuals, businesses, organizations, governments, and societies through
networked thinking; it further works to improve the human living environment, to
enhance the quality of public services, and to improve performance, efficiency and
productivity through intelligent interactive operation. The technological progress and
industrial upgrading of big data will create new markets, new business models and new
industry rules, and more importantly it demonstrates the collective will of a country
looking for strategic advantage. Although there is still a large gap between data
intelligence and human wisdom, big data is a promising topic and it certainly helps us
to understand the world from an entirely new perspective.[16]
REFERENCES:
[1]. Ericsson White Paper: Ericsson, Ericsson Mobility Report, February 2015,
available at:
http://www.ericsson.com/res/docs/2015/ericsson-mobility-report-feb-2015-interim.pdf
[2]. NESSI White Paper: "Big Data: A New World of Opportunities".
[3]. Book: "Big Data for Beginners" by Alonzo Williams, Stepanie Foor.
[4]. Puneet Singh Duggal, Sanchita Paul, "Big Data Analysis: Challenges and
Solutions", International Conference on Cloud, Big Data and Trust 2013, Nov 13-15,
RGPV.
[5]. Prashant Kumar, Khushboo Pandey, "Big Data and Distributed Data Mining: An
Example of Future Networks", Volume 1, Issue 2 (2013) 36-39, International Journal of
Advance Research and Innovation.
[6]. Book: "Understanding Big Data: A Beginners Guide to Data Science & the Business
Applications" by Eileen McNulty-Holmes.
[7]. Book: "Principles of Big Data: Preparing, Sharing, and Analyzing Complex
Information" by Jules J. Berman.
[8]. http://www.sas.com/en_th/insights/big-data/what-is-big-data.html
[9]. Feng Ye, Zhijian Wang, Fachao Zhou, Yapu Wang, Yuanchao Zhou, "Cloud-based Big
Data Mining & Analyzing Services Platform Integrating R", 2013 International
Conference on Advanced Cloud and Big Data.
[10]. Hanna Yang, Minjeong Park, Minsu Cho, Minseok Song, Seongjoo Kim, "A System
Architecture for Manufacturing Process Analysis based on Big Data and Process Mining
Techniques", 2014 IEEE International Conference on Big Data.
[11]. Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding, "Data Mining with Big
Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 1, January
2014.
[12]. Sandy Moens, Emin Aksehirli, Bart Goethals, "Frequent Itemset Mining for Big
Data", 2013 IEEE International Conference on Big Data.
[13]. Carson Kai-Sang Leung, Fan Jiang, "A Data Science Solution for Mining
Interesting Patterns from Uncertain Big Data", 2014 IEEE Fourth International
Conference on Big Data and Cloud Computing.
[14]. http://www.oii.ox.ac.uk/research/project/?id=98
[15]. https://www.google.co.in/#q=applications+of+big+data+wikipedia
[16]. Jason Venner, Pro Hadoop: Build Scalable, Distributed Applications in the Cloud,
ISBN-13 (pbk): 978-1-4302-1942-2, ISBN-13 (electronic): 978-1-4302-1943-9.