1. What is Big Data
2. Big Data generators
3. Why Big Data
4. Characteristics of Big Data
5. Big Data – a worldwide problem
6. Solution for Big Data
7. How Big Data impacts IT
8. Future of Big Data
What is big data?
Big data is a collection of large and complex data sets
that are difficult to process using on-hand database
management tools or traditional data processing applications.
In simpler terms,
Big Data is a term given to the large volumes of data that
organizations store and process.
Huge amount of data
+ From the beginning of recorded time until 2003, we
created 5 billion gigabytes (5 exabytes) of data.
+ In 2011, the same amount was created every two days.
+ In 2013, the same amount of data was created every ten
minutes.
Types of Data Generators
This data comes from everywhere:
<> sensors used to gather climate information,
<> posts to social media sites,
<> digital pictures,
<> online shopping,
<> hospitality data,
<> purchase transaction records, and many more…
This data is “big data.”
Big Data Requires ?
• The growth of Big Data requires:
– an increase in storage capacity
– an increase in processing power
– availability of data (of different data types)
• Every day we create 2.5 quintillion bytes of data;
90% of the data in the world today has been created
in the last two years alone.
Big Data stores
• Choose the correct data stores based on
your data characteristics.
• Data-center staff maintain these servers,
which may be IBM, EMC, or similar servers.
• Whenever you want to process data:
– Fetch the data.
– Bring it to your local machine.
– Then process it.
1st Characteristic of Big Data: Volume
• It refers to the vast amount of data generated every second.
• The size of available data has been growing at an increasing rate.
• Today, Facebook ingests 500 terabytes of new data every day.
• Smartphones, the data they create and consume, and sensors
embedded into everyday objects will soon result in billions of new,
constantly updated data feeds containing environmental, location,
and other information, including video.
2nd Characteristic of Big Data: Velocity
• It refers to the speed at which new data is being generated
and the speed at which data moves around.
• Clickstreams and ad impressions capture user behavior at
millions of events per second.
• Machine-to-machine processes exchange data between
billions of devices.
• Online gaming systems support millions of concurrent users,
each producing multiple inputs per second.
3rd Characteristic of Big Data: Variety
• It refers to the different types of data we are now using.
• In the past we focused only on structured data that
fit neatly into tables and relational databases.
• Nowadays, about 80% of data is unstructured (text, images,
video, voice) or semi-structured (log files).
• Big Data analysis includes these different types of data.
Big Data! A Worldwide Problem?
It is becoming very difficult for companies to
store, retrieve and process the ever-increasing
volume of data.
The problem lies in the use of traditional
systems to store this enormous data.
These systems were a success a few years
ago, but with the increasing amount and complexity
of data they are quickly becoming obsolete.
• When the data is small, processing it is fast and feasible.
• As soon as the data grows, processing slows down.
• Thus, for more data, processing needs to scale.
• Thus, Hadoop was introduced as the best solution.
Solution for Big Data!
The good news is Hadoop, a
panacea for all those companies working with
BIG DATA in a variety of applications.
It has become an integral part of storing,
handling, evaluating and retrieving hundreds of
terabytes or even petabytes of data.
Hadoop was developed by Doug Cutting and
Michael J. Cafarella.
Hadoop is open-source software.
It supports data-intensive distributed applications.
Hadoop is licensed under the Apache License v2;
it is therefore known as Apache Hadoop.
Core concepts of Hadoop
• HDFS (Hadoop Distributed File System)
A technique for storing huge amounts of data.
• MapReduce
A technique for processing the data that we
store in HDFS.
• HDFS is a file system specially designed for storing huge
data sets on a cluster of commodity hardware with a
streaming access pattern.
cluster – a set of machines working together
commodity h/w – cheap, off-the-shelf hardware
streaming access pattern – write once, read any number of
times, but do not try to change the contents of a file once
you have placed it in HDFS
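To make the streaming access pattern concrete, here is a minimal Java sketch of a client using the Hadoop FileSystem API, assuming a reachable HDFS cluster configured via core-site.xml; the class name and the path /user/demo/sample.txt are hypothetical examples, not from the slides. The file is written once, closed, and from then on only read.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // connects to the configured HDFS cluster

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path, for illustration only

        // Write once: create the file, write, close. HDFS does not support editing it in place.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read any number of times.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}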
• HDFS (Hadoop Distributed File System)
splits files into large blocks (by default 64 MB, or
128 MB in newer releases) and distributes the blocks
among the nodes in the cluster.
• For processing the data,
Hadoop MapReduce ships code to the nodes
that hold the required data, and those nodes
then process the data in parallel (see the sketch below).
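As a rough illustration of the block layout the scheduler relies on, the sketch below (assuming an existing HDFS file whose path is passed as the first command-line argument) asks the NameNode where each block of that file is stored; these are the hosts to which MapReduce tries to ship the processing code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                 // path to an existing HDFS file

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}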
• MapReduce is the technique for processing the data that we store in
HDFS.
• Hadoop runs MapReduce in the form of key-value pairs
(see the word-count sketch after this list).
• The Mapper and Reducer also work with key-value pairs.
• The RecordReader is the interface between an input split and the Mapper.
• For every input split and Mapper there is one RecordReader.
• The RecordReader is taken care of by the Hadoop framework
itself by default.
• In the Mapper code we write the logic; on the basis of that logic it
emits key-value pairs.
• The RecordReader converts records into key-value pairs based on one
of three file formats:
– TextInputFormat (the default)
– KeyValueTextInputFormat
– SequenceFileInputFormat
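To make the key-value flow concrete, below is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class names and the command-line input/output paths are illustrative assumptions, not part of the slides. With TextInputFormat (the default), the RecordReader hands the Mapper each line as a (byte offset, line text) pair; the Mapper emits (word, 1) pairs, and the Reducer receives each word together with the collection of its counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: the RecordReader (TextInputFormat) supplies (byte offset, line) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    // Reducer: receives each word with the collection of counts grouped for it.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            result.set(sum);
            context.write(word, result);           // emit (word, total)
        }
    }

    // Driver: configures the job; input and output paths come from the command line.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setInputFormatClass(TextInputFormat.class); // the default, shown explicitly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be packaged into a jar and launched with the hadoop jar command, with the input file already present in HDFS.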
• Combining (shuffling):
– a phase on the intermediate data that groups all
key-value pairs into a collection associated with
each key.
• Sorting:
– another phase on the intermediate data that
sorts all key-value pairs by key.
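Both phases run automatically inside the framework between the map and reduce stages. As a hedged extension of the hypothetical word-count driver sketched above, a job whose reduce logic is associative (like summing) can also shrink the intermediate data before it is shuffled by reusing the Reducer as a combiner:

// Added to the WordCount driver above: merge (word, 1) pairs on the map side
// before the shuffle, using the same summing logic as the Reducer.
job.setCombinerClass(IntSumReducer.class);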
How Big Data impacts IT
• Big data is a disruptive force presenting
opportunities along with challenges to IT organizations.
• By 2015 there will be 4.4 million IT jobs in Big Data; 1.9 million
of them in the US alone.
• India will require a minimum of 1 lakh (100,000) data
scientists in the next couple of years, in addition
to data analysts and data managers, to support
the Big Data space.
Future of Big Data
• $15 billion has been spent on software firms specializing only in
data management and analytics.
• This industry on its own is worth more than $100
billion and is growing at almost 10% a year, which is
roughly twice as fast as the software business as a whole.
• In February 2012, the open-source analyst firm
Wikibon released the first market forecast for Big
Data, listing $5.1B in revenue for 2012 with growth to
$53.4B in 2017.
• The McKinsey Global Institute estimates that data
volume is growing 40% per year.