2. What is Big Data?
Big Data refers to the huge amounts of digital information collected from many different sources.
Big Data is one of those things that is completely transforming the way we do everyday things: almost everything we do now leaves a digital trace that can be captured and analyzed. Big Data refers to our ability to make use of these ever-increasing volumes of data, with the aim of solving new problems, or old problems in a better way.
3. Data generated by us:
Mobile devices
Conversation data
Photo and video image data
Social network data
Satellites
Internet of Things data
4. Big Data is characterized by the 3 Vs:
Volume – data quantity
Velocity – data speed
Variety – data types
Storing Big Data
Analyzing data characteristics
Selecting data sources for analysis
Eliminating redundant data
5. Processing Big Data
Mapping data to a programming framework
Connecting and extracting data from storage
Transforming data for processing
Subdividing data for Hadoop MapReduce
Creating the components of Hadoop MapReduce jobs
Executing Hadoop MapReduce jobs (see the sketch after this list)
Monitoring the progress of the job flows
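To make these steps concrete, here is a minimal sketch (not from the original slides) of the classic word-count job written against the Hadoop Java MapReduce API. The input and output paths are illustrative command-line arguments; waitForCompletion(true) both executes the job and reports its progress.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Mapper: turns each input line into (word, 1) pairs
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws java.io.IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              ctx.write(word, ONE);
            }
          }
        }
      }

      // Reducer: sums the counts emitted for each word
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws java.io.IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        // Creating the components of the Hadoop MapReduce job
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // data extracted from storage
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Executing the job and monitoring its progress on the console
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }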
6. The Structure of Big Data
Structured – traditional data sources; the data is stored in fields in a database
Semi-structured – a form of structured data that doesn't conform to the formal structure of the data models of relational databases, but has tags or other markers to separate semantic elements within the data (see the example after this list)
Unstructured – video data, audio data; data that doesn't reside in a traditional row-column database.
7. How is Big Data actually used?
Some examples…
Better understanding and targeting customers
Understanding and optimizing business processes
Improving health
Improving security
Improving sports performance
Improving and optimizing cities and countries
There are endless applications of Big Data. Any business that doesn't seriously consider the implications of Big Data runs the risk of being left behind!
8. Infrastructure of Big Data
To handle the different dimensions of Big Data in terms of volume, velocity, and variety, an effective and efficient design is needed to process large amounts of data arriving at high speed from different sources. Several facets are present here:
Multi-source Big Data generation
Big Data storage
Big Data processing
Cloud Computing and Big Data
Big Data needs massive amounts of memory or storage space for all the data to be stored. This is where cloud computing comes into the picture: it is cost-saving and scalable, and it provides a variety of services such as huge processing power and high storage capacity.
9. Survey paper on Big Data (IEEE)
Ms. Vibhavari Chavan, Prof. Rajesh N. Phursule (IJCSIT paper)
Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
The size of Big Data is a constantly moving target.
Big Data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large data sets.
A Big Data environment is used to organize and analyze various types of data.
The MapReduce framework generates a lot of intermediate data.
10. Hadoop
Hadoop is an open source framework
The Hadoop framework is written in Java
Response time varies depending on the complexity of the process
Massive scalability is the key advantage
Currently used for indexing web searches, email spam detection, prediction in financial services, etc.
For storing and processing data, Hadoop consists of 2 components:
HDFS, MapReduce
11. HDFS
HDFS is the file system component of the Hadoop framework, designed and optimized to store large amounts of data on low-cost hardware. The architecture of HDFS has:
Name Node – a kind of master node holding the metadata: the addresses of all data nodes, their free space, whether each data node is active or passive, the stored data, and the job tracker.
Data Node – a type of slave node in Hadoop, used to store the data. Each data node runs a task tracker, which tracks the jobs running on that data node as well as the jobs coming from the name node (see the sketch below).
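As a rough illustration of how a client program talks to HDFS (the NameNode resolves the file's metadata while the DataNodes hold the actual blocks), here is a minimal sketch using the Hadoop Java FileSystem API; the file path is invented for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml to locate the NameNode
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
          Path path = new Path("/tmp/hello.txt"); // illustrative path
          // Write: the NameNode records the metadata, DataNodes store the blocks
          try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS!");
          }
          // Read the file back
          try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
          }
        }
      }
    }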
14. PIG
Pig, initially developed by Yahoo!, is a programming language used to handle any kind of data.
Pig has two components:
the first being the language itself, called "Pig Latin"
the second being the runtime environment where Pig Latin programs are executed.
Looking at the programming language itself, it is easier to use than having to write mapper and reducer programs:
• The first step in this language is to LOAD the data to be manipulated from HDFS
• Then run the data through a set of TRANSFORMations (which are in turn converted into mapper and reducer tasks)
• Finally, DUMP the data to the screen or STORE the results elsewhere (see the sketch after this list).
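A minimal Pig Latin sketch of these three steps; the file name, fields, and filter condition are invented for illustration.

    -- LOAD the data to be manipulated from HDFS
    users = LOAD '/data/users.txt' USING PigStorage('\t')
            AS (name:chararray, age:int);
    -- TRANSFORMations (each is compiled into mapper and reducer tasks)
    adults = FILTER users BY age >= 18;
    by_age = GROUP adults BY age;
    counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS n;
    -- DUMP to the screen, or STORE the results elsewhere in HDFS
    DUMP counts;
    STORE counts INTO '/data/age_counts';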
15. HIVE
Initially developed by Facebook and now an Apache project, Hive is a data warehouse infrastructure built on top of Hadoop for querying, data summarization, and analysis.
Supports analysis of datasets stored in Hadoop's HDFS and other compatible file systems
Different storage types – plain text, HBase, and others
Metadata is stored in an RDBMS, which reduces the time needed for semantic checks
Can operate on compressed data stored in Hadoop
Built-in and user-defined functions (UDFs)
SQL-like queries ("HiveQL") that are implicitly converted into MapReduce jobs (see the example after this list)
It provides indexes, including bitmap indexes, to speed up queries.
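A small HiveQL sketch; the table and columns are invented for illustration. The SELECT below is implicitly compiled into MapReduce jobs.

    -- Define a table over plain-text files stored in HDFS
    CREATE TABLE page_views (
      user_id STRING,
      url     STRING,
      ts      BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- An SQL-like query; Hive turns it into MapReduce jobs behind the scenes
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;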
16. HBase
HBase is a column-oriented database, whereas HDFS is a file system.
HBase has a table format with rows and columns, and each table should have a primary key (the row key) defined in it that is used for all accesses to the HBase table. It allows many attributes to be grouped into column families.
The table schema should be predefined along with the column families, but it is flexible enough to add new columns to the families at any time, making the schema flexible (see the sketch below).
Just as HDFS has a NameNode and slave nodes, MapReduce also has a JobTracker and TaskTracker slave nodes.
Availability of the NameNode is also a concern in this case, just as in HDFS, and the system is likewise sensitive to loss of the master node's information.
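A minimal sketch of this data model using the HBase Java client. The table "users" and column family "profile" are invented for the example and assumed to have been created beforehand; the row key serves as the primary key for all accesses, and new column qualifiers can be added to a family on the fly.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
          // The row key ("user42") is the primary key used for all accesses
          Put put = new Put(Bytes.toBytes("user42"));
          // Columns live inside a predefined column family ("profile"),
          // but new column qualifiers can be added at any time
          put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                        Bytes.toBytes("Alice"));
          put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"),
                        Bytes.toBytes("Pune"));
          table.put(put);

          // Point lookup by row key
          Result r = table.get(new Get(Bytes.toBytes("user42")));
          System.out.println(Bytes.toString(
              r.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"))));
        }
      }
    }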
17. Conclusion
Hadoop MapReduce is an open source framework for data-intensive, reliable, fault-tolerant, scalable processing; it has many implementation options and allows algorithms to be rewritten in the MapReduce model.
The framework breaks up large data into smaller chunks and handles them in parallel.
We can present the design and evaluation of a data-aware cache framework that requires minimal change to the original MapReduce programming model for provisioning incremental processing for Big Data applications using the MapReduce model.