2. Big data is a collection of data sets so large and
complex that it becomes difficult to process
using on-hand database management tools.
“Big data” isn’t just a technology; it’s a
business strategy for capitalizing on
information resources.
3. Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion.
5. Analyst: I need to evaluate the possible relationship between client salary and overdrafts.
IT: OK. We have to evaluate a lot of statistics, set the correct db indexes and db partitioning. It will take us 5 days.
7. Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective.
IT: Ohhh, welcome, dear friend. Understood. So, it’s …. another 5 days of our work.
Analyst: Noooo!!! It’s not possible to work here!
10. Hadoop Distributed File System
Data is organized in files and directories.
Files are divided into blocks and distributed across cluster nodes.
Block placement is decided at runtime.
Replication
Blocks are replicated to handle failures.
Checksums are used to verify data integrity.
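The block, replication, and checksum ideas above can be illustrated with a toy sketch. This is not HDFS code: the names `split_into_blocks`, `place_blocks`, and `verify` are hypothetical, the 8-byte block size is chosen only to keep the example small, round-robin placement stands in for the real placement policy, and `zlib.crc32` stands in for whatever checksum the file system actually uses.

```python
import zlib

BLOCK_SIZE = 8    # toy value for illustration; real HDFS blocks are far larger
REPLICATION = 3   # assumed replication factor for this sketch


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: a real cluster decides placement at runtime using
    information such as node load and topology.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for idx, block in enumerate(blocks):
        targets = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        placement[idx] = {"checksum": zlib.crc32(block), "nodes": targets}
    return placement


def verify(block, expected_checksum):
    """Return True if the block's bytes still match the stored checksum."""
    return zlib.crc32(block) == expected_checksum


if __name__ == "__main__":
    data = b"a small file stored on a toy cluster"
    blocks = split_into_blocks(data)
    placement = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
    for idx, info in placement.items():
        print(idx, blocks[idx], info["nodes"], verify(blocks[idx], info["checksum"]))
```

If one node is lost, every block it held still exists on two other nodes, and a reader can detect a corrupted copy by recomputing the checksum and then fall back to another replica.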
18. MapReduce data flow
Input blocks on HDFS
Map (parse-hash): produces (k, v), e.g. (k, 1)
Shuffle & sorting based on k
Reduce: consumes (k, [v]), e.g. (k, [1,1,1,1,1,1..]); produces (k’, v’), e.g. (k’, 100)
Users only provide the “Map” and “Reduce” functions
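The flow above can be sketched as a single-process word count. This is only an illustration of the programming model, not the Hadoop API: `run_mapreduce` is a hypothetical stand-in for the framework, and `map_fn` and `reduce_fn` play the role of the two functions a user would supply.

```python
from collections import defaultdict


def map_fn(block):
    """User-supplied Map: parse one input block and emit (word, 1) pairs."""
    for word in block.split():
        yield (word.lower(), 1)


def reduce_fn(key, values):
    """User-supplied Reduce: sum the counts collected for one key."""
    return (key, sum(values))


def run_mapreduce(blocks, map_fn, reduce_fn):
    """Hypothetical stand-in for the framework: map, shuffle and sort, reduce."""
    # Map phase: each input block is processed independently.
    intermediate = []
    for block in blocks:
        intermediate.extend(map_fn(block))

    # Shuffle and sort: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, in sorted key order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    blocks = ["big data big", "data big"]
    print(run_mapreduce(blocks, map_fn, reduce_fn))
```

In a real cluster the map calls run in parallel on the nodes holding the blocks, and the shuffle moves data across the network; here everything runs in one loop so the three phases are easy to see.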
21. Apache Avro: a data serialization system, designed for
communication between Hadoop nodes
Cassandra and HBase: non-relational databases
that can be used with Hadoop
Hive: provides a query language similar to SQL (HiveQL)
that runs on Hadoop
Mahout: a machine learning library; that is, a tool
to assist with filtering data for analysis and
exploration
Pig: a data-flow language (Pig Latin) and execution
framework for parallel computation
ZooKeeper: a coordination service that keeps all the parts
working together