2. What is big data
Big data is a term used to describe collections of data that are huge in volume and still growing exponentially with time.
In short, such data is so large and complex that none of the traditional data management tools can process it.
Three diverse data sources: machine data, organizational data, and people-generated data.
Machine-generated data is the largest source of big data.
An example of machine data is weather station sensor output.
An example of organizational data is disease data from the Centers for Disease Control.
3. Advantages of big data
Real-time notification enables real-time actions
Design differently: culture shift to real-time operations
Increased use of scalable computing
Supervisory Control and Data Acquisition (SCADA)
4. Big data generated by people: unstructured data

Company    Data processed daily
eBay       100 petabytes (PB)
Google     100 PB
Facebook   30+ PB
Spotify    64 terabytes (TB)
Twitter    100 TB
7. Organization-generated data: benefits come from combining with other types
The UPS success:
16 million shipments per day
40 million tracking requests
UPS is estimated to have 16 PB of data around its operations
Large operational data + optimization algorithms → route optimization → $50 million in savings
13. The Basic Hadoop Components
• Hadoop Common – libraries and utilities
• Hadoop Distributed File System (HDFS) – a distributed file system
• Hadoop YARN – a resource-management and job-scheduling platform
• Hadoop MapReduce – a programming model for large-scale data processing
15. Original HDFS Design Goals
• Resilience to hardware failure
• Streaming data access
• Support for large datasets: scalability to hundreds or thousands of nodes with high aggregate bandwidth
• Application locality to data
• Portability across heterogeneous hardware and software platforms
16. Original HDFS Design
• Single NameNode – a master server that manages the file system namespace and regulates access to files by clients
• Multiple DataNodes – typically one per node in the cluster. Functions:
  • Manage storage
  • Serve read/write requests from clients
  • Block creation, deletion, and replication based on instructions from the NameNode
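The NameNode/DataNode split above can be pictured with a toy simulation. The class and method names below are hypothetical illustrations, not real HDFS APIs; it only models the idea that the NameNode tracks block locations and orders re-replication when a DataNode fails (3 is HDFS's default replication factor).

```python
import random

class NameNode:
    """Toy model (not real HDFS code): tracks which DataNodes hold each block."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes      # names of the DataNodes in the cluster
        self.replication = replication  # HDFS defaults to 3 replicas per block
        self.block_map = {}             # block id -> list of DataNodes with a replica

    def create_block(self, block_id):
        # Place `replication` replicas on distinct DataNodes.
        targets = random.sample(self.datanodes, self.replication)
        self.block_map[block_id] = targets
        return targets

    def datanode_failed(self, node):
        # Re-replicate every block that dropped below the target replica count.
        for block_id, holders in self.block_map.items():
            if node in holders:
                holders.remove(node)
                spares = [d for d in self.datanodes
                          if d != node and d not in holders]
                holders.append(random.choice(spares))

nn = NameNode(datanodes=["dn1", "dn2", "dn3", "dn4", "dn5"])
nn.create_block("blk_0001")
nn.datanode_failed("dn3")   # replica count is restored to 3 on surviving nodes
```

In real HDFS the DataNodes report via heartbeats and the NameNode issues replication instructions to them; here that exchange is collapsed into one method call.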
18. MapReduce Framework
• Software framework for writing parallel data-processing applications
• A MapReduce job splits the input data into chunks
• Map tasks process the data chunks
• The framework sorts the map output
• Reduce tasks use the sorted map output as input
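The four steps above (split, map, framework sort, reduce) can be sketched in-process with the classic word-count example. This is a single-machine illustration of the programming model, not Hadoop code; real MapReduce runs the same phases distributed across a cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    # Map phase: emit a (word, 1) pair for every word in the chunk.
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_task(word, counts):
    # Reduce phase: sum all counts emitted for one key.
    return (word, sum(counts))

# The job splits the input into chunks; each map task gets one chunk.
chunks = ["Big data is big", "data tools for big data"]
mapped = [pair for chunk in chunks for pair in map_task(chunk)]

# Shuffle/sort phase: the framework sorts map output by key,
# so all pairs for one word become adjacent.
mapped.sort(key=itemgetter(0))

# One reduce call per distinct key.
counts = dict(
    reduce_task(word, (c for _, c in group))
    for word, group in groupby(mapped, key=itemgetter(0))
)
print(counts)  # {'big': 3, 'data': 3, 'for': 1, 'is': 1, 'tools': 1}
```

Note that map tasks never see each other's chunks and reduce tasks never see unsorted data, which is what lets Hadoop parallelize both phases freely.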
20. YARN: NextGen MapReduce
• Main idea – separate resource management from job scheduling/monitoring
• Global ResourceManager (RM)
• NodeManager on each node
• ApplicationMaster – one for each application
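The separation of concerns above can be illustrated with a toy model. The classes below are hypothetical stand-ins, not the YARN API: the ResourceManager only hands out containers from a global pool, while the per-application ApplicationMaster decides which of its own tasks to place in the containers it receives.

```python
class ResourceManager:
    """Toy global RM: tracks free container slots per node."""
    def __init__(self, nodes):
        self.free = dict(nodes)  # node name -> free container slots

    def allocate(self, n):
        # Grant up to n containers; knows nothing about the application's tasks.
        granted = []
        for node, slots in self.free.items():
            while slots > 0 and len(granted) < n:
                granted.append(node)
                slots -= 1
            self.free[node] = slots
        return granted

class ApplicationMaster:
    """Toy per-application AM: owns the task list and the placement decisions."""
    def __init__(self, rm, tasks):
        self.rm = rm
        self.tasks = list(tasks)

    def run(self):
        placements = []
        while self.tasks:
            containers = self.rm.allocate(len(self.tasks))
            if not containers:
                break  # cluster full; a real AM would wait and ask again
            for node in containers:
                placements.append((self.tasks.pop(0), node))
        return placements

rm = ResourceManager({"node1": 2, "node2": 1})
am = ApplicationMaster(rm, tasks=["map_0", "map_1", "reduce_0"])
placements = am.run()
print(placements)  # [('map_0', 'node1'), ('map_1', 'node1'), ('reduce_0', 'node2')]
```

Because scheduling logic lives in the ApplicationMaster, frameworks other than MapReduce can run on the same cluster simply by shipping their own AM.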
21. YARN Features
• ResourceManager high availability
• Timeline Server
• Use of cgroups
• Secure containers
• YARN web services REST APIs