2. How it Works?
Verifiable Certificate
Experienced Instructor
Live Online Classes
Survey Feedback
24x7 Support
In-class Questions
Class Recordings in LMS
Module-wise Assessments
Project Work
Android & iOS App
3. Objectives
At the end of this module, you will be able to:
» Understand what Big Data is
» Analyse the limitations of the existing Data Analytics Architecture and their solutions
» Understand what Hadoop is and its features
» Describe the Hadoop Ecosystem
» Understand Hadoop 2.x core components
» Perform Read and Write operations in Hadoop
» Understand the Rack Awareness concept
» Work on Edureka’s VM
4. What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: Big Data, cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, mobile, processing]
5. What is Big Data?
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
Amazon handles 15 million customer clickstream records per day to recommend products.
The stock market generates about one terabyte of new trade data per day, which feeds stock trading analytics that determine trends for optimal trades.
294 billion emails are sent every day. Services analyse this data to detect spam.
6. Un-structured Data is Exploding
The world’s information is doubling every two years. By 2020, IDC (International Data Corporation) predicts the total will have reached 40,000 EB, or 40 zettabytes (ZB); that is 5,200 GB of data for every person on Earth.
Most of this growth is un-structured data.
[Chart: structured vs. un-structured data volume, 2005–2015]
7. IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
IBM’s definition of Big Data rests on four characteristics:
» VOLUME
» VELOCITY
» VARIETY (e.g. web logs, audios, images, videos, sensor data)
» VERACITY
9. Annie’s Question
Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)
10. Annie’s Answer
Ans.
» XML files, e-mail body -> Semi-structured data
» Audio, Video, Images, Archived documents -> Unstructured data
» Data from Enterprise systems (ERP, CRM etc.) -> Structured data
11. Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
12. Common Big Data Customer Scenarios
Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
13. Common Big Data Customer Scenarios (Contd.)
Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
14. Common Big Data Customer Scenarios (Contd.)
Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
Retail
» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
15. Hidden Treasure
Case Study: Sears Holding Corporation
Insight into data can provide business advantage.
Some key early indicators can mean fortunes to a business.
More precise analysis is possible with more data.
*Sears was using traditional systems, such as Oracle Exadata, Teradata and SAS, to store and process the customer activity and sales data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
16. Limitations of Existing Data Analytics Architecture
The existing pipeline: Instrumentation -> Collection -> Storage-only Grid (original raw data, mostly append) -> ETL Compute Grid -> RDBMS (aggregated data) -> BI Reports + Interactive Apps.
A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived.
The limitations:
1. Can’t explore the original high-fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death
17. Solution: A Combined Storage + Compute Layer
The new pipeline: Instrumentation -> Collection -> Hadoop: Storage + Compute Grid (mostly append, both storage and processing) -> RDBMS (aggregated data) -> BI Reports + Interactive Apps.
The entire ~2 PB of data is available for processing; there is no data archiving.
1. Data Exploration & Advanced Analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
18. Why DFS?
Read 1 TB of data:
» 1 Machine: 4 I/O channels, each channel 100 MB/s
» 10 Machines: 4 I/O channels each, each channel 100 MB/s
One machine reads at 400 MB/s and needs roughly 42 minutes; ten machines reading in parallel need roughly 4.2 minutes.
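A back-of-the-envelope sketch of that arithmetic in Python (the machine and channel figures are from the slide; the function name is ours):

def read_time_minutes(data_mb, machines, channels_per_machine, channel_mb_per_s):
    # Aggregate throughput grows linearly with the number of machines,
    # assuming the data is spread evenly and all channels read in parallel.
    throughput_mb_per_s = machines * channels_per_machine * channel_mb_per_s
    return data_mb / throughput_mb_per_s / 60

ONE_TB_MB = 1_000_000  # 1 TB expressed in MB (decimal units)
print(read_time_minutes(ONE_TB_MB, 1, 4, 100))   # ~41.7 minutes on 1 machine
print(read_time_minutes(ONE_TB_MB, 10, 4, 100))  # ~4.2 minutes on 10 machines

This linear speed-up from parallel reads is the core motivation for a distributed file system.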
21. What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
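To make the “simple programming model” concrete, below is a minimal word count written as Hadoop Streaming scripts in Python. This is an illustrative sketch, not part of the original slides: the file names are ours, but the mapper/reducer contract (output sorted by key between the two phases) is how MapReduce works.

# mapper.py -- emit one "word<TAB>1" line per word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- Hadoop sorts the mapper output by key, so all counts for a
# given word arrive consecutively and can be summed in a single pass
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word and current_word is not None:
        print(current_word + "\t" + str(count))
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

The pair can be tested locally without a cluster: cat input.txt | python mapper.py | sort | python reducer.py. On a cluster, the same scripts are submitted through the Hadoop Streaming jar.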
23. Hadoop Design Principles
Facilitate the storage and processing of large and/or rapidly growing data sets
» Structured and unstructured data
» Simple programming models
Scale out rather than scale up
Bring code to data rather than data to code
High scalability and availability
Use commodity hardware
Fault tolerance
24. Hadoop – It’s about Scale and Structure
RDBMS | | HADOOP
Structured | Data Types | Multi and Unstructured
Limited, No Data Processing | Processing | Processing coupled with Data
Standards & Structured | Governance | Loosely Structured
Required On Write | Schema | Required On Read
Reads are Fast | Speed | Writes are Fast
Software License | Cost | Support Only
Known Entity | Resources | Growing, Complexities, Wide
OLTP, Complex ACID Transactions, Operational Data Store | Best Fit Use | Data Discovery, Processing Unstructured Data, Massive Storage/Processing
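A tiny sketch of the Schema row (“Required On Write” vs “Required On Read”), purely illustrative: an RDBMS validates the schema when data is written, while Hadoop stores raw lines and applies a schema only when the data is read.

raw_lines = ["alice,34", "bob,29", "corrupt-record"]

# Schema on write (RDBMS style): bad rows are rejected at load time.
def load_with_schema(lines):
    rows = []
    for line in lines:
        name, age = line.split(",")  # raises ValueError on "corrupt-record"
        rows.append((name, int(age)))
    return rows

# Schema on read (Hadoop style): store everything, interpret at query time.
def query_ages(lines):
    for line in lines:
        parts = line.split(",")
        if len(parts) == 2 and parts[1].isdigit():
            yield parts[0], int(parts[1])  # malformed rows are simply skipped

print(list(query_ages(raw_lines)))  # [('alice', 34), ('bob', 29)]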
25. Annie’s Question
Hadoop is a framework that allows for the distributed
processing of:
» Small Data Sets
» Large Data Sets
26. Annie’s Answer
Ans. Large Data Sets.
Hadoop is also capable of processing small data sets. However, to experience the true power of Hadoop, one needs data in terabytes, because that is where an RDBMS takes hours or fails outright, whereas Hadoop does the same in a couple of minutes.
27. Hadoop Ecosystem
Hadoop 1.0
» HDFS (Hadoop Distributed File System)
» MapReduce Framework
» HBase
» Pig Latin (Data Analysis)
» Hive (DW System)
» Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Sqoop (Structured Data)
» Flume (Unstructured or Semi-structured Data)
Hadoop 2.0
» HDFS (Hadoop Distributed File System)
» YARN (Cluster Resource Management)
» MapReduce Framework and Other YARN Frameworks (MPI, GRAPH)
» HBase
» Pig Latin (Data Analysis)
» Hive (DW System)
» Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Sqoop (Structured Data)
» Flume (Unstructured or Semi-structured Data)
28. Machine Learning with Mahout
Write intelligent applications using Apache Mahout.
LinkedIn Recommendations are an example of Hadoop and MapReduce magic in action.
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
31. Main Components of HDFS
NameNode:
» Master of the system
» Maintains and manages the blocks which are present on the DataNodes
DataNodes:
» Slaves which are deployed on each machine and provide the actual storage
» Responsible for serving read and write requests from the clients
32. NameNode Metadata
Meta-data in Memory
» The entire metadata is held in main memory
» No demand paging of FS metadata
Types of Metadata
» List of files
» List of blocks for each file
» List of DataNodes for each block
» File attributes, e.g. access time, replication factor
A Transaction Log
» Records file creations, file deletions etc.
The NameNode stores metadata only: it keeps track of the overall file directory structure and the placement of data blocks, e.g.:
METADATA:
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2
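A toy model of that metadata in Python, for intuition only (the DataNode names and replica placements are made up; the real NameNode keeps this in optimized in-memory structures, not plain dicts):

# file path -> ordered list of block IDs, as in the slide's example
namespace = {
    "/user/doug/hinfo": [1, 3, 5],
    "/user/doug/pdetail": [4, 2],
}

# block ID -> DataNodes holding a replica (replication factor 3 assumed)
block_locations = {
    1: ["dn1", "dn4", "dn7"],
    2: ["dn2", "dn5", "dn8"],
    3: ["dn1", "dn5", "dn9"],
    4: ["dn3", "dn6", "dn7"],
    5: ["dn2", "dn4", "dn8"],
}

def locate(path):
    # For each block of the file, list the DataNodes it can be read from.
    return [(block, block_locations[block]) for block in namespace[path]]

print(locate("/user/doug/hinfo"))  # [(1, [...]), (3, [...]), (5, [...])]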
33. File Blocks
By default, the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.
Why is the block size large?
» The main reason for having large HDFS blocks is to reduce the cost of seek time.
» A large block size also keeps the NameNode’s memory usage manageable, since the NameNode holds an entry for every block.
Hadoop 1.x: 200 MB emp.dat
» 64 MB – Block 1
» 64 MB – Block 2
» 64 MB – Block 3
» 8 MB – Block 4
Hadoop 2.x: 200 MB abc.txt
» 128 MB – Block 1
» 72 MB – Block 2
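A short sketch of the block-split arithmetic behind those examples (the function name is ours; HDFS does this internally):

def split_into_blocks(file_size_mb, block_size_mb):
    # Whole blocks first, then one final partial block for any remainder.
    blocks = [block_size_mb] * (file_size_mb // block_size_mb)
    if file_size_mb % block_size_mb:
        blocks.append(file_size_mb % block_size_mb)
    return blocks

print(split_into_blocks(200, 64))   # Hadoop 1.x: [64, 64, 64, 8]
print(split_into_blocks(200, 128))  # Hadoop 2.x: [128, 72]

Note how the same 200 MB file costs four NameNode block entries under 64 MB blocks but only two under 128 MB blocks.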
43. Annie’s Question
In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially:
a. TRUE
b. FALSE
45. Annie’s Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a. Can read up to the block that was successfully written.
b. Can read up to the last bit successfully written.
c. Will throw an exception.
d. Cannot see the file until copying has finished.
47. Other Hadoop Distributions
Cloudera distributes a platform of open-source projects called Cloudera’s Distribution including Apache Hadoop (CDH), along with architectural services and technical support for Hadoop clusters in development or in production.
Hortonworks, another major player in the Hadoop market, has the largest number of committers and code contributors for the Hadoop ecosystem components. It provides expert technical support, training and partner-enablement services for both end-user organizations and technology vendors.
The MapR Distribution including Apache Hadoop provides an enterprise-grade distributed data platform to reliably store and process big data. MapR’s Apache Hadoop distribution claims to provide full data protection, no single points of failure, and improved performance.
49. Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
50. A tour of Edureka’s Virtual Machine
51. Edureka VM
Installation flow, in brief:
1. Check the pre-requisites. If your machine does not meet them, use the “Remote Login using Putty Hadoop-2.2.0” document instead.
2. Download the Edureka VM and check whether the file downloaded completely. If not, use the “Edureka VM Split Files” document; if yes, proceed with the steps mentioned in the “Edureka VM Installation” document.
3. Import the Edureka VM into Virtual Box.
4. Check for the daemons. If they are not running, run the commands to start the daemons manually.
5. Refer to README.txt on the Edureka VM Desktop.
52. Pre-requisites
Pre-requisites to install the Edureka VM:
» Minimum 4 GB RAM
» Dual Core Processor or above
If you don’t have the mentioned hardware requirements, refer to slide 57.
53. Installing Edureka VM
With reference to the “Edureka VM Installation” document present in the LMS, you can install the Edureka VM.
We suggest you use a download manager while downloading the Edureka VM, to avoid any network issues that may occur. Open-source download managers are available here for different platforms.
54. Challenges Faced During Installation of VM
The file may not get downloaded completely, due to its huge size, without displaying any error. After downloading the VM, always check its file size: it should be 4.5 GB.
» Download complete: proceed with the steps mentioned in the “Edureka VM Installation” document.
» Download incomplete: using the “Edureka VM Split Files” document present in the LMS, you can download the Edureka VM in parts.
55. Edureka VM Remote Server
With reference to the “Remote Login Using Putty – Hadoop 2.2.0” document present in the LMS, you can access our Edureka VM Remote Server.
If your system does not meet the minimum requirements, please refer to the document “Remote Login using Putty Hadoop-3.0” present in the LMS.
56. Challenges Faced While Importing VM in Virtual Box
You may face an error while importing the Edureka VM into Virtual Box if the Edureka VM file has not been downloaded completely.
In case you are using split files, combine all the downloaded split files using the HJSplit application before importing them into Virtual Box.
57. Challenges Faced After Importing VM in Virtual Box
After importing the VM into Virtual Box, start the VM and check if all the daemons are up and running by using:
Command: sudo jps
If the daemons are not up and running, execute the below commands:
Command: sudo service hadoop-master stop
Command: sudo service hadoop-master start
Command: hadoop dfsadmin -safemode leave
After executing the above commands, check again if all daemons are running fine by using:
Command: sudo jps
58. Edureka VM Desktop
Please refer to the README.txt file present on the Desktop of the Edureka VM to know more about it.
59. Assignment
Referring to the documents present in the LMS under Assignment, solve the below problem:
How many DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
60. Pre-work for Next Class
Set up the Hadoop development environment using the documents present in the LMS.
» Set up Edureka’s Virtual Machine
» Execute Linux basic commands
» Execute HDFS hands-on commands
Attempt the Module-1 assignments present in the LMS.
61. Agenda for Next Class
» Hadoop 2.x Cluster Architecture
» Hadoop cluster modes
» Basic Hadoop commands
» Hadoop 2.x configuration files and their parameters
» Password-less SSH on a Hadoop cluster
» Dump of a MapReduce program
» Data loading techniques
62. Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.