2. How it Works?
Verifiable Certificate
Experienced Instructor
Live Online Classes
Survey Feedback
24x7 Support
In-class Questions
Class Recordings in LMS
Module-wise Assessments
Project Work
Android & iOS App
3. Objectives
At the end of this module, you will be able to:
» Understand what Big Data is
» Analyse the limitations of the existing Data Analytics Architecture and their solutions
» Understand what Hadoop is and its features
» Describe the Hadoop Ecosystem
» Understand Hadoop 2.x core components
» Perform Read and Write operations in Hadoop
» Understand the Rack Awareness concept
» Work on Edureka’s VM
4. What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: Big Data, cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, mobile, processing]
5. What is Big Data?
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
Amazon handles 15 million customer clickstream records per day to recommend products.
The stock market generates about one terabyte of new trade data per day, which feeds stock trading analytics that determine trends for optimal trades.
294 billion emails are sent every day. Services analyse this data to detect spam.
6. Un-structured Data is Exploding
The world’s information is doubling every two years. By 2020, IDC (International Data Corporation) predicts the total will have reached 40,000 EB, or 40 zettabytes (ZB); that is 5,200 GB of data for every person on Earth.
Most of this growth is un-structured data.
[Chart: structured vs. un-structured data volume, 2005–2015]
7. IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
IBM’s definition of Big Data rests on four characteristics:
» VOLUME
» VELOCITY
» VARIETY (e.g. web logs, audios, images, videos, sensor data)
» VERACITY
9. Annie’s Question
Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)
10. Annie’s Answer
Ans.
» XML files, e-mail body -> Semi-structured data
» Audio, Video, Images, Archived documents -> Unstructured data
» Data from Enterprise systems (ERP, CRM etc.) -> Structured data
11. Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
12. Common Big Data Customer Scenarios
Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
13. Common Big Data Customer Scenarios (Contd.)
Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
14. Common Big Data Customer Scenarios (Contd.)
Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
Retail
» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
15. Hidden Treasure
Case Study: Sears Holding Corporation
Insight into data can provide business advantage.
Some key early indicators can mean fortunes to a business.
More precise analysis is possible with more data.
*Sears was using traditional systems, such as Oracle Exadata, Teradata and SAS, to store and process the customer activity and sales data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
16. Limitations of Existing Data Analytics Architecture
The existing pipeline: Instrumentation -> Collection -> Storage-only Grid (original raw data, mostly append) -> ETL Compute Grid -> RDBMS (aggregated data) -> BI Reports + Interactive Apps.
A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived.
The limitations:
1. Can’t explore the original high-fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death
17. Solution: A Combined Storage + Compute Layer
The new pipeline: Instrumentation -> Collection -> Hadoop: Storage + Compute Grid (mostly append, both storage and processing) -> RDBMS (aggregated data) -> BI Reports + Interactive Apps.
The entire ~2 PB of data is available for processing; there is no data archiving.
1. Data Exploration & Advanced Analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
18. Why DFS?
Read 1 TB of data:
» 1 Machine: 4 I/O channels, each channel 100 MB/s
» 10 Machines: 4 I/O channels each, each channel 100 MB/s
One machine reads at 400 MB/s and needs roughly 42 minutes; ten machines reading in parallel need roughly 4.2 minutes.
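A back-of-the-envelope sketch of that arithmetic in Python (the machine and channel figures are from the slide; the function name is ours):

def read_time_minutes(data_mb, machines, channels_per_machine, channel_mb_per_s):
    # Aggregate throughput grows linearly with the number of machines,
    # assuming the data is spread evenly and all channels read in parallel.
    throughput_mb_per_s = machines * channels_per_machine * channel_mb_per_s
    return data_mb / throughput_mb_per_s / 60

ONE_TB_MB = 1_000_000  # 1 TB expressed in MB (decimal units)
print(read_time_minutes(ONE_TB_MB, 1, 4, 100))   # ~41.7 minutes on 1 machine
print(read_time_minutes(ONE_TB_MB, 10, 4, 100))  # ~4.2 minutes on 10 machines

This linear speed-up from parallel reads is the core motivation for a distributed file system.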
21. What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
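To make the “simple programming model” concrete, below is a minimal word count written as Hadoop Streaming scripts in Python. This is an illustrative sketch, not part of the original slides: the file names are ours, but the mapper/reducer contract (output sorted by key between the two phases) is how MapReduce works.

# mapper.py -- emit one "word<TAB>1" line per word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- Hadoop sorts the mapper output by key, so all counts for a
# given word arrive consecutively and can be summed in a single pass
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word and current_word is not None:
        print(current_word + "\t" + str(count))
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

The pair can be tested locally without a cluster: cat input.txt | python mapper.py | sort | python reducer.py. On a cluster, the same scripts are submitted through the Hadoop Streaming jar.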
23. Hadoop Design Principles
Facilitate the storage and processing of large and/or rapidly growing data sets
» Structured and unstructured data
» Simple programming models
Scale out rather than scale up
Bring code to data rather than data to code
High scalability and availability
Use commodity hardware
Fault tolerance
24. Hadoop – It’s about Scale and Structure
RDBMS | | HADOOP
Structured | Data Types | Multi and Unstructured
Limited, No Data Processing | Processing | Processing coupled with Data
Standards & Structured | Governance | Loosely Structured
Required On Write | Schema | Required On Read
Reads are Fast | Speed | Writes are Fast
Software License | Cost | Support Only
Known Entity | Resources | Growing, Complexities, Wide
OLTP, Complex ACID Transactions, Operational Data Store | Best Fit Use | Data Discovery, Processing Unstructured Data, Massive Storage/Processing
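A tiny sketch of the Schema row (“Required On Write” vs “Required On Read”), purely illustrative: an RDBMS validates the schema when data is written, while Hadoop stores raw lines and applies a schema only when the data is read.

raw_lines = ["alice,34", "bob,29", "corrupt-record"]

# Schema on write (RDBMS style): bad rows are rejected at load time.
def load_with_schema(lines):
    rows = []
    for line in lines:
        name, age = line.split(",")  # raises ValueError on "corrupt-record"
        rows.append((name, int(age)))
    return rows

# Schema on read (Hadoop style): store everything, interpret at query time.
def query_ages(lines):
    for line in lines:
        parts = line.split(",")
        if len(parts) == 2 and parts[1].isdigit():
            yield parts[0], int(parts[1])  # malformed rows are simply skipped

print(list(query_ages(raw_lines)))  # [('alice', 34), ('bob', 29)]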
25. Annie’s Question
Hadoop is a framework that allows for the distributed
processing of:
» Small Data Sets
» Large Data Sets
26. Annie’s Answer
Ans. Large Data Sets.
Hadoop is also capable of processing small data sets. However, to experience the true power of Hadoop, one needs data in terabytes, because that is where an RDBMS takes hours or fails outright, whereas Hadoop does the same in a couple of minutes.
27. Hadoop Ecosystem
Hadoop 1.0
» HDFS (Hadoop Distributed File System)
» MapReduce Framework
» HBase
» Pig Latin (Data Analysis)
» Hive (DW System)
» Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Sqoop (Structured Data)
» Flume (Unstructured or Semi-structured Data)
Hadoop 2.0
» HDFS (Hadoop Distributed File System)
» YARN (Cluster Resource Management)
» MapReduce Framework and Other YARN Frameworks (MPI, GRAPH)
» HBase
» Pig Latin (Data Analysis)
» Hive (DW System)
» Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Sqoop (Structured Data)
» Flume (Unstructured or Semi-structured Data)
28. Machine Learning with Mahout
Write intelligent applications using Apache Mahout.
LinkedIn Recommendations are an example of Hadoop and MapReduce magic in action.
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
31. Main Components of HDFS
NameNode:
» Master of the system
» Maintains and manages the blocks which are present on the DataNodes
DataNodes:
» Slaves which are deployed on each machine and provide the actual storage
» Responsible for serving read and write requests from the clients
32. NameNode Metadata
Meta-data in Memory
» The entire metadata is held in main memory
» No demand paging of FS metadata
Types of Metadata
» List of files
» List of blocks for each file
» List of DataNodes for each block
» File attributes, e.g. access time, replication factor
A Transaction Log
» Records file creations, file deletions etc.
The NameNode stores metadata only: it keeps track of the overall file directory structure and the placement of data blocks, e.g.:
METADATA:
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2
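A toy model of that metadata in Python, for intuition only (the DataNode names and replica placements are made up; the real NameNode keeps this in optimized in-memory structures, not plain dicts):

# file path -> ordered list of block IDs, as in the slide's example
namespace = {
    "/user/doug/hinfo": [1, 3, 5],
    "/user/doug/pdetail": [4, 2],
}

# block ID -> DataNodes holding a replica (replication factor 3 assumed)
block_locations = {
    1: ["dn1", "dn4", "dn7"],
    2: ["dn2", "dn5", "dn8"],
    3: ["dn1", "dn5", "dn9"],
    4: ["dn3", "dn6", "dn7"],
    5: ["dn2", "dn4", "dn8"],
}

def locate(path):
    # For each block of the file, list the DataNodes it can be read from.
    return [(block, block_locations[block]) for block in namespace[path]]

print(locate("/user/doug/hinfo"))  # [(1, [...]), (3, [...]), (5, [...])]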
33. File Blocks
By default, the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.
Why is the block size large?
» The main reason for having large HDFS blocks is to reduce the cost of seek time.
» A large block size also keeps the NameNode’s memory usage manageable, since the NameNode holds an entry for every block.
Hadoop 1.x: 200 MB emp.dat
» 64 MB – Block 1
» 64 MB – Block 2
» 64 MB – Block 3
» 8 MB – Block 4
Hadoop 2.x: 200 MB abc.txt
» 128 MB – Block 1
» 72 MB – Block 2
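A short sketch of the block-split arithmetic behind those examples (the function name is ours; HDFS does this internally):

def split_into_blocks(file_size_mb, block_size_mb):
    # Whole blocks first, then one final partial block for any remainder.
    blocks = [block_size_mb] * (file_size_mb // block_size_mb)
    if file_size_mb % block_size_mb:
        blocks.append(file_size_mb % block_size_mb)
    return blocks

print(split_into_blocks(200, 64))   # Hadoop 1.x: [64, 64, 64, 8]
print(split_into_blocks(200, 128))  # Hadoop 2.x: [128, 72]

Note how the same 200 MB file costs four NameNode block entries under 64 MB blocks but only two under 128 MB blocks.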
43. Annie’s Question
In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially:
a. TRUE
b. FALSE
45. Annie’s Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a. Can read up to the block that was successfully written.
b. Can read up to the last bit successfully written.
c. Will throw an exception.
d. Cannot see the file until copying has finished.
47. Other Hadoop Distributions
Cloudera distributes a platform of open-source projects called Cloudera’s Distribution including Apache Hadoop (CDH), along with architectural services and technical support for Hadoop clusters in development or in production.
Hortonworks, another major player in the Hadoop market, has the largest number of committers and code contributors for the Hadoop ecosystem components. It provides expert technical support, training and partner-enablement services for both end-user organizations and technology vendors.
The MapR Distribution including Apache Hadoop provides an enterprise-grade distributed data platform to reliably store and process big data. MapR’s Apache Hadoop distribution claims to provide full data protection, no single points of failure, and improved performance.
49. Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
50. A tour of Edureka’s Virtual Machine
51. Edureka VM
Installation flow, in brief:
1. Check the pre-requisites. If your machine does not meet them, use the “Remote Login using Putty Hadoop-2.2.0” document instead.
2. Download the Edureka VM and check whether the file downloaded completely. If not, use the “Edureka VM Split Files” document; if yes, proceed with the steps mentioned in the “Edureka VM Installation” document.
3. Import the Edureka VM into Virtual Box.
4. Check for the daemons. If they are not running, run the commands to start the daemons manually.
5. Refer to README.txt on the Edureka VM Desktop.
52. Pre-requisites
Pre-requisites to install the Edureka VM:
» Minimum 4 GB RAM
» Dual Core Processor or above
If you don’t have the mentioned hardware requirements, refer to slide 57.
53. Installing Edureka VM
With reference to the “Edureka VM Installation” document present in the LMS, you can install the Edureka VM.
We suggest you use a download manager while downloading the Edureka VM, to avoid any network issues that may occur. Open-source download managers are available here for different platforms.
54. Challenges Faced During Installation of VM
The file may not get downloaded completely, due to its huge size, without displaying any error. After downloading the VM, always check its file size: it should be 4.5 GB.
» Download complete: proceed with the steps mentioned in the “Edureka VM Installation” document.
» Download incomplete: using the “Edureka VM Split Files” document present in the LMS, you can download the Edureka VM in parts.
55. Edureka VM Remote Server
With reference to the “Remote Login Using Putty – Hadoop 2.2.0” document present in the LMS, you can access our Edureka VM Remote Server.
If your system does not meet the minimum requirements, please refer to the document “Remote Login using Putty Hadoop-3.0” present in the LMS.
56. Challenges Faced While Importing VM in Virtual Box
You may face an error while importing the Edureka VM into Virtual Box if the Edureka VM file has not been downloaded completely.
In case you are using split files, combine all the downloaded split files using the HJSplit application before importing them into Virtual Box.
57. Challenges Faced After Importing VM in Virtual Box
After importing the VM into Virtual Box, start the VM and check if all the daemons are up and running by using:
Command: sudo jps
If the daemons are not up and running, execute the below commands:
Command: sudo service hadoop-master stop
Command: sudo service hadoop-master start
Command: hadoop dfsadmin -safemode leave
After executing the above commands, check again if all daemons are running fine by using:
Command: sudo jps
58. Edureka VM Desktop
Please refer to the README.txt file present on the Desktop of the Edureka VM to know more about it.
59. Assignment
Referring to the documents present in the LMS under Assignment, solve the below problem:
How many DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
60. Pre-work for Next Class
Set up the Hadoop development environment using the documents present in the LMS.
» Set up Edureka’s Virtual Machine
» Execute Linux basic commands
» Execute HDFS hands-on commands
Attempt the Module-1 assignments present in the LMS.
61. Agenda for Next Class
» Hadoop 2.x Cluster Architecture
» Hadoop cluster modes
» Basic Hadoop commands
» Hadoop 2.x configuration files and their parameters
» Password-less SSH on a Hadoop cluster
» Dump of a MapReduce program
» Data loading techniques
62. Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.