BIG DATA & HADOOP
- VINOTH KUMAR
WHAT WE ARE GOING TO LEARN?
• What is Big Data?
• What is HADOOP?
• Distributed Computing
• 4 V’s of Big Data
• HADOOP Daemons
• Writing files to HDFS
• Reading files from HDFS
• Replication Factor
• Rack Awareness
• Map/Reduce
WHAT IS BIG DATA?
• Big data is a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications (Source: Tom White, Hadoop: The Definitive Guide).
• Why do we need to manage this big data?
Data is growing enormously. In earlier days, only employees and customers generated data (e.g., feedback forms, survey results). In today’s world, even machines generate data (e.g., sensors, RFID, satellites). A frequently quoted estimate is that about 90% of the world’s data was generated in the last few years.
Companies realized that they need to manage this huge amount of data. Imagine the data flow that happens in a search engine like Google. Google described its approach to distributed storage and parallel processing in two research papers, “The Google File System” and “MapReduce: Simplified Data Processing on Large Clusters”, which showed the world how it processes data at this scale.
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
http://research.google.com/archive/mapreduce.html
Wait! What is the use of this big data?
Big data is used to better understand customers, their behavior and their preferences. Take Flipkart as an example. I’m a great lover of Nike shoes, and I might have purchased a pair some 5 years ago. But for some reason my interest turned towards Adidas shoes, and I have been buying only Adidas for the last couple of years. Based on this historical data, Flipkart knows that I’m now more interested in Adidas shoes than in Nike shoes, so I may see ads for Adidas shoes whenever I visit Flipkart.
The Flipkart logs for the user “vinothkumar” will be
• In year 2010 - Bought Nike shoes
• In year 2011 - Bought Adidas shoes
• In year 2012 - Bought Adidas shoes
• In year 2013 - Bought Adidas shoes
• In year 2014 - Bought Adidas shoes
• In year 2015 - Bought Adidas shoes
Records like these are the raw material of big data. Here we’ve considered only one user, “vinothkumar”, and only one product, shoes. Imagine the logs of all Flipkart users. A company can use this big data to predict customer preferences and sell products accordingly, which adds profit to the organization.
This is just one example of big data in use. It is also used in other fields such as hospitals and the medical industry, the travel industry, etc.
WHAT IS HADOOP
• What is HADOOP?
In simple words, HADOOP is a framework for managing and processing this big data.
The data is divided into smaller chunks, and the chunks are distributed across “n” systems. We then process the chunks in parallel and combine the partial results to obtain the final result.
NETWORK IDS – Real-time application. (This is my graduate project. Source code on request.)
Imagine a scenario: a company has realized that a data breach has occurred, and it wants to track the malicious activity coming from a particular IP. The company had Wireshark running 24/7 to capture the data packets, and the capture is stored in .pcap format. Imagine the size of the file to be 1000 TB. Using traditional computing, it takes a huge amount of time to process this data, whereas in the HADOOP framework the data is split and stored across 1000 systems. Now each system has to process just 1 TB of data, and the systems work in parallel to obtain the result more quickly, i.e., the malicious activities observed from that particular IP.
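The split-and-process-in-parallel idea above can be sketched in plain Java. This is only an illustration under stated assumptions: threads on one machine stand in for the separate systems, each record is assumed to be a text line that begins with the source IP, and the class and method names (ParallelIpScan, split, countFromIp) are hypothetical. It does not use the Hadoop API and does not parse real .pcap files.

```java
import java.util.ArrayList;
import java.util.List;

class ParallelIpScan {
    // Split the records into at most `parts` contiguous chunks
    // (each chunk plays the role of the data held by one system).
    static List<List<String>> split(List<String> records, int parts) {
        List<List<String>> chunks = new ArrayList<>();
        int size = (records.size() + parts - 1) / parts; // ceiling division
        for (int i = 0; i < records.size(); i += size) {
            chunks.add(records.subList(i, Math.min(i + size, records.size())));
        }
        return chunks;
    }

    // Count the records sent from `ip`: every chunk is scanned independently
    // (possibly in parallel) and the partial counts are summed at the end.
    // Assumption: a record looks like "<source-ip> <rest of the packet info>".
    static long countFromIp(List<String> records, String ip, int parts) {
        return split(records, parts).parallelStream()
                .mapToLong(chunk -> chunk.stream()
                        .filter(r -> r.startsWith(ip + " "))
                        .count())
                .sum();
    }
}
```

For example, calling ParallelIpScan.countFromIp(records, "10.0.0.1", 1000) would scan up to 1000 chunks independently and add up the partial counts, which is the same shape as the 1000-system scenario, only on a much smaller scale.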
DISTRIBUTED COMPUTING
• Let’s understand the concept of distributed computing with a simple scenario.
• Consider the “Ox and the Load” example. Oxen are used to carry a load.
• When the load grew, we didn’t try to grow a bigger ox.
Instead, we increased the number of oxen.
The same concept is applied in distributed computing: when the data grows, we add more machines rather than building one bigger machine.
4 V’S OF BIG DATA
HADOOP DAEMONS
A daemon is a process that runs in the background of the operating system. It stops when the server shuts down. There are 5 daemons in the HADOOP (version 1.x) architecture, categorized into two types: Masters and Slaves.
MASTERS
1. Name node
2. JobTracker
3. Secondary Name node
SLAVES
4. Data node
5. TaskTracker
• As the name says, SLAVES always act on the commands received from their MASTER.
• Think of it like a typical job in the IT industry: employees always report their work to their reporting manager. Similarly, the Data node and the TaskTracker are the SLAVES, which report to the MASTER Name node and JobTracker respectively.
• Name Node (NN):
– Heart of the Hadoop architecture.
– Holds the metadata, such as which data nodes store each block of a file.
• Secondary Name Node (SNN):
– Nope, you’ve guessed it wrong. Despite its name, the SNN is not a standby or live backup of the NN. Instead it stores “checkpoints”: at a particular instant, it takes a copy of the NN metadata.
The SNN is also known as the checkpoint server.
– Checkpoints (CP):
If you are a gamer, you have probably heard the term checkpoint. When a mission fails, you resume from the last checkpoint and not from the start of the game. The same goes for the SNN: when the NN fails, its metadata can be restored from the latest CP held by the SNN. The SNN does not take over automatically, and changes made after that CP may be lost.
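The checkpoint idea can be illustrated with a small toy model. This is a sketch under assumptions, not how Hadoop implements it: the metadata is reduced to a map from file name to data nodes, and the class and method names (CheckpointDemo, record, takeCheckpoint, failAndRestore, knows) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class CheckpointDemo {
    // Live metadata: file name -> data nodes (toy stand-in for the NN metadata).
    private final Map<String, String> live = new HashMap<>();
    // Last snapshot kept by the (toy) secondary name node.
    private Map<String, String> checkpoint = new HashMap<>();

    // A change to the live metadata, e.g. a newly stored file.
    void record(String file, String dataNodes) {
        live.put(file, dataNodes);
    }

    // Copy the live metadata as it is at this instant.
    void takeCheckpoint() {
        checkpoint = new HashMap<>(live);
    }

    // Simulate an NN failure: the live state is lost and rebuilt from the checkpoint.
    void failAndRestore() {
        live.clear();
        live.putAll(checkpoint);
    }

    boolean knows(String file) {
        return live.containsKey(file);
    }
}
```

If File.txt is recorded before takeCheckpoint() and Result.txt after it, then after failAndRestore() the model still knows File.txt but no longer knows Result.txt, just as a game resumes from the checkpoint and not from where the mission failed.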
WRITING FILES TO HDFS
STEP 1: The client wants to store "File.txt", which is split into chunks (Block A, B, C), in HDFS, so it seeks the help of the Name node.
STEP 2: The Name node queries its metadata to find data nodes with free space. It replies to the client that data nodes DN 1, 5 and 6 are free and that the client can go ahead and store the data.
STEP 3: The client stores the chunks on those data nodes accordingly.
NOTE: Storing and retrieving files in HDFS is taken care of automatically by HADOOP, i.e., we need not worry about where our data sits on the data nodes. The only things HADOOP expects from us are the input data and the Mapper/Reducer program, which is explained later.
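The three steps above can be sketched as a toy name node in plain Java. Everything here is an assumption made for illustration: the free-space bookkeeping in "slots", the rule of picking the data node with the most free slots, and the names ToyNameNode, addDataNode, allocate and locate. Replication is left out, and none of this is the Hadoop API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ToyNameNode {
    // Free block slots per data node (hypothetical bookkeeping).
    private final Map<String, Integer> freeSlots = new LinkedHashMap<>();
    // Metadata: "file/block" -> data node chosen for that block.
    private final Map<String, String> blockLocations = new LinkedHashMap<>();

    void addDataNode(String name, int slots) {
        freeSlots.put(name, slots);
    }

    // Steps 1-2: the client asks where to put each block; the name node picks
    // the data node with the most free slots and records the choice.
    // Step 3 (the client sending the bytes to that node) is not modelled.
    List<String> allocate(String file, List<String> blocks) {
        List<String> targets = new ArrayList<>();
        for (String block : blocks) {
            String target = null;
            int mostFree = 0;
            for (Map.Entry<String, Integer> e : freeSlots.entrySet()) {
                if (e.getValue() > mostFree) {
                    mostFree = e.getValue();
                    target = e.getKey();
                }
            }
            if (target == null) {
                throw new IllegalStateException("no data node has free space");
            }
            freeSlots.put(target, mostFree - 1);
            blockLocations.put(file + "/" + block, target);
            targets.add(target);
        }
        return targets;
    }

    // Look up where a block of a file was stored.
    String locate(String file, String block) {
        return blockLocations.get(file + "/" + block);
    }
}
```

With DN1, DN5 and DN6 registered with two free slots each, allocating blocks A, B and C of File.txt returns DN1, DN5 and DN6 in that order. The client never chooses a location itself, which is the point of the NOTE above.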
READING FILES FROM HDFS
STEP 1: The client wants to retrieve "Result.txt" (Blk A, B, C), so it seeks the help of the Name node.
STEP 2: The Name node queries its metadata to find out where Result.txt is stored. It replies to the client that
Blk A is stored in DN 1, 5, 6
Blk B is stored in DN 8, 1, 2
Blk C is stored in DN 5, 8, 9
You may be wondering why HADOOP stores the same block on 3 different data nodes (e.g., Blk A is stored in DN 1, 5 and 6). That is where the Replication Factor (RF) comes into the picture.
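The lookup in STEP 2 can be sketched as a simple map from block to data nodes. The block locations are the ones listed above; the class and method names (BlockLocator, locate, pickReplica) and the rule of reading from the first replica that has not failed are assumptions made for illustration, not the Hadoop API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BlockLocator {
    // Metadata kept by the (toy) name node: block -> data nodes holding a copy.
    private final Map<String, List<String>> replicas = new LinkedHashMap<>();

    BlockLocator() {
        // Locations taken from the Result.txt example in the slide.
        replicas.put("A", Arrays.asList("DN1", "DN5", "DN6"));
        replicas.put("B", Arrays.asList("DN8", "DN1", "DN2"));
        replicas.put("C", Arrays.asList("DN5", "DN8", "DN9"));
    }

    // STEP 2: return every data node that holds the block (empty if unknown).
    List<String> locate(String block) {
        return replicas.getOrDefault(block, Collections.<String>emptyList());
    }

    // The client reads from the first replica that is not on a failed node.
    String pickReplica(String block, Set<String> failedNodes) {
        for (String dataNode : locate(block)) {
            if (!failedNodes.contains(dataNode)) {
                return dataNode;
            }
        }
        throw new IllegalStateException("no available replica of block " + block);
    }
}
```

If DN1 is down, pickReplica("A", ...) falls back to DN5. This fallback only works because each block lives on more than one data node.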
REPLICATION FACTOR
The HADOOP architecture is designed in such a way that data loss shouldn’t occur. Failures may still occur for whatever reason, and in such cases we need to ensure that the data remains available. So each block of data is replicated three times and the copies are stored on different data nodes. In other words, the default replication factor in Hadoop is 3.
But if you look carefully, you can see that Blk A is stored twice in Rack-5 and once in Rack-1. This is a specialty of the HADOOP architecture.
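The cost and benefit of a replication factor of 3 come down to simple arithmetic, sketched below. The 128 MB block size in the usage note is an assumption (the slides do not state a block size), and the class and method names are hypothetical.

```java
class ReplicationMath {
    // Number of blocks a file is split into (ceiling division).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw cluster storage used once every block has been replicated.
    static long rawBytes(long fileBytes, int replicationFactor) {
        return fileBytes * replicationFactor;
    }

    // Copies of a block that can be lost while one copy still survives.
    static int toleratedFailures(int replicationFactor) {
        return replicationFactor - 1;
    }
}
```

For a 300 MB file with an assumed 128 MB block size, blockCount gives 3 blocks, rawBytes with a replication factor of 3 gives 900 MB of raw storage, and toleratedFailures(3) gives 2: the price of surviving two lost copies of a block is three times the disk space.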
RACK AWARENESS
• When Block A (Blk A) has to be retrieved, hadoop prefers a copy that is close to the reader. In the previous slide, RACK-5 holds two copies of Blk A, on DN-5 and DN-6, and RACK-1 holds one copy, on DN-1.
• Imagine a scenario: the reader (the file pointer, FP) is in RACK-5 and, for some reason, DN-5 fails. Hadoop then has to fetch Blk A either from DN-6 in RACK-5 or from DN-1 in RACK-1.
• Going to DN-1 in RACK-1 takes longer, because the data has to travel between racks. Instead hadoop can quickly retrieve the other copy of Blk A from DN-6, which is in the same RACK-5. So by default HADOOP stores two copies of a block in one RACK and the third copy in a different RACK. The copy in the other rack also keeps the block available if a whole rack fails.
• This concept is known as RACK AWARENESS, and it makes HADOOP more efficient when retrieving data.
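The choice between DN-6 and DN-1 can be sketched as picking the replica with the smallest network distance. The distance values used here (0 for the same node, 2 for the same rack, 4 for a different rack) are an assumption loosely modelled on how Hadoop's network topology is commonly described, and the names RackAwareReader, distance and closest are hypothetical.

```java
import java.util.Map;

class RackAwareReader {
    // Toy distance: 0 = same node, 2 = same rack, 4 = different rack.
    static int distance(String nodeA, String rackA, String nodeB, String rackB) {
        if (nodeA.equals(nodeB)) {
            return 0;
        }
        if (rackA.equals(rackB)) {
            return 2;
        }
        return 4;
    }

    // replicas maps data node -> rack. Returns the live replica closest to the
    // reader, or null if no replica is available. failedNode may be null.
    static String closest(String readerNode, String readerRack,
                          Map<String, String> replicas, String failedNode) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> replica : replicas.entrySet()) {
            if (replica.getKey().equals(failedNode)) {
                continue; // skip the data node that is down
            }
            int d = distance(readerNode, readerRack, replica.getKey(), replica.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = replica.getKey();
            }
        }
        return best;
    }
}
```

With Blk A on DN-1 (RACK-1), DN-5 and DN-6 (both RACK-5), a reader on DN-5 whose local copy has failed gets DN-6 (distance 2) rather than DN-1 (distance 4), which matches the scenario above.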
MAP / REDUCE
• Map and Reduce are simply Java programs that do the processing on the data. They answer the following question:
“What do you want to do with this data?”
class Reducer
{
// java code
}
class Mapper
{
// java code
}
Writing your first hadoop program
Word Count is a simple application that counts the number of occurrences of each word in a given input set. This example is generally considered the “Hello World” program of Hadoop.
Visit the following link to learn how to install Hadoop and how to write the mapper and reducer code that counts the occurrences of each word:
http://javabeginnerstutorial.com/hadoop/your-first-hadoop-map-reduce-job
• Mapper:
The mapper processes the input data according to the Java code written in the mapper class and emits small pieces of intermediate data as key/value pairs. In the word count example, it emits one (word, 1) pair for every word in the input file.
• Reducer:
Between the two phases, the framework shuffles and sorts the mapper output so that all values for the same key are grouped together. The reducer then processes each group and writes the output in the reduced, user-required format (here, the total count for each word).
This can be better understood with the diagram in the next slide.
Word count example
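The word count flow can also be sketched in plain Java. This is a single-machine simulation of the map, shuffle/sort and reduce steps, not the Hadoop Mapper/Reducer API; the class and method names (WordCountSketch, map, shuffle, reduce, run) and the choice to lower-case the words are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map phase: emit one (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort: group the values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one word.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three steps over all input lines and return word -> count.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }
}
```

Running WordCountSketch.run on the three lines "Deer Bear River", "Car Car River" and "Deer Car Bear" yields {bear=2, car=3, deer=2, river=2}. In a real Hadoop job the map calls would run on the data nodes that hold the input blocks, and the framework would perform the shuffle across the network.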
CONCLUSION
• Big data and Hadoop are among the hottest topics in the IT industry. Everyone wants to learn this technology, but there are not many professionals with in-depth knowledge of these frameworks, so big data engineers are always in demand.
• The subject of HADOOP and big data is vast, and it is very difficult to explain the whole HADOOP ecosystem in slides. This presentation is mainly aimed at beginners, to kick-start their brain.exe in the field of big data and HADOOP.
• I’ve tried my best to explain the concepts in a simple way that even a beginner can understand.
• Feedback is always appreciated.
• Thanks for taking the time to read my slides.
REFERENCES
• https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial
• www.Hadoop-skills.net
• https://www.udemy.com/overview-of-big-data-hadoop/
• http://www.tutorialspoint.com/hadoop/
THANK YOU
