APACHE HADOOP
BY: SAILI MANE
ID NO: 12IT113
Q. WHAT IS BIG DATA????
• Big data is a term used to describe the voluminous
amount of unstructured and semi-structured data a
company creates.
• It is data that would take too much time and cost too
much money to load into a relational database for
analysis.
• Big data doesn't refer to any specific quantity, but the
term is often used when speaking about petabytes
and exabytes of data.
SO WHAT IS THE PROBLEM??
• The problem is that while the storage capacities of
hard drives have increased massively over the years,
access speeds (the rate at which data can be read from
drives) have not kept up.
• One typical drive from 1990 could store 1370 MB of
data and had a transfer speed of 4.4 MB/s, so we could
read all the data from a full drive in around 300 seconds.
• In 2010, 1 TB drives are the standard hard disk size,
but the transfer speed is around 100 MB/s, so it takes
more than two and a half hours to read all the data off
the disk.
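The arithmetic behind these two figures can be checked with a short sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a constant transfer rate; the function name is illustrative only.

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float) -> float:
    """Time to read an entire drive sequentially at a constant transfer rate."""
    return capacity_mb / transfer_mb_per_s


# 1990 drive from the slide: 1370 MB at 4.4 MB/s
drive_1990_s = full_read_seconds(1370, 4.4)

# 2010 drive from the slide: 1 TB at 100 MB/s (assuming 1 TB = 1,000,000 MB)
drive_2010_s = full_read_seconds(1_000_000, 100)

print(f"1990 drive: {drive_1990_s:.0f} s")
print(f"2010 drive: {drive_2010_s / 3600:.2f} h")
```

Capacity grew by a factor of roughly 730 while the transfer rate grew by a factor of roughly 23, which is why the full-drive read time went from minutes to hours.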
Possible solutions!!!!
• Parallelization: multiple processors or CPUs in a
single machine
• Distributed computing: multiple computers
connected via a network
The key issues involved in this solution:
• Hardware failure
• Combining the data after analysis
• Network-associated problems
TO THE RESCUE!!!!...HADOOP!!
• A framework for storing and processing big data on lots of
commodity machines.
• Open source Apache project
• High reliability achieved in software
• Implemented in Java
• A common way of avoiding data loss is through replication
• Hadoop is the popular open source implementation of
MapReduce, a powerful tool designed for deep analysis and
transformation of very large data sets.
INTRODUCTION
Hadoop has two main layers:
• Computation layer: the computation tier uses a framework called
MapReduce.
• Distributed storage layer: a distributed filesystem called HDFS provides
storage.
WHY HADOOP???
• Building bigger and bigger servers is no longer necessarily the
best solution to large-scale problems. Nowadays the popular
approach is to tie many low-end machines together
as a single functional distributed system. For example:
• A high-end machine with four I/O channels, each having a
throughput of 100 MB/sec, will require almost three hours to read a 4
TB data set! With Hadoop, this same data set will be divided
into smaller (typically 64 MB) blocks that are spread among
many machines in the cluster via the Hadoop Distributed File
System (HDFS).
• With a modest degree of replication, the cluster machines can
read the data set in parallel and provide a much higher
throughput. Moreover, it's cheaper than one high-end server!
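A rough sketch of the comparison above follows. The cluster size (100 machines) and the per-machine read rate (100 MB/s) are assumptions chosen for illustration, not figures from the slides, and the model ignores replication, network, and scheduling overhead.

```python
import math

DATASET_MB = 4_000_000   # 4 TB, assuming decimal units
BLOCK_MB = 64            # typical block size quoted on the slide


def block_count(dataset_mb: int, block_mb: int) -> int:
    """Number of blocks the data set is divided into."""
    return math.ceil(dataset_mb / block_mb)


def single_machine_seconds(dataset_mb: int, channels: int, mb_per_s: float) -> float:
    """Read time on one machine whose I/O channels all read in parallel."""
    return dataset_mb / (channels * mb_per_s)


def cluster_seconds(dataset_mb: int, block_mb: int, machines: int, mb_per_s: float) -> float:
    """Idealized read time when blocks are spread evenly and read in parallel.

    Treats every block as full-sized, so it slightly overestimates when the
    last block is partial.
    """
    blocks_per_machine = math.ceil(block_count(dataset_mb, block_mb) / machines)
    return blocks_per_machine * block_mb / mb_per_s


blocks = block_count(DATASET_MB, BLOCK_MB)
single_s = single_machine_seconds(DATASET_MB, channels=4, mb_per_s=100)
cluster_s = cluster_seconds(DATASET_MB, BLOCK_MB, machines=100, mb_per_s=100)  # hypothetical cluster

print(blocks, single_s / 3600, cluster_s / 60)
```

Under these assumptions the single machine needs about 2.8 hours, while the hypothetical 100-machine cluster needs under 7 minutes.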
For computationally intensive work:
• Most distributed systems take the approach of moving
the data to the place where the computation will take place.
After the computation, the resulting data is moved back for
storage. This approach works fine for computationally intensive
work.
For data-intensive work:
• We need a better approach. Hadoop focuses on moving the
code/algorithm to the data instead of moving the data to the
code/algorithm.
• The move-code-to-data philosophy applies within the Hadoop
cluster itself. Data is broken up and distributed across the
cluster, and computation on a piece of data takes place on the
same machine where that piece of data resides.
The move-code-to-data philosophy makes sense because the
code/algorithm is almost always much smaller than the data,
so the code/algorithm is easier to move around.
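The move-code-to-data idea can be sketched as a placement rule: run each task on a node that already stores the block it needs. The block and node names below are hypothetical, and the least-loaded tie-break is an assumption made for the sketch rather than Hadoop's actual scheduling policy.

```python
# Hypothetical block-to-node map: which nodes hold a copy of each block.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}


def place_tasks(locations: dict[str, list[str]]) -> dict[str, str]:
    """Assign each block's task to the least-loaded node that stores the block."""
    load: dict[str, int] = {}
    placement: dict[str, str] = {}
    for block, nodes in locations.items():
        # Only nodes that already hold the block are candidates,
        # so no block data has to cross the network.
        chosen = min(nodes, key=lambda n: load.get(n, 0))
        placement[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return placement


placement = place_tasks(block_locations)
print(placement)
```

Every task ends up on a node that holds its input, so only the (small) code has to be shipped.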
HADOOP FEATURES
• Robust
• Scalable
• Simple
• Accessible
HDFS!!!
REDUNDANT STORAGE…!!!
HDFS ARCHITECTURE
Namenodes and Datanodes!!!
• An HDFS cluster has a namenode (the master) and a
number of datanodes (workers).
• The namenode manages the
filesystem namespace. It maintains
the filesystem tree and the
metadata for all the files and
directories in the tree.
• Datanodes are the workhorses of
the filesystem. They store and
retrieve blocks when they are told
to (by clients or the namenode),
and they report back to the
namenode periodically with lists
of blocks that they are storing.
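The division of labour described above can be modelled with a toy sketch. The class and method names are invented for illustration and do not correspond to the real HDFS API; the model only shows that the namenode holds metadata while datanodes hold block contents and report what they store.

```python
class NameNode:
    """Toy master: keeps the namespace and learns block locations from reports."""

    def __init__(self):
        self.namespace = {}   # file path -> ordered list of block ids
        self.block_map = {}   # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Forget what this datanode reported earlier, then record the fresh list.
        for holders in self.block_map.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return (block id, datanodes holding it) for each block of a file."""
        return [(b, sorted(self.block_map.get(b, set()))) for b in self.namespace[path]]


class DataNode:
    """Toy worker: stores and retrieves blocks, and reports them periodically."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}      # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def report_to(self, namenode):
        namenode.receive_block_report(self.name, list(self.blocks))


nn = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store("blk_1", b"hello ")
dn2.store("blk_1", b"hello ")
dn2.store("blk_2", b"world")
nn.add_file("/demo.txt", ["blk_1", "blk_2"])
dn1.report_to(nn)
dn2.report_to(nn)
locations = nn.locate("/demo.txt")
print(locations)
```

A client would ask the namenode where the blocks of `/demo.txt` live and then read the bytes directly from the datanodes.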
Goals of HDFS
• Streaming data access
• Commodity hardware
• Simple coherency model
• Portability
MAPREDUCE!!!
Jobtracker:
• Receives the MapReduce job execution request
from the client.
• Does sanity checks to see if the job is configured properly.
• Computes the input splits.
• Loads the resources required for the job into HDFS.
• Assigns splits to tasktrackers for the map and reduce phases.
• Map split assignment is data-locality-aware.
• Is a single point of failure.
Tasktracker:
• Creates a new process for the task and
executes it.
• Sends periodic heartbeats to the jobtracker, along with
other information about the task.
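The "computes the input splits" step can be illustrated with a minimal sketch. It assumes splits are plain (offset, length) byte ranges of a fixed size; real input formats also respect record boundaries, which this sketch ignores.

```python
def compute_splits(file_size: int, split_size: int) -> list[tuple[int, int]]:
    """Cut a file of file_size bytes into (offset, length) ranges of at most split_size."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# Hypothetical 200-byte file with 64-byte splits, to keep the numbers readable.
splits = compute_splits(200, 64)
print(splits)
```

Each resulting split becomes the input of one map task.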
MAPREDUCE THINKING!!
MapReduce data flow with multiple reduce tasks
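A single-process sketch of this data flow, using word count as the example job, might look as follows. It only imitates the map, partition/shuffle, and reduce stages in memory; the function names and the CRC32-based partitioner are assumptions made for the sketch, not Hadoop's API.

```python
import zlib
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def partition(key, num_reducers):
    """Pick the reducer for a key; CRC32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(lines, num_reducers=2):
    # Map and shuffle: route every pair to its reducer and group values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in map_words(line):
            buckets[partition(key, num_reducers)][key].append(value)
    # Reduce: each reducer sums the values of the keys it received.
    return [{key: sum(values) for key, values in sorted(bucket.items())}
            for bucket in buckets]


outputs = run_job(["big data big cluster", "data moves to code"])
merged = {key: count for part in outputs for key, count in part.items()}
print(outputs)
```

Each reduce task produces its own output, and every key appears in exactly one of them because the partitioner sends all values for a key to the same reducer.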
ADVANTAGES
• Hadoop is a platform that provides both distributed storage and
computational capabilities.
• Hadoop is extremely scalable.
• It is optimized for high throughput.
• HDFS uses large block sizes, so it works best when
manipulating large files.
• Scalability and availability are the distinguishing features of
HDFS, which achieves them through data replication and fault tolerance.
• HDFS can replicate files a specified number of times, which makes it
tolerant of software and hardware failures.
• Hadoop uses the MapReduce framework, a batch-based, distributed
computing framework that allows parallel work over a large amount of
data.
• MapReduce lets developers focus on addressing business needs
rather than getting involved in distributed system complications.
• MapReduce decomposes the job into map and reduce tasks and schedules
them for remote execution.
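The replication point can be illustrated with a small sketch: if every block is copied to several distinct nodes, the block stays readable as long as at least one of those nodes survives. The round-robin placement below is a simplification chosen for the sketch and is not HDFS's actual placement policy; the node names are hypothetical.

```python
def place_replicas(block_id: int, nodes: list[str], replication: int) -> list[str]:
    """Pick `replication` distinct nodes for a block, round-robin from a per-block start."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def is_readable(replica_nodes: list[str], failed_nodes: set[str]) -> bool:
    """A block is readable while at least one node holding a replica is alive."""
    return any(node not in failed_nodes for node in replica_nodes)


nodes = ["n0", "n1", "n2", "n3"]          # hypothetical cluster
replicas = place_replicas(5, nodes, replication=3)

survives_two_failures = is_readable(replicas, {"n1", "n2"})
survives_three_failures = is_readable(replicas, {"n1", "n2", "n3"})
print(replicas, survives_two_failures, survives_three_failures)
```

With a replication factor of 3, the block tolerates the loss of any two of the nodes that hold it.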
DISADVANTAGES
• Hadoop uses HDFS and MapReduce, and both of
their master processes are single points of failure, although
there is active work going on for high-availability versions.
• Until the Hadoop 2.x release, HDFS and MapReduce will be
using single-master models, which can result in single points of
failure.
• Security is also a major concern: Hadoop
does offer a security model, but it is
disabled by default because of its high complexity.
• Hadoop does not offer storage-level or network-level encryption.
• HDFS is inefficient at handling small files, and it
lacks transparent compression.
• MapReduce is a batch-based architecture, which means it does
not lend itself to use cases that need real-time data access.
• MapReduce is a shared-nothing architecture, so tasks that
require global synchronization or sharing of mutable data are
not a good fit.
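One way to see why small files are a poor fit for HDFS is to count the metadata entries the namenode has to keep for them. The sketch below assumes one entry per file plus one per block, and uses the commonly quoted rule of thumb of roughly 150 bytes of namenode memory per entry; that figure is an assumption, not something stated in these slides.

```python
import math

BYTES_PER_OBJECT = 150   # assumed rule-of-thumb cost of one namenode entry
BLOCK_MB = 64


def namenode_objects(num_files: int, file_size_mb: float, block_mb: int = BLOCK_MB) -> int:
    """Count file entries plus block entries for num_files files of equal size."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_mb))
    return num_files * (1 + blocks_per_file)


# Roughly the same 1 GB of data, stored two ways.
one_large = namenode_objects(num_files=1, file_size_mb=1024)
many_small = namenode_objects(num_files=10_000, file_size_mb=0.1)

print(one_large, one_large * BYTES_PER_OBJECT)
print(many_small, many_small * BYTES_PER_OBJECT)
```

Under these assumptions the same volume of data costs the namenode over a thousand times more metadata when it is stored as many small files.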
Hadoop related technologies