Introduction to Hadoop
Dr. C.V. Suresh Babu
(Centre for Knowledge Transfer) institute
Discussion Topics
• What is Hadoop?
• Need for Hadoop
• History of Hadoop
• Hadoop Overview
• Advantages and Disadvantages of Hadoop
• Hadoop Distributed File System
• Comparing: RDBMS vs. Hadoop
• Advantages and Disadvantages of HDFS
• Hadoop frameworks
• Modules of Hadoop frameworks
• Features of Hadoop
• Hadoop Analytics Tools
What is Hadoop?
• Hadoop is an open-source software framework for storing large amounts of data and running computations on it.
• The framework is written mainly in Java, with some native code in C and shell scripts.
• Hadoop is used for advanced analytics, including machine learning and data mining.
Need for Hadoop
• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination
Programmers no longer need to worry about:
• Where is a file located?
• How to handle failures and data loss?
• How to divide the computation?
• How to program for scaling?
History of Hadoop
• Hadoop is developed under the Apache Software Foundation; its co-founders are Doug Cutting and Mike Cafarella.
• Doug Cutting named it after his son's toy elephant. The first paper to be released, in October 2003, was Google's Google File System paper.
• In January 2006, MapReduce development started within Apache Nutch, which contained around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS.
• In April 2006, Hadoop 0.1.0 was released.
Advantages and Disadvantages of Hadoop
Advantages:
• Ability to store a large amount of data.
• High flexibility.
• Cost effective.
• High computational power.
• Tasks are independent.
• Linear scaling.
Disadvantages:
• Not very effective for small data.
• Hard cluster management.
• Has stability issues.
• Security concerns.
Hadoop Distributed File System
• Hadoop has a distributed file system, known as HDFS, which splits files into blocks and distributes them across the nodes of a large cluster.
• If a node fails, the system keeps operating; HDFS handles the transfer of data between the nodes.
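The block-splitting idea can be illustrated with a small Python sketch. It is not HDFS code: the function names, the node names, the tiny block size, and the round-robin placement are assumptions made for illustration, and HDFS's real placement policy (which is rack-aware) is not modeled here.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        # Start at a different node for each block so load is spread out.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A 10-byte "file" with a 4-byte block size yields blocks of 4, 4, and 2 bytes.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"], replication=2)
```

Because every block lives on more than one node, losing a single node never removes the only copy of a block.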
Comparing: RDBMS vs. Hadoop
Attribute           | Traditional RDBMS       | Hadoop / MapReduce
Data size           | Gigabytes (terabytes)   | Petabytes (exabytes)
Access              | Interactive and batch   | Batch, not interactive
Updates             | Read / write many times | Write once, read many times
Structure           | Static schema           | Dynamic schema
Integrity           | High (ACID)             | Low
Scaling             | Nonlinear               | Linear
Query response time | Can be near immediate   | Has latency (due to batch processing)
Advantages of HDFS:
• It is inexpensive, immutable in nature, reliable, fault-tolerant, scalable, and block-structured, and it can process a large amount of data simultaneously.
Disadvantages of HDFS:
• Its biggest disadvantage is that it is not suited to small quantities of data. It can also have stability issues and can be restrictive and rough to work with.
Some common frameworks of Hadoop
• Hive: uses HiveQL to structure data and to express complex MapReduce jobs over data in HDFS.
• Drill: supports user-defined functions and is used for data exploration.
• Storm: allows real-time processing and streaming of data.
• Spark: includes a machine learning library (MLlib) and is widely used for data processing. It supports Java, Python, and Scala.
• Pig: provides Pig Latin, a SQL-like language, and performs data transformation of unstructured data.
• Tez: reduces the complexity of Hive and Pig and helps their code run faster.
Modules of Hadoop frameworks
The Hadoop framework is made up of the following modules:
1. Hadoop MapReduce: a MapReduce programming model for handling and processing large data.
2. Hadoop Distributed File System: distributes files across the nodes of a cluster.
3. Hadoop YARN: a platform that manages computing resources.
4. Hadoop Common: contains the packages and libraries used by the other modules.
Features of Hadoop
Suitable for Big Data Analysis
• Because Big Data tends to be distributed and unstructured, Hadoop clusters are well suited to analyzing it.
• Because it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed.
• This is called data locality, and it helps increase the efficiency of Hadoop-based applications.
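Data locality can be sketched as a scheduling decision: given the nodes that already hold a block, prefer to run the task on one of them. The function below is a hypothetical illustration, not part of any Hadoop API; `pick_node`, the node names, and the alphabetical fallback rule are all assumptions.

```python
def pick_node(block_locations: list, free_nodes: set) -> tuple:
    """Return (node, is_local), preferring a free node that already stores the block."""
    if not free_nodes:
        raise RuntimeError("no free node available")
    for node in block_locations:
        if node in free_nodes:
            # The code moves to the data: no block transfer is needed.
            return node, True
    # No replica holder is free, so the block must travel over the network.
    return sorted(free_nodes)[0], False


# node-c holds the block and is free, so the task runs there.
local_choice = pick_node(["node-b", "node-c"], {"node-a", "node-c"})
# Only node-a is free and it holds no replica, so the data has to move.
remote_choice = pick_node(["node-b"], {"node-a"})
```

In the first call only the task description crosses the network; in the second call the whole block does, which is the cost data locality tries to avoid.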
Scalability
• Hadoop clusters can be scaled out by adding cluster nodes, which allows for the growth of Big Data.
• Scaling does not require modifications to application logic.
Fault Tolerance
• The Hadoop ecosystem can replicate the input data onto other cluster nodes.
• That way, if a cluster node fails, data processing can still proceed using the data stored on another cluster node.
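The effect of replication on a read can be shown with a minimal sketch. The dictionaries below stand in for a cluster; `read_block`, the node names, and the data are hypothetical and do not reflect how a real Hadoop client locates replicas.

```python
def read_block(block_id: int, placement: dict, storage: dict, alive: set) -> bytes:
    """Return the block's bytes from the first replica that is on a live node."""
    for node in placement[block_id]:
        if node in alive:
            return storage[node][block_id]
    raise IOError(f"all replicas of block {block_id} are unavailable")


# Block 0 is replicated on two nodes (illustrative data).
placement = {0: ["node-a", "node-b"]}
storage = {"node-a": {0: b"rows"}, "node-b": {0: b"rows"}}

# node-a has failed, so the read is served by the replica on node-b.
data = read_block(0, placement, storage, alive={"node-b"})
```

Processing only stops when every node holding a replica is down, which is why a higher replication factor tolerates more simultaneous failures.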
• Scales to petabytes or more easily
• Parallel data processing
• Suited for particular types of big data problems
Hadoop Analytics Tools
• A wide range of analytics tools is available that help Hadoop handle very large data sets efficiently.
• The following slides cover ten of the most widely used Hadoop analytics tools for big data, one by one.
Apache Spark
• Apache Spark is an open-source processing engine designed for ease of analytics operations.
• It is a cluster computing platform designed to be fast and general purpose.
• Spark is designed to cover batch applications, machine learning, streaming data processing, and interactive queries.
Features of Spark:
• In-memory processing
• Tight integration of components
• Easy to use and inexpensive
• A powerful processing engine makes it fast
• Spark Streaming provides a high-level library for stream processing
MapReduce
• MapReduce is a programming model, comparable to an algorithm or a data structure, that runs on the YARN framework.
• The primary feature of MapReduce is distributed processing in parallel across a Hadoop cluster. This is what makes Hadoop fast, because serial processing is no longer practical when dealing with Big Data.
Features of MapReduce:
• Scalable
• Fault tolerance
• Parallel processing
• Tunable replication
• Load balancing
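The MapReduce model can be illustrated with a word count written in plain Python. This is a single-process sketch under simplifying assumptions: the function names are illustrative, nothing here uses the Hadoop API, and in a real cluster the map and reduce steps would run as parallel tasks on different nodes.

```python
from collections import defaultdict


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list) -> dict:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: list) -> dict:
    # Each line could be mapped independently, which is what allows parallelism.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big clusters"])
# counts == {"big": 2, "data": 1, "clusters": 1}
```

The map calls share no state and each reduce call sees only one key, so both phases can be spread over many machines without changing the program's logic.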
Apache Hive
• Apache Hive is a data warehousing tool built on top of Hadoop. Data warehousing means storing data generated from various sources in one fixed location.
• Hive is one of the best tools for data analysis on Hadoop.
• Anyone who knows SQL can use Apache Hive comfortably.
• The query language of Hive is known as HQL or HiveQL.
Features of Hive:
• Queries are similar to SQL queries.
• Hive supports different storage types such as HBase, ORC, and plain text.
• Hive has built-in functions for data mining and other tasks.
• Hive operates on compressed data stored in the Hadoop ecosystem.
Apache Impala
• Apache Impala is an open-source SQL engine designed for Hadoop.
• Impala overcomes the speed-related issues in Apache Hive with its faster processing.
• Apache Impala uses the same kind of SQL syntax, ODBC driver, and user interface as Apache Hive.
• Apache Impala can easily be integrated with Hadoop for data analytics purposes.
Features of Impala:
• Easy integration
• Scalability
• Security
• In-memory data processing
Apache Mahout
• The name Mahout is taken from the Hindi word Mahavat, which means elephant rider.
• Apache Mahout runs its algorithms on top of Hadoop, hence the name Mahout.
• Mahout is mainly used for implementing machine learning algorithms on Hadoop, such as classification, collaborative filtering, and recommendation.
• Apache Mahout can also run machine learning algorithms without being integrated with Hadoop.
Features of Mahout:
• Used for machine learning applications
• Mahout has vector and matrix libraries
• Ability to analyze large datasets quickly
Apache Pig
• Pig was initially developed by Yahoo to make programming easier.
• Apache Pig can process extensive datasets because it works on top of Hadoop.
• Apache Pig is used for analyzing massive datasets by representing them as data flows.
• Apache Pig also raises the level of abstraction for processing enormous datasets.
• Pig Latin is the scripting language that developers use with the Pig framework; it runs on the Pig runtime.
Features of Pig:
• Easy to program
• Rich set of operators
• Ability to handle various kinds of data
• Extensibility
HBase
• HBase is a non-relational, distributed, column-oriented NoSQL database. It consists of tables, and each table has multiple data rows.
• Each row has multiple column families, and each column family has columns that contain key-value pairs.
• HBase works on top of HDFS (Hadoop Distributed File System).
• HBase is used for looking up small amounts of data within massive datasets.
Features of HBase:
• HBase has linear and modular scalability
• The Java API can easily be used for client access
• Block cache for real-time data queries
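The table, row, column family, and column layout described above can be pictured as nested dictionaries. The class below is a toy model for illustration only: `ToyTable` and its methods are hypothetical names, it is not the HBase client API, and it leaves out features such as cell versions and timestamps.

```python
class ToyTable:
    """Toy model of a column-family table: row key -> family -> column -> value."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a family are created on first write.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, family, column):
        return self.rows[row_key][family][column]


users = ToyTable(families=["info", "stats"])
users.put("user1", "info", "name", "Asha")
users.put("user1", "stats", "logins", 42)
name = users.get("user1", "info", "name")
```

Looking up a single cell needs only the row key, the family, and the column name, which is why this layout suits fetching a small piece of data out of a very large table.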
Apache Sqoop
• Sqoop is a command-line tool developed by Apache.
• The primary purpose of Apache Sqoop is to import structured data from relational database management systems (RDBMS) such as MySQL, SQL Server, and Oracle into HDFS (Hadoop Distributed File System).
• Sqoop can also export data from HDFS to an RDBMS.
Features of Sqoop:
• Sqoop can import data to Hive or HBase
• Connecting to a database server
• Controlling parallelism
Tableau
• Tableau is data visualization software that can be used for data analytics and business intelligence.
• It provides a variety of interactive visualizations to showcase insights from the data, can translate queries into visualizations, and can import data of all ranges and sizes.
• Tableau offers rapid analysis and processing, generating useful charts on interactive dashboards and worksheets.
Features of Tableau:
• Tableau supports bar charts, histograms, pie charts, motion charts, bullet charts, Gantt charts, and many more
• Secure and robust
• Interactive dashboards and worksheets
Apache Storm
• Apache Storm is a free, open-source, distributed real-time computation system built using programming languages such as Clojure and Java.
• It can be used with many programming languages.
• Apache Storm is used for stream processing, which it performs very fast.
• Apache Storm uses daemons such as Nimbus, ZooKeeper, and Supervisor.
• Apache Storm can be used for real-time processing, online machine learning, and more. Companies such as Yahoo, Spotify, Twitter, and many others use Apache Storm.
Features of Storm:
• Easy to operate
• Each node can process millions of tuples per second
• Scalable and fault tolerant
Companies Using Hadoop
Common Hadoop Distributions
• Open source:
  • Apache
• Commercial:
  • Cloudera
  • Hortonworks
  • MapR
  • AWS MapReduce
  • Microsoft Azure HDInsight (Beta)

Contenu connexe

Tendances

Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 

Tendances (20)

PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Hive
HiveHive
Hive
 
Data streaming fundamentals
Data streaming fundamentalsData streaming fundamentals
Data streaming fundamentals
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 

Similaire à Introduction to Hadoop

Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystemAmit Bhardwaj
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsNetajiGandi1
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxDr.Florence Dayana
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in AzureMostafa
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in AzureMostafa
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxraghavanand36
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction葵慶 李
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxiaeronlineexm
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azureMostafa
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 

Similaire à Introduction to Hadoop (20)

Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystem
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
 
Cloudera Hadoop Distribution
Cloudera Hadoop DistributionCloudera Hadoop Distribution
Cloudera Hadoop Distribution
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Hadoop
HadoopHadoop
Hadoop
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptx
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 

Plus de Dr. C.V. Suresh Babu (20)

Data analytics with R
Data analytics with RData analytics with R
Data analytics with R
 
Association rules
Association rulesAssociation rules
Association rules
 
Clustering
ClusteringClustering
Clustering
 
Classification
ClassificationClassification
Classification
 
Blue property assumptions.
Blue property assumptions.Blue property assumptions.
Blue property assumptions.
 
Introduction to regression
Introduction to regressionIntroduction to regression
Introduction to regression
 
DART
DARTDART
DART
 
Mycin
MycinMycin
Mycin
 
Expert systems
Expert systemsExpert systems
Expert systems
 
Dempster shafer theory
Dempster shafer theoryDempster shafer theory
Dempster shafer theory
 
Bayes network
Bayes networkBayes network
Bayes network
 
Bayes' theorem
Bayes' theoremBayes' theorem
Bayes' theorem
 
Knowledge based agents
Knowledge based agentsKnowledge based agents
Knowledge based agents
 
Rule based system
Rule based systemRule based system
Rule based system
 
Formal Logic in AI
Formal Logic in AIFormal Logic in AI
Formal Logic in AI
 
Production based system
Production based systemProduction based system
Production based system
 
Game playing in AI
Game playing in AIGame playing in AI
Game playing in AI
 
Diagnosis test of diabetics and hypertension by AI
Diagnosis test of diabetics and hypertension by AIDiagnosis test of diabetics and hypertension by AI
Diagnosis test of diabetics and hypertension by AI
 
A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”
 
A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”
 

Dernier

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Introduction to Hadoop

  • 1. Introductio nto Dr. C.V. Suresh Babu (CentreforKnowledgeTransfer) institute
  • 2. Discussion Topics • What is Hadoop? • Need for Hadoop • History of Hadoop • Hadoop Overview • Advantages and Disadvantages of Hadoop • Hadoop Distributed File System • Comparing: RDBMS vs. Hadoop • Advantages and Disadvantages of HDFS • Hadoop frameworks • Modules of Hadoop frameworks • Features of Hadoop • Hadoop Analytics Tools
  • 3. What is Hadoop? • Hadoop is an open-source software framework for storing large amounts of data and running computations on it. • The framework is written mainly in Java, with some native code in C and shell scripts. • Hadoop is used for advanced analytics, including machine learning and data mining.
  • 4. Need for Hadoop • Redundant, fault-tolerant data storage • Parallel computation framework • Job coordination. Programmers no longer need to worry about: Q: Where is the file located? Q: How to handle failures and data loss? Q: How to divide the computation? Q: How to program for scaling?
  • 5. History of Hadoop • Hadoop is developed under the Apache Software Foundation; its co-founders are Doug Cutting and Mike Cafarella. • Doug Cutting named it after his son's toy elephant. In October 2003, the first relevant paper, on the Google File System, was released. • In January 2006, MapReduce development started on Apache Nutch, which consisted of around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. • In April 2006, Hadoop 0.1.0 was released.
  • 7. Advantages and Disadvantages of Hadoop Advantages: • Ability to store a large amount of data. • High flexibility. • Cost effective. • High computational power. • Tasks are independent. • Linear scaling. Disadvantages: • Not very effective for small data. • Hard cluster management. • Has stability issues. • Security concerns.
  • 8. Hadoop Distributed File System • Hadoop has a distributed file system known as HDFS, which splits files into blocks and distributes them across the nodes of a large cluster. • If a node fails, the system keeps operating, and HDFS handles the data transfer between the remaining nodes.
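The block-splitting idea on this slide can be sketched in a few lines of Python. This is a toy illustration, not HDFS code: the function names, the tiny block size, and the round-robin placement policy are all assumptions made for the example. Real HDFS uses much larger blocks (128 MB by default in recent releases) and a rack-aware placement policy.

```python
# Toy illustration of HDFS-style block splitting and replica placement.
# All names and the round-robin policy are hypothetical, not HDFS APIs.

def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["node1", "node2", "node3"], replication=2)
```

Here a 10-byte "file" becomes three blocks, and each block is stored on two of the three nodes, so no single node holds the only copy of any block.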
  • 9. Comparing: RDBMS vs. Hadoop
       Aspect                Traditional RDBMS          Hadoop / MapReduce
       Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
       Access                Interactive and Batch      Batch (not interactive)
       Updates               Read / Write many times    Write once, Read many times
       Structure             Static Schema              Dynamic Schema
       Integrity             High (ACID)                Low
       Scaling               Nonlinear                  Linear
       Query Response Time   Can be near immediate      Has latency (due to batch processing)
  • 10. Advantages of HDFS: • It is inexpensive, immutable in nature, reliable in storing data, fault tolerant, scalable, and block structured, and it can process a large amount of data simultaneously, among other benefits. Disadvantages of HDFS: • Its biggest disadvantage is that it is not a good fit for small quantities of data. • It also has potential stability issues and can be restrictive and rough to work with.
  • 11. Some common frameworks of Hadoop • Hive: uses HiveQL for structuring data and for writing complex MapReduce jobs over HDFS. • Drill: supports user-defined functions and is used for data exploration. • Storm: allows real-time processing and streaming of data. • Spark: contains a machine learning library (MLlib) and is widely used for data processing. It supports Java, Python, and Scala. • Pig: provides Pig Latin, a SQL-like language, and performs data transformation of unstructured data. • Tez: reduces the complexity of Hive and Pig and helps their code run faster.
  • 12. Modules of Hadoop frameworks The Hadoop framework is made up of the following modules: 1. Hadoop MapReduce: a MapReduce programming model for handling and processing large data. 2. Hadoop Distributed File System: distributes files across the nodes of a cluster. 3. Hadoop YARN: a platform that manages computing resources. 4. Hadoop Common: contains packages and libraries used by the other modules.
  • 13. Features of Hadoop: Suitable for Big Data Analysis • Because Big Data tends to be distributed and unstructured, Hadoop clusters are well suited to analyzing it. • Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. • This is called data locality, and it helps increase the efficiency of Hadoop-based applications.
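The data locality idea can be illustrated with a small scheduling sketch. The function `assign_task` and its arguments are hypothetical names chosen for this example, and the policy is deliberately simplified; it only shows the preference for running a task where its block already lives.

```python
# Hypothetical sketch of locality-aware task assignment (not a Hadoop API).

def assign_task(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block; otherwise pick any free node.

    Returns (node, is_local). When is_local is True only the code moves;
    when False the block would have to be fetched over the network.
    """
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], True
    if not free_nodes:
        raise RuntimeError("no free nodes available")
    return sorted(free_nodes)[0], False


block_locations = {0: ["node1", "node2"], 1: ["node3"]}
free_nodes = {"node2", "node4"}
local_choice = assign_task(0, block_locations, free_nodes)
remote_choice = assign_task(1, block_locations, free_nodes)
```

Block 0 is processed on `node2`, which already holds a copy, while block 1 has no free local node and falls back to a remote one.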
  • 14. Scalability • Hadoop clusters can easily be scaled by adding cluster nodes, and thus allow for the growth of Big Data. • Also, scaling does not require modifications to application logic.
  • 15. Fault Tolerance • The Hadoop ecosystem can replicate the input data onto other cluster nodes. • That way, in the event of a cluster node failure, data processing can still proceed by using data stored on another cluster node.
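Replica-based failover can be sketched as follows. This is a minimal illustration under stated assumptions: `read_block`, the replica map, and the set of live nodes are hypothetical names, and a real cluster detects failures through heartbeats rather than a hand-maintained set.

```python
# Hypothetical sketch of reading a block from any surviving replica.

def read_block(block_id, replicas, live_nodes):
    """Return the first live node that holds the block; raise if every replica is lost."""
    for node in replicas[block_id]:
        if node in live_nodes:
            return node
    raise OSError(f"block {block_id} has no live replica")


replicas = {0: ["node1", "node2", "node3"]}
# node1 has failed, so the read is served by the next replica.
served_by = read_block(0, replicas, live_nodes={"node2", "node3"})
```

Processing stops only when all replicas of a block are unavailable, which is why a higher replication factor tolerates more simultaneous node failures.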
  • 16. • Scales easily to petabytes or more • Parallel data processing • Suited for particular types of big data problems
  • 17. Hadoop Analytics Tools • A wide range of analytical tools is available to help Hadoop deal efficiently with astronomically large data. • Let us discuss some of the most famous and widely used tools one by one. Below are the top 10 Hadoop analytics tools for big data.
  • 18. • Apache Spark is an open-source processing engine designed for ease of analytics operations. • It is a cluster computing platform designed to be fast and general purpose. • Spark is designed to cover batch applications, machine learning, streaming data processing, and interactive queries. Features of Spark: • In-memory processing • Tight integration of components • Easy and inexpensive • A powerful processing engine that makes it fast • Spark Streaming, a high-level library for stream processing
  • 19. • MapReduce is a programming model, comparable to an algorithm or a data structure, that runs on the YARN framework. • Its primary feature is distributed, parallel processing across a Hadoop cluster, which is what makes Hadoop fast: when dealing with Big Data, serial processing is no longer practical. Features of MapReduce: • Scalable • Fault tolerance • Parallel processing • Tunable replication • Load balancing
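The MapReduce model can be illustrated with the classic word count, written here in plain Python on a single machine. This is a sketch of the programming model only, not Hadoop's Java API: the function names are chosen for the example, and in a real cluster the map and reduce calls would run in parallel on different nodes.

```python
# Single-machine sketch of the map, shuffle, and reduce phases (word count).
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big clusters", "data locality"])
```

Because each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, the work can be divided across nodes without changing the result.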
  • 20. • Apache Hive is a data warehousing tool built on top of Hadoop; data warehousing means storing data generated from various sources in one fixed location. • Hive is one of the best tools for data analysis on Hadoop. • Anyone who knows SQL can comfortably use Apache Hive. • The query language of Hive is known as HQL or HiveQL. Features of Hive: • Queries are similar to SQL queries. • Hive supports different storage types such as HBase, ORC, and plain text. • Hive has built-in functions for data mining and other tasks. • Hive can operate on compressed data stored in the Hadoop ecosystem.
  • 21. • Apache Impala is an open-source SQL engine designed for Hadoop. • Impala overcomes the speed-related issues of Apache Hive with its faster processing. • Apache Impala uses SQL syntax, an ODBC driver, and a user interface similar to those of Apache Hive. • Apache Impala can easily be integrated with Hadoop for data analytics. Features of Impala: • Easy integration • Scalability • Security • In-memory data processing
  • 22. • The name Mahout is taken from the Hindi word Mahavat, which means elephant rider. • Apache Mahout runs its algorithms on top of Hadoop, hence the name. • Mahout is mainly used for implementing machine learning algorithms on Hadoop, such as classification, collaborative filtering, and recommendation. • Apache Mahout can also run machine learning algorithms without Hadoop integration. Features of Mahout: • Used for machine learning applications • Mahout has vector and matrix libraries • Ability to analyze large datasets quickly
  • 23. • Pig was initially developed by Yahoo to make programming easier. • Apache Pig can process extensive datasets because it works on top of Hadoop. • Apache Pig is used for analyzing massive datasets by representing them as data flows. • Apache Pig also raises the level of abstraction for processing enormous datasets. • Pig Latin is the scripting language developers use on the Pig framework, which runs on the Pig runtime. Features of Pig: • Easy to program • Rich set of operators • Ability to handle various kinds of data • Extensibility
  • 24. • HBase is a non-relational, distributed, column-oriented NoSQL database. HBase consists of tables, each of which has multiple data rows. • Each row has multiple column families, and each column family has columns that contain key-value pairs. • HBase works on top of HDFS (Hadoop Distributed File System). • We use HBase for looking up small amounts of data within massive datasets. Features of HBase: • Linear and modular scalability • A Java API that can easily be used for client access • Block cache for real-time data queries
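The row, column family, and column layout described on this slide can be modelled with nested dictionaries. This is only a sketch of the logical data model: `put` and `get` here are hypothetical helper functions, not the HBase client API, and the sample row key and values are invented. Real HBase also keeps timestamped versions of each cell, which this sketch omits.

```python
# Toy model of the HBase logical layout: row key -> column family -> column -> value.
# The helper names and sample data are hypothetical, not the HBase client API.

def put(table, row_key, family, qualifier, value):
    """Store one cell, creating the row and column family on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


def get(table, row_key, family, qualifier, default=None):
    """Look up one cell by row key, column family, and column name."""
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)


table = {}
put(table, "user42", "info", "name", "Asha")
put(table, "user42", "info", "city", "Chennai")
put(table, "user42", "stats", "logins", 7)
city = get(table, "user42", "info", "city")
```

Looking up a cell by row key goes straight to one row, which is the access pattern behind using HBase to find small pieces of data inside very large tables.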
  • 25. • Sqoop is a command-line tool developed by Apache. • The primary purpose of Apache Sqoop is to import structured data from an RDBMS (relational database management system) such as MySQL, SQL Server, or Oracle into HDFS (Hadoop Distributed File System). • Sqoop can also export data from HDFS back to an RDBMS. Features of Sqoop: • Can import data to Hive or HBase • Connecting to a database server • Controlling parallelism
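"Controlling parallelism" in a parallel import generally means dividing the source table among several workers. A minimal sketch of that idea, assuming a numeric key column with a known minimum and maximum, is shown below. The function `split_ranges` is a hypothetical name, and Sqoop's own splitting logic differs in detail.

```python
# Hypothetical sketch: divide a key range among parallel import workers.

def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous, near-equal ranges."""
    if hi < lo or num_mappers < 1:
        raise ValueError("need lo <= hi and at least one mapper")
    total = hi - lo + 1
    num = min(num_mappers, total)  # never create more ranges than keys
    base, extra = divmod(total, num)
    ranges, start = [], lo
    for i in range(num):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = split_ranges(1, 10, 4)
```

Each worker would then import only the rows whose key falls in its own range, so the ranges must cover every key exactly once.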
  • 26. • Tableau is data visualization software used for data analytics and business intelligence. • It provides a variety of interactive visualizations to showcase insights in the data, can translate queries into visualizations, and can import data of all ranges and sizes. • Tableau offers rapid analysis and processing, generating useful charts on interactive dashboards and worksheets. Features of Tableau: • Supports bar charts, histograms, pie charts, motion charts, bullet charts, Gantt charts, and many more • Secure and robust • Interactive dashboards and worksheets
  • 27. • Apache Storm is a free, open-source, distributed real-time computation system built using programming languages such as Clojure and Java. • It can be used with many programming languages. • Apache Storm is used for stream processing, which it performs very quickly. • Storm uses daemons such as Nimbus, ZooKeeper, and Supervisor. • Apache Storm can be used for real-time processing, online machine learning, and more. Companies such as Yahoo, Spotify, and Twitter use Apache Storm. Features of Storm: • Easy to operate • Each node can process millions of tuples per second • Scalable and fault tolerant
  • 29. Common Hadoop Distributions • Open Source: • Apache • Commercial: • Cloudera • Hortonworks • MapR • AWS MapReduce • Microsoft Azure HDInsight (Beta)