BIG DATA and Hadoop
By
Chandra Sekhar
Contents

Introduction to Big Data

What is Hadoop?

What Hadoop is and is not used for

Top-level Hadoop projects

Differences between RDBMS and HBase

Facebook server model
Big Data: The Data Age

Big data is a collection of datasets so large and
complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.

The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.

The data that companies generate has inherent value and
can be used for a range of analytics and prediction
use cases.
A new approach

Moore's Law has held for the past 40 years:
1) Processing power doubles roughly every two years.
2) Processing speed is no longer the problem.

Getting the data to the processors has become the bottleneck.
Transferring 100 GB of data takes about 22 minutes at a disk
transfer rate of 75 MB/s.

So the new approach is to move the processing to where the data
lives, in a distributed way, while satisfying requirements such as
data recoverability, component recovery, consistency, reliability,
and scalability.

The answer came from Google's File System (GFS) and
MapReduce; Hadoop provides open-source counterparts in
HDFS and Hadoop MapReduce.
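The 22-minute figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB); the function names are made up for illustration, and the idea of dividing the work evenly across disks is an idealization that ignores coordination overhead.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read size_gb gigabytes from one disk (1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 60


def parallel_transfer_minutes(size_gb: float, rate_mb_per_s: float, disks: int) -> float:
    """Idealized time when the data is spread evenly over several disks read in parallel."""
    return transfer_minutes(size_gb, rate_mb_per_s) / disks


# One disk at 75 MB/s, as on the slide.
print(round(transfer_minutes(100, 75), 1))  # 22.2

# The same 100 GB spread over 100 disks (hypothetical cluster size), in seconds.
print(round(parallel_transfer_minutes(100, 75, 100) * 60, 1))
```

Reading from many disks at once is the reason for moving the computation to the nodes that already hold the data.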
What Hadoop is used for

Hadoop is recommended to coexist with your RDBMS as a
data warehouse.

It is not a replacement for an RDBMS.

Processing terabytes or petabytes of data can take hours with
traditional methods; with Hadoop and its ecosystem, the same
work can take a few minutes thanks to distributed
processing.

Many related tools integrate with Hadoop:

Data analysis

Data visualization

Database integration

Workflow management

Cluster management
➲ Distributed file system and parallel processing for large-scale
data operations, using HDFS and MapReduce.
➲ Plus the infrastructure needed to make them work, including:

Filesystem utilities

Job scheduling and monitoring

Web UI

Many other projects have grown up around the core
components of Hadoop: Pig, Hive, HBase, Flume, Oozie,
Sqoop, and others. Together they are called the ecosystem.

A set of machines running HDFS and MapReduce is known
as a Hadoop cluster.

Individual machines are known as nodes. A cluster
can have as few as one node or as many as several
thousand; it scales horizontally.

More nodes = better performance!
Hadoop and EcoSystem
Hadoop Components

HDFS and MapReduce: the
core

ZooKeeper: coordination and administration

Hive, Pig: SQL and scripting
on top of MapReduce

HBase: NoSQL datastore

Sqoop: imports data from and exports
data to an RDBMS

Avro: data serialization with schemas
defined in JSON; used for storing
metadata
Hadoop Components: HDFS

HDFS, the Hadoop Distributed File System, is responsible for
storing data on the cluster. It runs on top of a native file system
such as ext3, ext4, or XFS.

HDFS is a file system designed for storing very large files with
streaming data access (write once, read many times), running on
clusters of commodity hardware.

Data is split into blocks and distributed across multiple nodes in the
cluster.

Each block is typically 64 MB or 128 MB in size.

Each block is replicated multiple times.

The default is to replicate each block three times.

Replicas are stored on different nodes.

This ensures both reliability and availability.
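The block and replica rules above can be sketched in a few lines. This is a toy model, not HDFS code: the round-robin placement and the node names are assumptions made for illustration, whereas real HDFS placement also takes racks and node load into account.

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> list[int]:
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Toy round-robin placement: every replica of a block lands on a different node."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(300)  # a 300 MB file with 128 MB blocks
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because each block lives on three different nodes, losing any single node still leaves at least two copies of every block, which is where the reliability and availability come from.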
HDFS and MapReduce

NameNode (master)

SecondaryNameNode

Master failover node

DataNodes (slave nodes)

JobTracker

Jobs

TaskTracker

Tasks

Mapper

Reducer

Combiner

Partitioner
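To show how the Mapper, Combiner, Partitioner, and Reducer roles listed above fit together, here is a minimal single-process word count. It is only a sketch of the data flow, not Hadoop's Java API: the function names are hypothetical, and the byte-sum partitioner stands in for Hadoop's default of hashing the key.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate counts on the map side to reduce data sent over the network."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partitioner(key, num_reducers):
    """Decide which reducer receives a key (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum all partial counts for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    """Run map and combine per input split, shuffle by partition, then reduce."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # each split plays the role of one map task
        mapped = [pair for line in split for pair in mapper(line)]
        for key, value in combiner(mapped):
            partitions[partitioner(key, num_reducers)][key].append(value)
    result = {}
    for partition in partitions:  # each partition plays the role of one reduce task
        for key in sorted(partition):
            word, total = reducer(key, partition[key])
            result[word] = total
    return result


print(run_job([["big data big"], ["data hadoop"]]))
```

In a real cluster the JobTracker would schedule the map and reduce tasks onto TaskTrackers running on different nodes; here everything runs in one loop.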
HDFS and Nodes
Architecture
MapReduce
HDFS Access
• WebHDFS: REST API
• FUSE-DFS: mounting HDFS as a normal drive
• Direct access: direct HDFS access
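As an illustration of the REST option, WebHDFS requests are plain HTTP URLs of the form /webhdfs/v1/<path>?op=<OPERATION>. The helper below only builds such URLs and makes no network call; the host name is a placeholder, and the port is an assumption (9870 is the usual NameNode HTTP port in Hadoop 3, 50070 in older releases).

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List a directory (hypothetical host and path).
print(webhdfs_url("namenode.example.com", 9870, "/user/demo/logs", "LISTSTATUS"))

# Read a file starting at byte 0.
print(webhdfs_url("namenode.example.com", 9870, "/user/demo/a b.txt", "OPEN", offset=0))
```

The resulting URL can be fetched with any HTTP client, which is what makes WebHDFS convenient from languages without a native HDFS library.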
Hive and Pig

Hive provides a powerful SQL-like language. It does not
support full SQL, but it can be used to perform
joins on top of datasets in HDFS.

It is used for large batch processing. Behind the
scenes, Hive runs its queries as MapReduce jobs.

Pig is a powerful scripting platform built
on top of MapReduce jobs; its
language is called Pig Latin.
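One way a join can be expressed as a MapReduce job is the reduce-side join: tag every record with its source table, group by the join key, and pair the records up in the reducer. The sketch below shows that idea in plain Python; it is not Hive's actual planner, and the table contents and function names are invented for illustration.

```python
from collections import defaultdict
from itertools import product


def map_tagged(table_name, rows, key_field):
    """Map phase: emit (join key, (source table, row)) for every row."""
    for row in rows:
        yield row[key_field], (table_name, row)


def reduce_join(tagged_rows):
    """Reduce phase: pair every left row with every right row sharing the key."""
    lefts = [row for tag, row in tagged_rows if tag == "L"]
    rights = [row for tag, row in tagged_rows if tag == "R"]
    for left, right in product(lefts, rights):
        yield {**left, **right}


def inner_join(left_rows, right_rows, key_field):
    """Simulate the shuffle by grouping tagged rows by key, then reduce each group."""
    groups = defaultdict(list)
    for key, tagged in map_tagged("L", left_rows, key_field):
        groups[key].append(tagged)
    for key, tagged in map_tagged("R", right_rows, key_field):
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key]))
    return joined


users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
clicks = [{"uid": 1, "page": "/home"}, {"uid": 1, "page": "/cart"}, {"uid": 3, "page": "/x"}]
print(inner_join(users, clicks, "uid"))
```

Writing a query in Hive or a script in Pig Latin saves the author from hand-coding this kind of job.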
HBase

Among the most capable NoSQL databases for very large tables.

Supports master-master (active-active) replication
between clusters and is based on Google's Bigtable.

Supports columns and column families; its data
model can hold many billions of rows and many
millions of columns.

An excellent architectural masterpiece as far
as scalability is concerned.

A NoSQL database that supports atomic row-level
operations and very fast reads and writes, reaching
millions of operations per second on large clusters.
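The column-family data model can be pictured as a nested map: row key, then family:qualifier, then timestamped versions of a value. The class below is a toy in-memory model of that layout, not the HBase client API; the class and method names are assumptions chosen to echo HBase's put/get vocabulary.

```python
from collections import defaultdict


class ToyTable:
    """In-memory sketch of a column-family table with versioned cells."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.families = set(column_families)
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, family, qualifier, value, timestamp):
        """Store one cell version; qualifiers can be added freely per row."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"].append((timestamp, value))

    def get(self, row_key, family, qualifier):
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self.rows.get(row_key, {}).get(f"{family}:{qualifier}", [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


table = ToyTable(["info", "stats"])
table.put("user1", "info", "name", "Ann", timestamp=1)
table.put("user1", "info", "name", "Anna", timestamp=2)
table.put("user1", "stats", "logins", 42, timestamp=1)
print(table.get("user1", "info", "name"))  # Anna
```

Since a row only stores the cells it actually has, a table can be very wide without every row paying for every column.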
HBase, continued

HBase Master

Region Servers

ZooKeeper

HDFS
ZooKeeper, Mahout

ZooKeeper is a distributed coordination service and can
be used as an independent package for managing any
set of distributed servers.

Mahout is a machine learning library for
various data science techniques,
e.g. data clustering, classification, and
recommender systems, using supervised
and unsupervised learning.
Flume

Flume is a real-time data collection mechanism
that writes to a data mart.

Flume can move large volumes of streaming
data into HDFS, where it is used for further
analysis.

Apart from this, real-time analysis of web
log data is also possible with Flume.

For example, the logs of a group of web servers
can be written to HDFS using Flume.
Sqoop and Oozie

Sqoop is a mechanism for importing data
from an RDBMS into HDFS or Hive, and for exporting it back.

Various vendors provide free connectors
for different RDBMSs. These make data
transfer very fast, because Sqoop supports
parallel transfers.

Oozie is a workflow mechanism for executing
a large sequence of MapReduce jobs, Hive or
Pig jobs, HBase jobs, and any other Java
programs. Oozie also has an email job which
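The parallel transfer mentioned for Sqoop on this slide rests on a simple idea: split the range of a numeric column into contiguous chunks and let each parallel task copy one chunk. Sqoop exposes this through its --split-by and --num-mappers options; the function below is only a sketch of the range arithmetic under that assumption, not Sqoop's own code.

```python
def split_ranges(min_id: int, max_id: int, num_mappers: int) -> list[tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into contiguous inclusive chunks."""
    if max_id < min_id or num_mappers < 1:
        raise ValueError("need min_id <= max_id and at least one mapper")
    total = max_id - min_id + 1
    chunks = min(num_mappers, total)  # never create empty chunks
    base, extra = divmod(total, chunks)
    ranges, start = [], min_id
    for i in range(chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each tuple would become one "WHERE id BETWEEN lo AND hi" query run by one task.
print(split_ranges(1, 100, 4))
```

With evenly distributed keys, each task moves roughly the same number of rows, so adding tasks shortens the transfer until the database itself becomes the limit.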
RDBMS vs HBASE
A typical RDBMS scaling story runs this way:

Initial public launch.

Service becomes popular; too many reads hitting the database.

Service continues to grow in popularity; too many writes hitting
the database.

New features increase query complexity; now we have too
many joins.

Rising popularity swamps the server; things are too slow.

Some queries are still too slow.

Reads are OK, but writes are getting slower and slower.
With HBase
Enter HBase, which has the following characteristics:

No real indexes.

Automatic partitioning/sharding

Scales linearly and automatically with new nodes

Commodity hardware

Fault tolerance

Batch processing
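Automatic partitioning works because rows are kept sorted by row key and the key space is cut into contiguous ranges (regions), each served by one node. The sketch below looks up the region for a key with a binary search; the split keys and server names are made up, and assigning regions to servers round-robin is a simplification of what a real master does.

```python
import bisect


class ToyRegionMap:
    """Map row keys to regions defined by sorted split keys."""

    def __init__(self, split_keys, servers):
        # n split keys define n + 1 regions; each region covers [start key, end key).
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)

    def region_for(self, row_key):
        """Index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def server_for(self, row_key):
        """Server hosting the region (toy round-robin assignment)."""
        return self.servers[self.region_for(row_key) % len(self.servers)]


regions = ToyRegionMap(["g", "n", "t"], ["rs1", "rs2"])
for key in ["apple", "hadoop", "n", "zoo"]:
    print(key, regions.region_for(key), regions.server_for(key))
```

Adding a server only changes which regions it hosts, not how keys map to regions, which is why capacity can grow by adding nodes.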
Facebook Server Architecture
