SlideShare une entreprise Scribd logo
1  sur  29
Télécharger pour lire hors ligne
Big Data and
Hadoop
PARKHE KISHOR B.
M. TECH. (INDUSTRIAL MATHEMATICS AND COMPUTER APPLICATIONS)
! Prerequisites
JAVA
! OOPs concepts
! Serialization
! Data Structures (Hash Map, Lists)
! FILE I/O
! UNIX Commands (mv, cp, ls etc, mkdir, ps, vi)
Development Environment
! Install jdk 1.6, jre 6, eclipse
! What is Big Data?
! Typically we work on excel sheet, ppt, word docs, code files. They are of the order 1-2Mb. Even
a movie is just 1–2 Gb size.
! The BIG DATA we want to deal with, is of the order of Petabytes.
! 10^12 times size of ordinary files.
What Happens in An Internet Minute?
Where is this Data ?
! This data is generated from multiple sources. The data that goes in the logs of google,
facebook, linkedin, yahoo servers is of billion users of all around the world.
! What are users accessing, how long the user remains in site. All the meta data sites visited,
friend's list, status. Torrent downloads In Every 5 minutes granularity google gets Petabytes
information in it's server logs. Same goes for facebook, yahoo, AT&T, Airtel.
Why need to understand data?
1. Analytics
2. Why we need analytics?
Use Case of analytics.
1. How effectively you do business
2. Cost cutting and leverage productivity
3. Google, Amazon, Ebay they get logs so that ads and products can be recommended to
customers.
e.g. Public transport
Data Categories
1. Structure Data
2. Unstructured Data
Challenges
Problem is not getting this big data. Problem is how to store, process and analyze
this data.








Case Study
Telecom Company
1. Airmobile (50 million subscribers) wants to sell it’s expensive $500 monthly plan
to it’s customers, for this it wants to find out its top subscribers and the total
bytes they have downloaded(Internet data) using it’s services in last one
month.
2. Also it wants to advertise it’s roaming plan of $100 so that subscribers don’t
switch to other networks, when going to other cities. For this it wants to find
out the minutes of usage (i.e., call duration) of top 10 thousand subscribers
who have roamed in last 1 month .
Issues
1. Different subscribers in different cities have different data plans. Almost all the
subscribers are active each day of month.
2. 2. Data collection is huge every minute almost 1 million people visit 5–6 sites.
3. 3. Every day tera bytes of information is collected in airmobile servers of
each city, which get discarded because of unavailability of storage.
Solution
1. Introduced by Google was GFS (Google file system) and Map Reduce.
2. Then Hadoop became open source and now is owned by apache.
3. Hadoop is used by Facebook, Yahoo, Google, Twitter, linkedin, Rackspace.
How is HADOOP the Solution?
1. Storage -> HDFS A distributed file system where commodity hardware can be
used to form clusters and store the huge data in distributed fashion. There is
no need for high end hardwares.
2. Process -> MAP Reduce Paradigm
3. Analyze -> Hive, Pig MapReduce.
4. It can easily scale to multiple nodes(1,500–2,000 nodes in a cluster), with just
configuration change.
Applications of HADOOP
1. Telecommunications -> To find out top subscribers for advertisement, find
peak traffic rate to install routers at right places, for cost cutting.
2. Recommendation systems -> Google Ads customized for all users.
3. Data warehousing -> to store data and analyze it e.g., categorize data into
http web or mobile, so that services by ISP can be customized accordingly.
4. Market Research and Forecasting -> Forecast subscribers, traffic based on
past data trend.
5. Finance, social networking -> To predict trends and gain profit.
What it is Not
1. Should be noted it's not OLAP (online analytical Processing) but batch /
offline oriented
2. It is not a database
Challenges
Can this data be stored in 1 machine? Hard drives are approximately 500Gb
in size. Even if you add external hard drives, you can't store the data in Peta
bytes. Let's say you add external hard drives and store this data, you wouldn't be
able to open or process that file because of insufficient RAM. And processing, it
would take months to analyze this data.
HDFS Features
1. Data is distributed over several machines, and replicated to ensure their
durability to failure and high availability to parallel applications
2. Designed for very large files (in GBs, TBs)
3. Block oriented
4. Unix like commands interface
5. Write once and read many times
6. Commodity hardware
7. Fault Tolerant when nodes fail
8. Scalable by adding new nodes
It is Not Designed for
1. Small files.
2. Multiple writes, arbitrary file modification -> Writes are always supported at
the end of the file, modifications can't be made at random offsets of files.
3. Low latency data access -> Since we are accessing huge amount of data it
comes at the expense of time taken to access the data.
Architecture
! Production
! Hadoop Component
Function of Name Node
1. Name Node is controller and manager of HDFS. It knows the status and the
metadata of all the files in HDFS.
2. Metadata is -> file names, permissions, and locations of each block of file
3. HDFS cluster can be accessed concurrently by multiple clients, even then this
metadata information is never desynchronized. Hence, all this information is
handled by a single machine.
4. Since metadata is typically small, all this info. is stored in main memory of
Name Node, allowing fast access to metadata.
Purpose of Secondary Name Node
1. It is not backup of name node nor data nodes connect to this. It is just a
helper of name node.
2. It only performs periodic checkpoints.
3. It communicates with name node and to take snapshots of HDFS metadata.
4. These snapshots help minimize downtime and loss of data.
HDFS Read
HDFS Write
Replica Placement

Contenu connexe

Tendances

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopRojaT4
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core conceptsMaryan Faryna
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irdatastack
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop IntroductionDzung Nguyen
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1RojaT4
 

Tendances (19)

Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache Hadoop
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core concepts
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.ir
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
 

En vedette

Jute rc
Jute rcJute rc
Jute rcGre Tp
 
Get expertise with mongo db
Get expertise with mongo dbGet expertise with mongo db
Get expertise with mongo dbAmit Thakkar
 
70 533 implementing microsoft azure infrastructure solutions Preparation Guide
70 533 implementing microsoft azure infrastructure solutions Preparation Guide70 533 implementing microsoft azure infrastructure solutions Preparation Guide
70 533 implementing microsoft azure infrastructure solutions Preparation GuideMahesh Dahal
 
Object oriented analysis and design
Object oriented analysis and designObject oriented analysis and design
Object oriented analysis and designPrem Aseem Jain
 
Die antenna help uitsaai. Artikel in Die Burger
Die antenna help uitsaai. Artikel in Die BurgerDie antenna help uitsaai. Artikel in Die Burger
Die antenna help uitsaai. Artikel in Die BurgerAndre Fourie
 
Introducing with MongoDB
Introducing with MongoDBIntroducing with MongoDB
Introducing with MongoDBMahbub Tito
 
Getfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanh
Getfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanhGetfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanh
Getfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanhHoang Nguyen
 
Hadoop Architecture
Hadoop Architecture Hadoop Architecture
Hadoop Architecture Ganesh B
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Big Data Spain
 

En vedette (16)

Jute rc
Jute rcJute rc
Jute rc
 
Redis
RedisRedis
Redis
 
Get expertise with mongo db
Get expertise with mongo dbGet expertise with mongo db
Get expertise with mongo db
 
Unit 1.2 demo
Unit 1.2 demoUnit 1.2 demo
Unit 1.2 demo
 
70 533 implementing microsoft azure infrastructure solutions Preparation Guide
70 533 implementing microsoft azure infrastructure solutions Preparation Guide70 533 implementing microsoft azure infrastructure solutions Preparation Guide
70 533 implementing microsoft azure infrastructure solutions Preparation Guide
 
AWS Certified Solutions Architect
AWS Certified Solutions ArchitectAWS Certified Solutions Architect
AWS Certified Solutions Architect
 
Object oriented analysis and design
Object oriented analysis and designObject oriented analysis and design
Object oriented analysis and design
 
Die antenna help uitsaai. Artikel in Die Burger
Die antenna help uitsaai. Artikel in Die BurgerDie antenna help uitsaai. Artikel in Die Burger
Die antenna help uitsaai. Artikel in Die Burger
 
Introducing with MongoDB
Introducing with MongoDBIntroducing with MongoDB
Introducing with MongoDB
 
CVE
CVECVE
CVE
 
Getfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanh
Getfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanhGetfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanh
Getfly crm: Giải pháp quản lý khách hàng & Phát triển phòng kinh doanh
 
Avro
AvroAvro
Avro
 
Hadoop Architecture
Hadoop Architecture Hadoop Architecture
Hadoop Architecture
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
MongoDB
MongoDBMongoDB
MongoDB
 
Indexing In MongoDB
Indexing In MongoDBIndexing In MongoDB
Indexing In MongoDB
 

Similaire à Big data and hadoop

New Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile AgentNew Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile AgentMohammed Adam
 
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce ijujournal
 
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEHMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEijujournal
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training reportSarvesh Meena
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterShivraj Raj
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 

Similaire à Big data and hadoop (20)

New Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile AgentNew Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile Agent
 
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
 
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEHMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
final report
final reportfinal report
final report
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop cluster
 
paper
paperpaper
paper
 
Big Data
Big DataBig Data
Big Data
 
Big data
Big dataBig data
Big data
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Big data and hadoop

  • 1. Big Data and Hadoop PARKHE KISHOR B. M. TECH. (INDUSTRIAL MATHEMATICS AND COMPUTER APPLICATIONS)
  • 2. ! Prerequisites JAVA ! OOPs concepts ! Serialization ! Data Structures (Hash Map, Lists) ! FILE I/O ! UNIX Commands (mv, cp, ls etc, mkdir, ps, vi)
  • 3. Development Environment ! Install jdk 1.6, jre 6, eclipse
  • 4. ! What is Big Data? ! Typically we work on excel sheet, ppt, word docs, code files. They are of the order 1-2Mb. Even a movie is just 1–2 Gb size.
  • 5. ! The BIG DATA we want to deal with, is of the order of Petabytes. ! 10^12 times size of ordinary files.
  • 6. What Happens in An Internet Minute?
  • 7. Where is this Data ? ! This data is generated from multiple sources. The data that goes in the logs of google, facebook, linkedin, yahoo servers is of billion users of all around the world. ! What are users accessing, how long the user remains in site. All the meta data sites visited, friend's list, status. Torrent downloads In Every 5 minutes granularity google gets Petabytes information in it's server logs. Same goes for facebook, yahoo, AT&T, Airtel. Why need to understand data? 1. Analytics 2. Why we need analytics?
  • 8. Use Case of analytics. 1. How effectively you do business 2. Cost cutting and leverage productivity 3. Google, Amazon, Ebay they get logs so that ads and products can be recommended to customers. e.g. Public transport
  • 9. Data Categories 1. Structure Data 2. Unstructured Data
  • 10. Challenges Problem is not getting this big data. Problem is how to store, process and analyze this data.
  • 12. Telecom Company 1. Airmobile (50 million subscribers) wants to sell it’s expensive $500 monthly plan to it’s customers, for this it wants to find out its top subscribers and the total bytes they have downloaded(Internet data) using it’s services in last one month. 2. Also it wants to advertise it’s roaming plan of $100 so that subscribers don’t switch to other networks, when going to other cities. For this it wants to find out the minutes of usage (i.e., call duration) of top 10 thousand subscribers who have roamed in last 1 month .
  • 13. Issues 1. Different subscribers in different cities have different data plans. Almost all the subscribers are active each day of month. 2. 2. Data collection is huge every minute almost 1 million people visit 5–6 sites. 3. 3. Every day tera bytes of information is collected in airmobile servers of each city, which get discarded because of unavailability of storage.
  • 14. Solution 1. Introduced by Google was GFS (Google file system) and Map Reduce. 2. Then Hadoop became open source and now is owned by apache. 3. Hadoop is used by Facebook, Yahoo, Google, Twitter, linkedin, Rackspace.
  • 15. How is HADOOP the Solution? 1. Storage -> HDFS A distributed file system where commodity hardware can be used to form clusters and store the huge data in distributed fashion. There is no need for high end hardwares. 2. Process -> MAP Reduce Paradigm 3. Analyze -> Hive, Pig MapReduce. 4. It can easily scale to multiple nodes(1,500–2,000 nodes in a cluster), with just configuration change.
  • 16.
  • 17. Applications of HADOOP 1. Telecommunications -> To find out top subscribers for advertisement, find peak traffic rate to install routers at right places, for cost cutting. 2. Recommendation systems -> Google Ads customized for all users. 3. Data warehousing -> to store data and analyze it e.g., categorize data into http web or mobile, so that services by ISP can be customized accordingly. 4. Market Research and Forecasting -> Forecast subscribers, traffic based on past data trend. 5. Finance, social networking -> To predict trends and gain profit.
  • 18. What it is Not 1. Should be noted it's not OLAP (online analytical Processing) but batch / offline oriented 2. It is not a database
  • 19. Challenges Can this data be stored in 1 machine? Hard drives are approximately 500Gb in size. Even if you add external hard drives, you can't store the data in Peta bytes. Let's say you add external hard drives and store this data, you wouldn't be able to open or process that file because of insufficient RAM. And processing, it would take months to analyze this data.
  • 20. HDFS Features 1. Data is distributed over several machines, and replicated to ensure their durability to failure and high availability to parallel applications 2. Designed for very large files (in GBs, TBs) 3. Block oriented 4. Unix like commands interface 5. Write once and read many times 6. Commodity hardware 7. Fault Tolerant when nodes fail 8. Scalable by adding new nodes
  • 21. It is Not Designed for 1. Small files. 2. Multiple writes, arbitrary file modification -> Writes are always supported at the end of the file, modifications can't be made at random offsets of files. 3. Low latency data access -> Since we are accessing huge amount of data it comes at the expense of time taken to access the data.
  • 25. Function of Name Node 1. Name Node is controller and manager of HDFS. It knows the status and the metadata of all the files in HDFS. 2. Metadata is -> file names, permissions, and locations of each block of file 3. HDFS cluster can be accessed concurrently by multiple clients, even then this metadata information is never desynchronized. Hence, all this information is handled by a single machine. 4. Since metadata is typically small, all this info. is stored in main memory of Name Node, allowing fast access to metadata.
  • 26. Purpose of Secondary Name Node 1. It is not backup of name node nor data nodes connect to this. It is just a helper of name node. 2. It only performs periodic checkpoints. 3. It communicates with name node and to take snapshots of HDFS metadata. 4. These snapshots help minimize downtime and loss of data.