Hadoop
in a
Nutshell
Siva Pandeti
Cloudera Certified Developer for Apache Hadoop (CCDH)
Overview
Why
Hadoop?
What is
Hadoop?
How to
Hadoop?
Examples
Data Growth
What is Big Data?
Hadoop usage
Components
No SQL
Cluster
Vendors
Tool Comparison
Typical Implementation
Data Analysis with Pig & Hive
Opportunities
Map Reduce deep dive
Wordcount
Search index
Recommendation Engine
Why
Hadoop?
Data Growth
OLTP
Databases for
Operations
Throw away
historical data
Relational
Oracle, DB2
OLAP
Data warehouses for
analytics
Cheaper centralized
storage -> Data
warehouses
(ETL tools)
Relational/MPP
appliances
< few hundred TB
Big Data
Data explosion
(social media, etc)
Petabyte scale
Network speeds
haven’t increased
Need Data Locality
Distributed
processing on
commodity
hardware
(Hadoop)
Non-relational
Big Data
What is Big Data?
Volume
Petabyte scale
Variety
Structured
Semi-structured
Unstructured
Velocity
Social
Sensor
Throughput
Veracity
Unclean
Imprecise
Unclear
Where is Hadoop Used?
Industry: Use Cases
Technology: Search, People you may know, Movie recommendations
Banks: Fraud detection, Regulatory, Risk management
Media, Retail: Marketing analytics, Customer service, Product recommendations
Manufacturing: Preventive maintenance
What is
Hadoop?
What is Hadoop?
Hadoop: open source distributed computing framework for storage and processing
HDFS
Distributed Storage
Economical: commodity hardware
Scalable: rebalances data on new nodes
Fault Tolerant: detects faults & auto recovers
Reliable: maintains multiple copies of data
High throughput: because data is distributed
MapReduce
Distributed Processing
Data Locality: process where the data resides
Fault Tolerant: auto-recover job failures
Scalable: add nodes to increase parallelism
Economical: commodity hardware
NoSQL DBs - HBase
• Unlike RDBMS:
o De-normalized
o No secondary indexes
o No transactions
• Modeled after Google’s Bigtable
• Random, real-time read/write access to Big Data
• Billions of rows x millions of columns
• Commodity hardware
• Open source, distributed, versioned, column oriented
• Integrates with MapReduce; has Java/REST APIs
• Automatic sharding
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
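As a rough illustration of the "versioned, column oriented" data model listed above, the sketch below keeps cells in nested dictionaries keyed by row, then column, then timestamp. `ToyTable`, `put`, and `get` are hypothetical names invented for this sketch; this is not the HBase client API, and real HBase stores data on HDFS rather than in memory.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a sparse, versioned, column-oriented table.

    Hypothetical class for illustration only; it is not the HBase API.
    """

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store one cell version and drop versions beyond max_versions."""
        versions = self._rows[row][column]
        versions[timestamp] = value
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


if __name__ == "__main__":
    table = ToyTable()
    table.put("user1", "info:name", "Ann", timestamp=1)
    table.put("user1", "info:name", "Anna", timestamp=2)
    print(table.get("user1", "info:name"))     # latest version
    print(table.get("user1", "info:name", 1))  # value as of timestamp 1
```

Only cells that were written take up space, which is one way to picture how a table can have millions of possible columns while most rows use a few.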
How Does Hadoop Work?
Cluster
Master Node: Job Tracker, Name Node
Slave Nodes (three shown): each runs a Task Tracker and a Data Node
Vendors
Hadoop Distributors: Apache Hadoop, Cloudera, HortonWorks, MapR, EMR
ETL/BI Connectors: Pentaho, Informatica, Talend, Clover, Microstrategy, Tableau, SAS, Ab Initio
Comparison
Cost
Traditional ETL/BI: Expensive license, expensive hardware
Hadoop: Open source, cheap commodity hardware
Volume
Traditional ETL/BI: < 100 TB, central storage
Hadoop: Petabyte scale, distributed storage
Speed
Traditional ETL/BI: Quick response for processing small data; not as fast on large data
Hadoop: Even the smallest job takes 15 seconds; super fast on large data
Throughput
Traditional ETL/BI: Thousands of reads/writes per minute
Hadoop: Millions of reads/writes per minute
How to
Hadoop?
Hadoop Implementation
Ingest: RDBMS, data feeds, and files are loaded into Hadoop (HDFS) via Flume, Sqoop, Put/Get, ETL tools
Process: MapReduce, Pig, Hive, Mahout
Output: Reports, Machine Learning, Analytics, Visualization, SAS, R
Data Analysis: Pig & Hive
Both are abstractions on top of MapReduce and generate MapReduce jobs in the backend. Useful for analysts who are not programmers.
Pig
Data flow language
No schema
Better with less structured data
Example
LOAD 'file' USING PigStorage('\t') AS (id, name);
FILTER
FOREACH
GROUP
ORDER
STORE
Hive
SQL-like language
Supports schemas, tables, and joins; table metadata is stored in a metastore.
Example
CREATE TABLE customer (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT * FROM customer WHERE id < 100 LIMIT 10;
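To make the Hive example concrete, here is a plain-Python sketch of what a query that filters on `id < 100` with a limit of 10 computes over tab-delimited rows. `select_customers` is a hypothetical helper written for this illustration; Hive itself compiles the query into MapReduce jobs rather than running code like this.

```python
import csv
import io
from itertools import islice


def select_customers(lines, max_id=100, limit=10):
    """Return up to `limit` (id, name) rows whose id is below `max_id`."""
    reader = csv.reader(lines, delimiter="\t")
    # Parse each tab-delimited row into (int id, name); skip short rows.
    rows = ((int(r[0]), r[1]) for r in reader if len(r) >= 2)
    matching = (row for row in rows if row[0] < max_id)
    return list(islice(matching, limit))


if __name__ == "__main__":
    sample = io.StringIO("1\tAnn\n250\tBob\n42\tCara\n")
    print(select_customers(sample))
```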
MapReduce
Source: http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
Examples
Word count - Java
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create driver
o Set configuration variables, mapper and reducer class names
• Create mapper
o Read input and emit key value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar WordCount.jar WordCount input output
• Analyze output
o hadoop fs -cat output/* | head
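The driver, mapper, and reducer roles in the steps above can be sketched in memory. This is a toy model in Python, not the Hadoop Java API: `run_job`, `wc_mapper`, and `wc_reducer` are hypothetical names, and the grouping step stands in for the shuffle that Hadoop performs between map and reduce across the cluster.

```python
from collections import defaultdict


def wc_mapper(_key, line):
    """Map: read one input line and emit a (word, 1) pair per word."""
    for word in line.lower().split():
        yield word, 1


def wc_reducer(word, counts):
    """Reduce: aggregate all values for a particular key."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Toy driver: run map, group values by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog"]
    # Input records are (line number, line text) pairs.
    print(run_job(enumerate(lines), wc_mapper, wc_reducer))
```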
Word count - Streaming
• Hadoop is written in Java. I don’t know Java. What
do I do?
o Hadoop Streaming (Python, Ruby, R, etc)
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create mapper
o Read input stream (stdin) and emit (print) key value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py -input input -output output
• Analyze output
o hadoop fs -cat output/* | head
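A minimal sketch of the mapper and reducer logic for the streaming steps above, assuming tab-separated key/value lines. Both functions are shown in one file for brevity; in practice each would live in its own script (mapper.py and reducer.py, as in the command above). The function names and the `map`/`reduce` command-line argument are assumptions of this sketch.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' line per word read from input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: sum the counts for each word.

    Assumes the input is already sorted by key, which is how
    Hadoop Streaming delivers mapper output to the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    # Pick the step from the (hypothetical) command-line argument.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same logic can be checked locally without a cluster by piping the map step through `sort` into the reduce step, since `sort` plays the role of the shuffle.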
Hadoop for R
Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
setwd("/home/istvan/rhadoop/blogs/")
gdp <- read.csv("GDP_converted.csv")
head(gdp)
hdfs.init()
gdp.values <- to.dfs(gdp)
# AAPL revenue in 2012 in millions USD
aaplRevenue = 156508
gdp.map.fn <- function(k,v) {
key <- ifelse(v[4] < aaplRevenue, "less", "greater")
keyval(key, 1)
}
count.reduce.fn <- function(k,v) {
keyval(k, length(v))
}
count <- mapreduce(input=gdp.values,
map = gdp.map.fn,
reduce = count.reduce.fn)
from.dfs(count)
• RHadoop packages
o rmr (rmr2 in the example)
o rhdfs
o rhbase
• Uses Hadoop Streaming
• The example above counts how many countries have a GDP greater or less than Apple's 2012 revenue
Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
Search index example
• Crawl web
o Crawl and save websites to local directory
• Ingest files to HDFS
• Map
o Split the words & associate words with file names
• Reduce
o Build an index with words and files & count of occurrences
• Search
o Pass the word to the index to get the files it shows up in. Display the file
listing in descending order of number of occurrences of the word in a file
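The map, reduce, and search steps above can be sketched on a single machine as follows. The function names and the sample file names are hypothetical, and a real job would read the crawled files from HDFS instead of an in-memory dictionary.

```python
from collections import Counter, defaultdict


def map_file(filename, text):
    """Map: split the text into words and pair each word with its file name."""
    for word in text.lower().split():
        yield word, filename


def build_index(files):
    """Reduce: for each word, count its occurrences per file."""
    index = defaultdict(Counter)
    for filename, text in files.items():
        for word, name in map_file(filename, text):
            index[word][name] += 1
    return index


def search(index, word):
    """Return (file, count) pairs for a word, most occurrences first."""
    return index.get(word.lower(), Counter()).most_common()


if __name__ == "__main__":
    docs = {"a.html": "hadoop stores data hadoop scales", "b.html": "hadoop"}
    print(search(build_index(docs), "Hadoop"))
```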
Recommender example
• Use web server logs with user ratings info for items
• Create Hive tables to build structure on top of this
log data
• Generate Mahout specific csv input file
(user, item, rating)
• Run Mahout to build item recommendations for
users
o mahout recommenditembased \
--input /user/hive/warehouse/mahout_input \
--output recommendations \
-s SIMILARITY_PEARSON_CORRELATION -n 20
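To show the idea behind item-based recommendation with Pearson correlation, here is a small in-memory sketch that takes the same (user, item, rating) triples. It is a simplified illustration, not Mahout's implementation: the function names, the choice to ignore non-positive similarities, and the weighting scheme are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def item_similarity(by_item, a, b):
    """Correlate two items over the users who rated both."""
    common = sorted(set(by_item[a]) & set(by_item[b]))
    return pearson([by_item[a][u] for u in common],
                   [by_item[b][u] for u in common])


def recommend(triples, user, top_n=20):
    """Rank items the user has not rated by similarity-weighted ratings."""
    by_item = defaultdict(dict)
    for u, item, rating in triples:
        by_item[item][u] = float(rating)
    rated = {i: r[user] for i, r in by_item.items() if user in r}
    scores = {}
    for candidate in by_item:
        if candidate in rated:
            continue
        sims = {i: item_similarity(by_item, candidate, i) for i in rated}
        # Assumption of this sketch: only positively correlated items count.
        weight = sum(s for s in sims.values() if s > 0)
        if weight > 0:
            total = sum(s * rated[i] for i, s in sims.items() if s > 0)
            scores[candidate] = total / weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    ratings = [(1, "A", 5), (1, "B", 4), (2, "A", 1), (2, "B", 2),
               (2, "C", 1), (3, "A", 4), (3, "B", 5), (3, "C", 5)]
    print(recommend(ratings, user=1))
```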
Recap
Why
Hadoop?
What is
Hadoop?
How to
Hadoop?
Demo
Data Growth
What is Big Data?
Hadoop usage
Components
No SQL
Cluster
Vendors
Tool Comparison
Typical Implementation
Data Analysis with Pig & Hive
Opportunities
Map Reduce deep dive
Wordcount
Search index
Recommendation Engine
Q & A
Contact Siva Pandeti:
Email: siva@pandeti.com
LinkedIn: www.linkedin.com/in/SivaPandeti
Twitter: @SivaPandeti
http://pandeti.com/blog

Hadoop overview

  • 1. Hadoop in a Nutshell Siva Pandeti Cloudera Certified Developer for Apache Hadoop (CCDH)
  • 2. Overview Why Hadoop? What is Hadoop? How to Hadoop? Examples Data Growth What is Big Data? Hadoop usage Components No SQL Cluster Vendors Tool Comparison Typical Implementation Data Analysis with Pig & Hive Opportunities Map Reduce deep dive Wordcount Search index Recommendation Engine
  • 4. Data Growth OLTP Databases for Operations Throw away historical data Relational Oracle, DB2 OLAP Data warehouses for analytics Cheaper centralized storage -> Data warehouses (ETL tools) Relational/MPP appliances < few hundred TB Big Data Data explosion (social media, etc) Petabyte scale Network speeds haven’t increased Need Data Locality Distributed processing on commodity hardware (Hadoop) Non-relational
  • 5. Big Data What is Big Data? Volume Petabyte scale Variety Structured Semi-structured Unstructured Velocity Social Sensor Throughput Veracity Unclean Imprecise Unclear
  • 6. Where is Hadoop Used? Industry Technology Use Cases Search People you may know Movie recommendations Banks Fraud Detection Regulatory Risk management Media Retail Marketing analytics Customer service Product recommendations Manufacturing Preventive maintenance
  • 8. Hadoop HDFS Distributed Storage Economical: commodity hardware Scalable: rebalances data on new nodes Fault Tolerant: detects faults & auto recovers Reliable: maintains multiple copies of data High throughput: because data is distributed Open source distributed computing framework for storage and processing What is Hadoop? MapReduce Distributed Processing Data Locality: process where the data resides Fault Tolerant: auto-recover job failures Scalable: add nodes to increase parallelism Economical: commodity hardware
  • 9. • Unlike RDBMS: o De-normalized o No secondary indexes o No transactions • Modeled after Google’s Big Table • Random real time read/write access to Big Data • Billions of rows x millions of columns • Commodity hardware • Open source, distributed, versioned, column oriented • Integrates with MapReduce; Has Java/REST APIs • Automatic sharding NoSQL DBs - HBase Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
  • 10. Master Node Slave Node Slave Node Slave Node Job Tracker Task Tracker Task Tracker Task Tracker Name Node Data Node Data Node Data Node Cluster How Does Hadoop Work?
  • 11. Vendors Apache Hadoop Cloudera HortonWorks MapR Pentaho Informatica Talend Clover EMR ETL/BI Connectors Hadoop Distributors Microstrategy Tableau SASAbInitio
  • 12. Comparison: Traditional ETL/BI vs. Hadoop. Cost: expensive license and expensive hardware vs. open source and cheap commodity hardware. Volume: < 100 TB, central storage vs. petabyte scale, distributed storage. Speed: quick response for processing small data but not as fast on large data vs. even the smallest job takes 15 seconds but super fast on large data. Throughput: thousands of reads/writes per minute vs. millions of reads/writes per minute.
  • 14. Hadoop Implementation. Ingest: data from RDBMS, data feeds, and files is loaded into HDFS using Flume, Sqoop, Put/Get, or ETL tools. Process: MapReduce, Pig, Hive, Mahout. Output: reports, machine learning, analytics, visualization, SAS, R.
  • 15. Data Analysis: Pig & Hive. Both are abstractions on top of MapReduce that generate MapReduce jobs in the backend; useful for analysts who are not programmers. Pig: data flow language; no schema; better with less structured data. Example: LOAD 'file' USING PigStorage('\t') AS (id, name); followed by FILTER, FOREACH, GROUP, ORDER, STORE. Hive: SQL-like language; schema and tables are stored in a metastore; supports joins. Example: CREATE TABLE customer (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; SELECT * FROM customer WHERE id < 100 LIMIT 10;
  • 18. Word count - Java • Copy input files to HDFS o hadoop fs -put file1.txt input • Create driver o Set configuration variables, mapper and reducer class names • Create mapper o Read input and emit key value pairs • Create reducer (optional) o Aggregate all values for a particular key • Execute o hadoop jar WordCount.jar WordCount input output • Analyze output o hadoop fs -cat output/* | head
  • 19. Word count - Streaming • Hadoop is written in Java. I don't know Java. What do I do? o Hadoop Streaming (Python, Ruby, R, etc.) • Copy input files to HDFS o hadoop fs -put file1.txt input • Create mapper o Read input stream (stdin) and emit (print) key value pairs • Create reducer (optional) o Aggregate all values for a particular key • Execute o hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py -input input -output output • Analyze output o hadoop fs -cat output/* | head
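The slide names mapper.py and reducer.py but does not show them. The following is a minimal sketch of the logic they might contain, assuming whitespace tokenization and lowercasing (neither is specified in the deck). The function names `map_lines` and `reduce_sorted` are hypothetical. Hadoop Streaming feeds lines on stdin, expects tab-separated key/value lines on stdout, and sorts mapper output by key before the reducer sees it; the last lines imitate that locally with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_sorted(lines):
    """Reducer logic: sum the counts for each word.

    Assumes the input lines are already sorted by key, as they are
    between the map and reduce phases of a streaming job.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: map, sort (stand-in for the shuffle), then reduce.
mapped = list(map_lines(["the cat", "The dog"]))
counts = list(reduce_sorted(sorted(mapped)))
```

In an actual mapper.py the body would loop over `map_lines(sys.stdin)` and print each result, and reducer.py would do the same with `reduce_sorted(sys.stdin)`.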
  • 20. Hadoop for R • RHadoop package o rmr o rhdfs o rhbase • Uses Hadoop Streaming • The example below determines how many countries have greater GDP than Apple's 2012 revenue:
      Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
      Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
      library(rmr2)
      library(rhdfs)
      setwd("/home/istvan/rhadoop/blogs/")
      gdp <- read.csv("GDP_converted.csv")
      head(gdp)
      hdfs.init()
      gdp.values <- to.dfs(gdp)
      # AAPL revenue in 2012 in millions USD
      aaplRevenue = 156508
      gdp.map.fn <- function(k,v) {
        key <- ifelse(v[4] < aaplRevenue, "less", "greater")
        keyval(key, 1)
      }
      count.reduce.fn <- function(k,v) {
        keyval(k, length(v))
      }
      count <- mapreduce(input=gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
      from.dfs(count)
    Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
  • 21. Search index example • Crawl web o Crawl and save websites to local directory • Ingest files to HDFS • Map o Split the words & associate words with file names • Reduce o Build an index with words and files & count of occurrences • Search o Pass the word to the index to get the files it shows up in. Display the file listing in descending order of number of occurrences of the word in a file
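The map, reduce, and search steps above can be sketched in memory as follows. This is an illustration under assumptions, not the code used in the talk: tokenization is a plain lowercase whitespace split, the function names are hypothetical, and the two sample documents are made up.

```python
from collections import Counter, defaultdict


def map_document(filename, text):
    """Map: split the text into words and pair each word with its file name."""
    for word in text.lower().split():
        yield word, filename


def reduce_index(pairs):
    """Reduce: build word -> {file name: number of occurrences}."""
    index = defaultdict(Counter)
    for word, filename in pairs:
        index[word][filename] += 1
    return index


def search(index, word):
    """Return (file, count) pairs, most occurrences first."""
    hits = index.get(word.lower(), Counter())
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample documents standing in for crawled pages.
docs = {"a.txt": "hadoop stores data", "b.txt": "hadoop hadoop runs jobs"}
pairs = [p for name, text in docs.items() for p in map_document(name, text)]
index = reduce_index(pairs)
results = search(index, "Hadoop")
```

Ties in the count are broken by file name so the listing is deterministic.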
  • 22. Recommender example • Use web server logs with user ratings info for items • Create Hive tables to build structure on top of this log data • Generate Mahout specific csv input file (user, item, rating) • Run Mahout to build item recommendations for users o mahout recommenditembased --input /user/hive/warehouse/mahout_input --output recommendations -s SIMILARITY_PEARSON_CORRELATION -n 20
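To give a feel for what a Pearson-correlation item similarity measures, here is a small sketch that correlates two items over the users who rated both. It works on (user, item, rating) triples like the CSV rows described above. It illustrates the similarity measure only and is not Mahout's implementation; the function names and sample ratings are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def item_similarity(ratings, item_a, item_b):
    """Correlate two items using only the users who rated both."""
    by_item = {}
    for user, item, rating in ratings:
        by_item.setdefault(item, {})[user] = float(rating)
    a, b = by_item.get(item_a, {}), by_item.get(item_b, {})
    shared = sorted(set(a) & set(b))
    return pearson([a[u] for u in shared], [b[u] for u in shared])


# Made-up ratings: users 1-3 rate items x, y, and z.
ratings = [
    (1, "x", 1), (2, "x", 2), (3, "x", 3),
    (1, "y", 2), (2, "y", 4), (3, "y", 6),
    (1, "z", 3), (2, "z", 2), (3, "z", 1),
]
sim_xy = item_similarity(ratings, "x", "y")
sim_xz = item_similarity(ratings, "x", "z")
```

Items whose ratings rise and fall together across users score near 1, and items rated in opposite directions score near -1; an item-based recommender favors items similar to those a user already rated highly.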
  • 23. Recap. Why Hadoop?: Data Growth, What is Big Data?, Hadoop usage. What is Hadoop?: Components, NoSQL, Cluster, Vendors, Tool Comparison. How to Hadoop?: Typical Implementation, Data Analysis with Pig & Hive, Opportunities. Demo: MapReduce deep dive, Wordcount, Search index, Recommendation Engine.
  • 24. Q & A Contact Siva Pandeti: Email: siva@pandeti.com LinkedIn: www.linkedin.com/in/SivaPandeti Twitter: @SivaPandeti http://pandeti.com/blog