Big Data Technologies - Hadoop
A new way to store and analyze data
Sandesh Deshmane
• What is Hadoop?
• Why, Where, When?
• Benefits of Hadoop
• How Hadoop Works?
• Hadoop Architecture
• HDFS
• Hadoop MapReduce
• Installation & Execution
• Demo
Topics Covered
In pioneer days they used oxen for heavy pulling, and when
one ox couldn’t budge a log, they didn’t try to grow a
larger ox. We shouldn’t be trying for bigger computers, but
more systems of computers.
—Grace Hopper
History
• The size of the digital universe was estimated at 0.18 zettabytes in 2006 and
had grown to 3 zettabytes by 2012
(1 zettabyte = 10^21 bytes = 1,000 exabytes = 1 million petabytes =
1 billion terabytes)
• The New York Stock Exchange generates 1 TB of data per day
• Facebook stores around 10 billion photos, around 1 petabyte
• The Internet Archive stores about 1 petabyte of data, and it is growing
by around 20 TB per month
Background
• Created by Doug Cutting
• Open source, under the Apache Software Foundation
• Consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System
(HDFS)
b. High-performance parallel data processing using a technique called
MapReduce
• Hadoop keeps large-scale, high-performance processing jobs running in
spite of system changes or failures
What is Hadoop?
• Need to process 100TB datasets
• On 1 node:
– scanning @ 50MB/s = 23 days
• On 1000 node cluster:
– scanning @ 50MB/s = 33 min
• Need an efficient, reliable, and usable framework
Hadoop, Why?
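The arithmetic behind those numbers: scanning 100 TB at 50 MB/s takes (100 × 10^6 MB) ÷ (50 MB/s) = 2 × 10^6 seconds ≈ 23 days on one node; spread across 1,000 nodes, each node scans its own share in about 2 × 10^3 seconds ≈ 33 minutes.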
Where
• Batch data processing, not
real-time / user facing (e.g.
Document Analysis and
Indexing, Web Graphs and
Crawling)
• Highly parallel, data-intensive
distributed applications
• Very large production
deployments
When
• Process lots of unstructured
data
• When your processing can
easily be made parallel
• Running batch jobs is
acceptable
• When you have access to lots
of cheap hardware
Where and When Hadoop?
• Runs on cheap commodity hardware
• Automatically handles data replication and node
failure
• It does the hard work – you can focus on processing
data
• Cost-saving, efficient, and reliable data
processing
Benefits of Hadoop
• Hadoop implements a computational paradigm named
Map/Reduce, where the application is divided into many small
fragments of work, each of which may be executed or re-executed
on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that
stores data on the compute nodes, providing very high aggregate
bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so
that node failures are automatically handled by the framework.
How Hadoop Works?
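As a concrete illustration of the paradigm (using the word-count example developed later in this deck): map takes a line such as "the cat sat on the mat" and emits the pairs (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1); the framework groups these pairs by key, and reduce sums each group to produce (the, 2), (cat, 1), (sat, 1), and so on.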
The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing.
Hadoop consists of:
• Hadoop Common: The common utilities that support the other
Hadoop subprojects.
• HDFS: A distributed file system that provides high throughput
access to application data.
• MapReduce: A software framework for distributed processing of
large data sets on compute clusters.
Hadoop Architecture
[Architecture diagram: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle DB / MySQL]
Hadoop Architecture
• Java (native)
• Python (via Hadoop Streaming)
• Ruby (via Hadoop Streaming)
• C++ (via Hadoop Pipes)
Supported Languages
• Known as the Hadoop Distributed File System
• Primary storage system for Hadoop applications
• Multiple replicas of data blocks are distributed across compute nodes for
reliability
• Files are stored on multiple machines for durability and high availability
HDFS
• Distributed File System = holds a large amount of data and
provides access to this data to many clients distributed
across a network, e.g. NFS
• HDFS stores a much larger amount of information than a traditional DFS
• HDFS stores data reliably
• HDFS provides fast, scalable access to this information for a
large number of clients in the cluster
DFS vs. HDFS
• Optimized for long sequential reads
• Data is written once and read multiple times; no appends are possible
• Large files and sequential reads, so no local caching of data
• Data replication
HDFS
HDFS Architecture
• Block-structured file system
• Each file is divided into blocks and stored
• Each individual machine in the cluster is a DataNode
• The default block size is 64 MB
• Information about the blocks is stored as metadata
• All this metadata is stored on a machine called the NameNode
HDFS Architecture
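To make the block and NameNode metadata above concrete, here is a minimal sketch of a client asking for a file's block locations through the Hadoop FileSystem API (the class name and path are placeholders, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Placeholder path; point this at any file already in HDFS
    FileStatus status = fs.getFileStatus(new Path("/user/username/bigfile"));
    // One BlockLocation per block; getHosts() names the DataNodes holding replicas
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + java.util.Arrays.toString(block.getHosts()));
    }
  }
}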
Data Node and Name Node
<configuration>
  <!-- URI clients use to reach the NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <!-- Local directory where a DataNode keeps its block data -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <!-- Local directory where the NameNode keeps its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>
HDFS Config File
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSHelloWorld {
  public static final String theFilename = "hello.txt";
  public static final String message = "Hello, world!\n";

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path filenamePath = new Path(theFilename);
    try {
      if (fs.exists(filenamePath)) {
        // remove the file first
        fs.delete(filenamePath);
      }
      // write the message, then read it back
      FSDataOutputStream out = fs.create(filenamePath);
      out.writeUTF(message);
      out.close();
      FSDataInputStream in = fs.open(filenamePath);
      String messageIn = in.readUTF();
      System.out.print(messageIn);
      in.close();
    } catch (IOException ioe) {
      System.err.println("IOException during operation: " + ioe.toString());
      System.exit(1);
    }
  }
}
Sample Java Code to Read/Write from HDFS
Map Reduce
Cluster Look
Map
Reduce
• HDFS handles the Distributed File System layer
• MapReduce is how we process the data
• MapReduce Daemons
JobTracker
TaskTracker
• Goals
Distribute the reading and processing of data
Localize the processing when possible
Share as little data as possible while processing
MapReduce
MapReduce
• One per cluster “master node”
• Takes jobs from clients
• Splits work into “tasks”
• Distributes “tasks” to TaskTrackers
• Monitors progress, deals with failures
Job Tracker
• Many per cluster “slave nodes”
• Does the actual work, executes the code for the job
• Talks regularly with JobTracker
• Launches child process when given a task
• Reports progress of running “task” back to JobTracker
Task Tracker
• Client submits a job:
I want to count the occurrences of each word
We will assume that the data to process is already in HDFS
• JobTracker receives the job
• Queries the NameNode for the number of blocks in the file
• The job is split into tasks
• One map task per block
• As many reduce tasks as specified in the job
• TaskTrackers check in regularly with the JobTracker
Is there any work for me?
• If the JobTracker has a map task whose input block is local to a TaskTracker,
that TaskTracker is given the task
Anatomy of Map Reduce Job
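A quick worked example of the split above: with the default 64 MB block size, a 10 GB (10,240 MB) input file occupies 10,240 ÷ 64 = 160 blocks, so the job gets 160 map tasks, plus however many reduce tasks the client configured.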
Map Reduce Job – Big Picture
Client Submits to JobTracker
JobTracker Queries Name Node for Block Info
Job tracker Defines Job as Collection of Tasks
Task Trackers Checking in are Assigned tasks
• Read text files and count how often words occur.
o The input is text files
o The output is a text file
- each line: word, tab, count
• Map: Produce pairs of (word, count)
• Reduce: For each word, sum up the counts.
Example of MapReduce - Word Count
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class MapClass extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (word, 1) for every token in the input line
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Map Class
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class ReduceClass extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  // Sums all the counts emitted for a given word
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Reduce Class
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class); // the keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // the values are counts (ints)
  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(ReduceClass.class);
  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));
  JobClient.runJob(conf);
}
Driver Class
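The slides stop at run(); below is a minimal sketch of an entry point that could wire command-line arguments into it (the WordCount class layout and argument handling are assumptions, not shown in the original deck):

public class WordCount {
  // Hypothetical entry point: validates args and delegates to run() above
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input path> <output path>");
      System.exit(1);
    }
    new WordCount().run(args[0], args[1]);
  }
  // MapClass, ReduceClass, and run() from the previous slides live in this class
}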
import static org.mockito.Mockito.*;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void processesValidRecord() throws IOException {
    MapClass mapper = new MapClass();
    Text value = new Text("test test");
    OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
    mapper.map(null, value, output, null);
    // The mapper emits ("test", 1) once per token, so expect two identical calls
    verify(output, times(2)).collect(new Text("test"), new IntWritable(1));
  }
}
Junit For Mapper
Junit for Reducer
import static org.mockito.Mockito.*;

import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.Test;

public class WordCountReducerTest {
  @Test
  public void returnsSumOfValues() throws IOException {
    ReduceClass reducer = new ReduceClass();
    Text key = new Text("test");
    Iterator<IntWritable> values = Arrays.asList(
        new IntWritable(1), new IntWritable(1)).iterator();
    OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
    reducer.reduce(key, values, output, null);
    verify(output).collect(key, new IntWritable(2));
  }
}
Installation:
• Requirements: Linux, Java 1.6, sshd
• Configure SSH for password-free
authentication
• Unpack Hadoop distribution
• Edit a few configuration files
• Format the DFS on the name
node
• Start all the daemon processes
Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with
relevant args
• Monitor tasks via Web interface
(optional)
• Examine output when job is
complete
Let’s Go…
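A minimal sketch of those steps as commands, assuming a Hadoop 0.x/1.x-style distribution layout; the jar name, main class, and paths are placeholders:

bin/hadoop namenode -format                 # format the DFS on the name node
bin/start-all.sh                            # start all the daemon processes
bin/hadoop fs -put local-input input        # copy input data into HDFS
bin/hadoop jar wordcount.jar WordCount input output
bin/hadoop fs -cat output/part-00000        # examine output when the job is complete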
Demo
Hadoop Users
• Adobe
• Alibaba
• Amazon
• AOL
• Facebook
• Google
• IBM
Major Contributors
• Apache
• Cloudera
• Yahoo
Hadoop Community
• Apache Hadoop (http://hadoop.apache.org)
• Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
• Free Search by Doug Cutting (http://cutting.wordpress.com)
• Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop)
• Cloudera - Apache Hadoop for the Enterprise (http://www.cloudera.com)
References