Introducing Hadoop on Azure
M Sheik Uduman Ali
Technical Architect, Aditi Technologies
Instead of reinventing the wheel, Microsoft has made a strong and
brilliant move by integrating Hadoop into its cloud computing PaaS
stack. LINQ2HPC was embraced by many .NET developers; however, a
Hadoop distribution for Windows is also the safest move. This paper
evaluates the early preview of Hadoop on Azure and covers the basics
of using it. It would be helpful to read about MapReduce and Hadoop
topology before learning about Hadoop on Azure.
For comments or questions regarding the content of this paper,
please contact
Sunny Neogi (sunnyn@aditi.com) or
Arun Kumar (arung@aditi.com)
www.aditi.com
Why do we need Hadoop?
The simple answer to this question is "big data analysis". Some examples of
big data analysis are:
• Calculating consumers' purchasing trends for particular product categories
from data that grows at a rate of 1 million transactions per hour
• Web application log analysis
• Internet search indexing
• Social network data analysis
Since relational databases and their ecosystem were designed around a
"scale-up" strategy with centralized data processing, they are not well suited
to the data warehousing space. Moreover, the data persisted by modern
applications is a mix of relational, structured and unstructured data. Hence,
we need a much more powerful system. Hadoop is one of the successful open
source platforms based on the MapReduce principle, which in turn follows the
"making big by small" philosophy.
Big data processing is called a "job" because it may run very frequently,
periodically, once in a while or only once. It is not meant to be part of the
day-to-day business.
What is MapReduce?
Basically, the input data is processed on "n" small physical nodes in a
clustered environment in two different phases:
• Map: The input data is grouped into <k1, v1> key-value pairs. For example,
if the input data resides in one or more files, k1 would be the file name and
v1 the file content. Hence, the map phase receives a list of <k1, v1> pairs
and distributes them across the available map nodes in the cluster. On every
node, the mapping function mostly performs "filtering and transformation" and
produces <k2, v2> pairs. For example, to count the occurrences of words in a
given set of documents, <filename, content> is the <k1, v1> pair, and the
nodes in the mapping phase count the words in the given v1. This generates
output like <"aditi", 1> as <k2, v2> for every occurrence of the word "Aditi"
in a document. Hence, the output of the mapping phase is a list of <k2, v2>
pairs, which may contain many <"aditi", 1> entries.
• Reduce: All <k2, v2> pairs are aggregated into <k2, list(v2)>. In the word
count example, the Hadoop cluster may produce <"aditi", List(1, 1, 1, 1)> from
all the documents on different nodes. Every list(v2) for a k2 is passed to a
node for reducing. The output is a list of <k3, v3> pairs. For example, if a
node receives "aditi" as k2, it adds up List(1, 1, 1, 1) as 1+1+1+1 and
produces 4 as v3. Here, k3 is again "aditi". Each reducer node does the same
for different words.
The grouping of <k2, v2> pairs by key is performed by the framework between
the two phases (the shuffle and sort step). An optional component called a
"combiner" can pre-aggregate the map output locally on each node before it is
sent to the reducers. For now, let us keep the focus on the mapper and reducer.
See figure 1.
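The word count example can be sketched in a few lines of Python. This is a single-process illustration of the data flow only, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are made up for this sketch, and in a real cluster each step would run on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(filename, content):
    """Map: <k1, v1> = <filename, content> -> list of <k2, v2> = <word, 1>."""
    return [(word.lower(), 1) for word in content.split()]


def shuffle(pairs):
    """Group every <k2, v2> pair by key into <k2, list(v2)>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(word, counts):
    """Reduce: <k2, list(v2)> -> <k3, v3> = <word, total>."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce over a {filename: content} dictionary."""
    mapped = []
    for filename, content in documents.items():
        mapped.extend(map_phase(filename, content))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical input files, chosen so that "aditi" occurs four times.
    docs = {
        "a.txt": "Aditi builds on Azure",
        "b.txt": "aditi aditi Hadoop",
        "c.txt": "Hadoop at Aditi",
    }
    print(word_count(docs))
```

With these three sample documents, the shuffle step yields <"aditi", [1, 1, 1, 1]> and the reduce step turns it into <"aditi", 4>, mirroring the example in the text.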
Hadoop Cluster
A Hadoop cluster is an infrastructure of many physical nodes, some of which
run "mapping" and some "reducing", along with nodes called "Name Node", "Job
Tracker", "Task Tracker" and "Data Node" that handle administration, job
tracking, task execution and data persistence respectively. This is a
master/slave architecture: the "Name Node" and "Job Tracker" are masters and
the remaining nodes are slaves. This is shown in figure 2.
In order to handle big data storage and processing, Hadoop uses HDFS as its
file system, which can handle even 100 TB of content as a single file.
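HDFS can treat such a large file as a single unit because it stores the file as a sequence of fixed-size blocks spread over the Data Nodes, while the Name Node keeps track of which blocks belong to which file. The following back-of-the-envelope sketch estimates the block count for a 100 TB file. The 64 MB block size and the replication factor of 3 are assumed values for illustration (both are configurable), and the function names are invented for this sketch.

```python
# Assumed values for illustration only; both are configurable in a real cluster.
BLOCK_SIZE_BYTES = 64 * 1024**2   # assumption: 64 MB per block
REPLICATION_FACTOR = 3            # assumption: three copies of every block


def block_count(file_size_bytes, block_size=BLOCK_SIZE_BYTES):
    """Number of blocks needed to hold the file (the last block may be partial)."""
    return -(-file_size_bytes // block_size)  # integer ceiling division


def stored_replicas(file_size_bytes, block_size=BLOCK_SIZE_BYTES,
                    replication=REPLICATION_FACTOR):
    """Total number of block copies kept across all Data Nodes."""
    return block_count(file_size_bytes, block_size) * replication


if __name__ == "__main__":
    one_hundred_tb = 100 * 1024**4  # 100 TB, counted in binary units
    print("blocks:", block_count(one_hundred_tb))
    print("block replicas:", stored_replicas(one_hundred_tb))
```

Under these assumptions, a 100 TB file is split into 1,638,400 blocks, which gives a sense of why the bookkeeping is delegated to a dedicated master node.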
Hadoop Ecosystem on Azure
Since every task is called a "job", you can rent the nodes required for your
job, use them and release them. Hence, the elastic computing and data storage
(blob and table storage) in Azure are a good choice for running your Hadoop
job. The homeland of Hadoop is Java, so at this early stage on Azure the
Hadoop Java SDK is one of the good options for your job. In addition,
"Hadoop on Azure" leverages the elasticity of Azure storage together with
Hadoop Streaming, by which you can write your job in C# or F# and use Azure
blob storage for data persistence (the scheme is called ASV). Figure 3 shows
the Hadoop ecosystem on Azure.
To create directories, get and put files, and issue some data processing
commands on HDFS/ASV, Azure provides an interactive JavaScript console. (In
the actual Hadoop distribution, Java is the main interface for this.) In
addition, Azure supports Hive (a SQL-like language in Hadoop) and Pig Latin
(a high-level data processing language).
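Hadoop Streaming works with any executable that reads lines from standard input and writes tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The paper names C# and F# for Azure; the sketch below uses Python purely to keep the example short, and the `map`/`reduce` command-line argument is a convention invented for this sketch.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key.

    Relies on the input being sorted by key, as streaming delivers it.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical entry point: run as 'script.py map' or 'script.py reduce'.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

Because both steps are plain functions over iterables of lines, they can be tried locally by sorting the mapper output and feeding it to the reducer before submitting anything to a cluster.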
The Web Portal for Hadoop on Azure
The management portal at www.hadooponazure.com is used to create, release and
renew clusters for your job. The following are the steps you need to perform
to run a job:
1. Develop the mapping and reducing functions either in Java or on your
preferred platform. On non-Windows platforms, this could be shell scripts,
Ruby, PHP, Python, etc. On Azure, you can write the code in .NET.
2. Decide where the input data and output of the job are to be managed:
either in HDFS or in Azure Blob storage.
3. Request a cluster for the job in the portal.
4. Specify all the parameters for the job, including the executable for the
job and the input and output details.
5. Run the job and get the output.
6. Release the cluster.
In this paper, let us look at step 3: how to create a cluster for a job.
Requesting a new Cluster
After you have logged in to the portal, you need to enter the following
details for the new cluster environment, as shown in figure 4:
• DNS name (<dnsname>.cloudapp.net)
• Cluster size, which is similar to the Azure role size: 4 nodes + 2 TB disk
space = small, 32 nodes + 16 TB = extra large
• Cluster login information
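The three details requested by the portal can be pictured as a small record. The sketch below is only a way to reason about the values; `ClusterRequest` and its methods are hypothetical names and do not correspond to any portal API, and the size table contains only the two sizes quoted in the text.

```python
from dataclasses import dataclass

# Only the two sizes quoted in the text; any other sizes are left out on purpose.
CLUSTER_SIZES = {
    "small": {"nodes": 4, "disk_tb": 2},
    "extra large": {"nodes": 32, "disk_tb": 16},
}


@dataclass
class ClusterRequest:
    """Hypothetical record of the values entered in the request form."""
    dns_name: str
    size: str
    login: str

    def host_name(self):
        """Full host name in the <dnsname>.cloudapp.net pattern."""
        return f"{self.dns_name}.cloudapp.net"

    def disk_per_node_tb(self):
        """Disk space per node for the chosen size, in TB."""
        spec = CLUSTER_SIZES[self.size]
        return spec["disk_tb"] / spec["nodes"]


if __name__ == "__main__":
    request = ClusterRequest(dns_name="wordcount-demo", size="small", login="admin")
    print(request.host_name(), request.disk_per_node_tb())
```

Both quoted sizes work out to the same 0.5 TB of disk per node, so for these two sizes the capacity scales with the node count.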
After entering these details, press the Request Cluster button. This creates
the cluster environment for your job. The screen shows the progress of
creating the new nodes for the cluster, as shown in figure 5.
After provisioning, you will see a screen like the one shown in figure 6.
You can start creating a new job, and if you want to access the environment
you can use either the "Interactive Console" or "Remote Desktop".
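A new job needs the executable plus the input and output details. For a Hadoop Streaming job these usually correspond to the streaming jar's `-input`, `-output`, `-mapper`, `-reducer` and `-file` options. The sketch below assembles such a command line as a list; the jar name, paths and executable names are placeholders, and whether the portal's job form maps onto these options one-to-one is an assumption.

```python
def streaming_args(streaming_jar, input_path, output_path, mapper, reducer, files=()):
    """Build the argument list for a Hadoop Streaming job submission."""
    args = [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    # Each -file ships one local file (such as the executables) with the job.
    for path in files:
        args += ["-file", path]
    return args


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    command = streaming_args(
        streaming_jar="hadoop-streaming.jar",
        input_path="/example/input",
        output_path="/example/output",
        mapper="WordCountMapper.exe",
        reducer="WordCountReducer.exe",
        files=["WordCountMapper.exe", "WordCountReducer.exe"],
    )
    print(" ".join(command))
```

Keeping the arguments in a list rather than one string avoids quoting problems if the command is later passed to a process launcher.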
When you click on new job, you will see the screen shown in figure 7.
Figure 7 shows a Hadoop Streaming based job.
------------------------------------------------------------
About the Author:
M Sheik Uduman Ali is a cloud architect at Aditi who has been involved in
cloud practices. He is a blogger and has published an online book about
"Domain Specific Languages in .NET".
ABOUT ADITI
Aditi helps product companies, web businesses and enterprises leverage the
power of cloud, e-social and mobile to drive competitive advantage. We are
Microsoft cloud partner of the year, one of the top 3 Platform-as-a-Service
solution providers globally and one of the top 5 Microsoft technology partners
in the US. We are passionate about emerging technologies and are focused on
custom development.
