HDInsight Essentials
ISBN: 1849695369 / ISBN 13: 9781849695367
Rajesh Nadipalli
05/01/2014
  
Goals of this Book
• Focus on Microsoft's new Hadoop distribution
• Serve as a quick reference
• Provide an overview of Hadoop
• Address both cloud and on-premises setup for HDInsight
• Highlight HDInsight differentiators
• Provide practical, real-world examples
  
Book Table of Contents
• Chapter 1: HDInsight in a Heartbeat
• Chapter 2: Deploying HDInsight on premise
• Chapter 3: HDInsight Azure cloud service
• Chapter 4: Administer your cluster
• Chapter 5: Ingest data to your cluster
• Chapter 6: Transform data in your cluster
• Chapter 7: Analyze & Report data from cluster
• Chapter 8: Project Planning & Architectural Considerations
  
CHAPTER 1 HIGHLIGHTS: HDINSIGHT IN A HEARTBEAT
  
Big Data Problem Characteristics
  	
  
Hadoop Overview
Self-Healing Distributed Storage, Fault-Tolerant Distributed Computing + Abstraction for Parallel Processing
CORE HADOOP COMPONENTS
• HDFS: Distributed storage; replicated, self-healing and scalable
• MapReduce: Parallel processing; processes local data for efficiency
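A toy Python sketch of the "replicated, self-healing" idea above. It is not HDFS's actual placement policy: the node names, the round-robin rule and the replication factor of 3 used here are assumptions chosen for illustration.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); otherwise replicas would repeat.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def heal(placement, failed, nodes, replication=3):
    """Drop a failed node and copy its blocks onto surviving nodes."""
    alive = [n for n in nodes if n != failed]
    healed = {}
    for block, holders in placement.items():
        holders = [n for n in holders if n != failed]
        for candidate in alive:
            if len(holders) >= replication:
                break
            if candidate not in holders:
                holders.append(candidate)
        healed[block] = holders
    return healed


if __name__ == "__main__":
    datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names
    layout = place_blocks(4, datanodes)
    print(heal(layout, "dn2", datanodes))
```

After the simulated failure every block is again held by three distinct nodes, which is the property the slide calls self-healing.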
  	
  
	
  
Hadoop Nodes Layout
• Master node: NameNode (distributed file system layer) and JobTracker (MapReduce layer)
• Secondary NameNode
• Slave nodes: each runs a DataNode (distributed file system layer) and a TaskTracker (MapReduce layer)
  
Hadoop Eco System
• Data Sources: RDBMS databases; audio, images; log files; sensors, RFID; social media, feeds
• Hadoop Data Store: HDFS, HBase (NoSQL DB)
• Data Processing: MapReduce
• Data Access: Hive, Pig, Mahout (machine learning)
• Flume, Sqoop
• Excel
• Business data feeds
• ZooKeeper (distributed process management)
• HCatalog (metadata on Pig, Hive, MapReduce)
• Oozie (workflow, scheduler)
• Infrastructure, operations (monitoring, configuration)
  
End to End Solution on Hadoop
Collect & Import to HDFS → Process (MapReduce) → Analyze (BI Tools) → Report & Publish
  
Popular Hadoop Distributions
• Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
• Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
• EMC Pivotal HD (http://gopivotal.com/)
• Hortonworks HDP (http://hortonworks.com/)
• MapR (http://mapr.com/)
• Microsoft HDInsight (cloud, http://www.windowsazure.com/)
  
HDInsight Differentiator
• Enterprise-ready Hadoop backed by Microsoft
• Analytics using Excel
• Integration with Active Directory
• Integration with .NET and JavaScript
• Connectors to RDBMS
• Scale using cloud offering: the Azure HDInsight service enables customers to scale quickly and has a seamless interface between HDFS and Azure Storage Vault
• JavaScript Console
  
WordCount in HDInsight
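The slide itself carries only a screenshot, so the following is a minimal in-memory sketch of the WordCount pattern in Python, not the code shown in the book. The function names and sample lines are assumptions; on a real cluster the framework performs the shuffle between map and reduce.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on hadoop", "hadoop on azure"]))
```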
  
CHAPTER 2 HIGHLIGHTS: HDINSIGHT INSTALL ON PREMISE
  
HDInsight Distribution
Apache Hadoop
• Open source software
• Community development
Hortonworks Data Platform
• Enterprise Hadoop Platform (HDP)
• Leaders in Hadoop
• Code committers to Hadoop
Microsoft HDInsight
• Built on top of HDP
• Integration with ASV, Excel, Power View, SQL Server, Active Directory
  
Physical Install Options
[Diagram: NN, SNN, JT; DN / TT]
• Single node for development/test
• Multi node for production
  	
  	
  
Multi Node Install Steps
• Prerequisites
• Networking setup
• Remote scripting
• Firewall setup
• Software install (each node)
• Hadoop configuration
• Verification
  
CHAPTER 3 HIGHLIGHTS: HDINSIGHT AZURE SERVICE
  
Azure Cloud Service
• Create storage
• Create HDInsight cluster
  
CHAPTER 4 HIGHLIGHTS: ADMINISTER YOUR CLUSTER
  
HDInsight Cluster Management
HDInsight Dashboard
NameNode Status
JobTracker Status
  
CHAPTER 5 HIGHLIGHTS: INGEST DATA INTO YOUR CLUSTER
  
Loading Data into your Cluster
You have the following options:
• Loading data using Hadoop commands
• Loading data using Azure Storage Vault
• Loading data using Interactive JavaScript
• Shipping data to your cluster
• Loading data from RDBMS via Sqoop
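As a hedged illustration of the first option, the Python sketch below builds a `hadoop fs -put` command, the standard Hadoop file system shell call for copying a local file into HDFS. The file name `sales.csv` and the target directory `/data/raw` are hypothetical, and the helper names are assumptions rather than anything from the book.

```python
import shutil
import subprocess


def put_command(local_path, hdfs_dir):
    """Build the argument list for copying a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def load_file(local_path, hdfs_dir):
    """Run the copy only when a `hadoop` executable is on the PATH."""
    exe = shutil.which("hadoop")
    if exe is None:
        return None  # no Hadoop client on this machine; nothing is executed
    cmd = [exe] + put_command(local_path, hdfs_dir)[1:]
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file and target directory, for illustration only.
    print(" ".join(put_command("sales.csv", "/data/raw")))
```

Building the argument list separately from running it keeps the command easy to inspect before it touches the cluster.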
  
Loading via Azure Storage Explorer
  
CHAPTER 6 HIGHLIGHTS: TRANSFORM YOUR DATA
  
Transforming Data
You have the following options:
• MapReduce
• Hive
• Pig
• Others
  
Processing Data in Cluster
Map for Jan2012, Map for Feb2012, Map for Apr2013, … → One Reducer
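The layout above, one map task per monthly input feeding a single reducer, can be sketched in Python as follows. The record values and the choice of a sum as the aggregate are assumptions for illustration; the slide does not say what the job computes.

```python
def map_month(month, records):
    """Map task for one monthly input: emit a single (key, partial_sum) pair."""
    return ("total", sum(records))


def reduce_all(pairs):
    """Single reducer: combine the partial sums from every map task."""
    total = 0
    for _key, partial in pairs:
        total += partial
    return total


def run_job(inputs):
    """Run one map task per month, then hand all outputs to one reducer."""
    mapped = [map_month(month, records) for month, records in inputs.items()]
    return reduce_all(mapped)


# Hypothetical per-month records keyed by the month labels on the slide.
monthly_inputs = {
    "Jan2012": [10, 20],
    "Feb2012": [5],
    "Apr2013": [7, 3],
}

if __name__ == "__main__":
    print(run_job(monthly_inputs))
```

Because every map task emits the same key, all partial results land on one reducer, which matches the "One Reducer" box in the diagram.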
Hive
• Clients: JDBC/ODBC, Command Line, Web GUI
• Thrift Server
• Metastore
• Driver (Parser, Planner, Executor)
• MapReduce
• HDFS
  
Hive or Pig?
Raw Data in HDFS
• Distributed storage
• Reliable
Data Processing via Pig
• Pipelines
• Iterative processing
• Research
Data Warehouse via Hive
• BI tools
• Analysis
(Diagram boxes: HDFS, Data Warehouse)
  
CHAPTER 7 HIGHLIGHTS: ANALYZE & REPORT
  
Analyze using Excel
  
CHAPTER 8: PROJECT PLANNING & ARCHITECTURAL CONSIDERATIONS
  
Executive & Stakeholder Buy-in → Discovery & Analysis → Design → Implementation → User Acceptance → Production → Operations → Feedback, New Requirements
  
