These slides provide highlights of my book, HDInsight Essentials. The book is available here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
2. Goals of this Book
• Focus on Microsoft's new Hadoop distribution
• Serve as Quick Reference
• Provide an Overview of Hadoop
• Address both cloud and on-premise setup for HDInsight
• Highlight HDInsight differentiator
• Provide Practical & Real world examples
3. Book Table of Contents
• Chapter 1: HDInsight in a Heartbeat
• Chapter 2: Deploying HDInsight on premise
• Chapter 3: HDInsight Azure cloud service
• Chapter 4: Administer your cluster
• Chapter 5: Ingest data to your cluster
• Chapter 6: Transform data in your cluster
• Chapter 7: Analyze & Report data from cluster
• Chapter 8: Project Planning & Architectural Considerations
8. Hadoop Eco System
• Data Sources: RDBMS Databases; Audio, Images; Log Files; Sensors, RFID; Social Media, Feeds
• Hadoop Data Store: HDFS, HBase (NoSQL DB)
• Data Processing: MapReduce
• Data Access: Hive, Pig
• Mahout (Machine Learning)
• Flume, Sqoop
• Excel
• Business Data Feeds
• ZooKeeper (Distributed Process Management)
• HCatalog (Metadata on Pig, Hive, MapReduce)
• Oozie (Workflow, Scheduler)
• Infrastructure, Operations (Monitoring, Configuration)
9. End to End Solution on Hadoop
• Collect & Import to HDFS
• Process (MapReduce)
• Analyze (BI Tools)
• Report & Publish
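
The "Process (MapReduce)" stage of this flow can be illustrated with a small in-memory sketch. This is not Hadoop code and not taken from the book: it is plain Python that mimics the map, shuffle, and reduce steps on a word count, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in order, as a single-process stand-in for a job."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

On a real cluster the map and reduce steps run in parallel across nodes and the shuffle moves data between them; the sketch only shows the shape of the computation.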
11. HDInsight Differentiator
• Enterprise-ready Hadoop backed by Microsoft
• Analytics using Excel
• Integration with Active Directory
• Integration with .NET and JavaScript
• Connectors to RDBMS
• Scale using cloud offering: Azure HDInsight service enables customers to scale quickly and has a seamless interface between HDFS and Azure Storage Vault
• JavaScript Console
14. HDInsight Distribution
Apache Hadoop
• Open Source Software
• Community Development
Hortonworks Data Platform
• Enterprise Hadoop Platform (HDP)
• Leaders in Hadoop
• Code committers to Hadoop
Microsoft HDInsight
• Built on top of HDP
• Integration with ASV, Excel, Power View, SQL Server, Active Directory
15. Physical Install Options
• Single node for development/test
• Multi node for production
• Node roles: NN (NameNode), SNN (Secondary NameNode), JT (JobTracker), DN / TT (DataNode / TaskTracker)
26. Loading Data into your Cluster
You have the following options…
• Loading data using Hadoop commands
• Loading data using Azure Storage Vault
• Loading data using Interactive JavaScript
• Shipping data to your Cluster
• Loading data from RDBMS via Sqoop
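
Two of the options above, Hadoop commands and Sqoop, come down to command lines. The sketch below only builds the argument lists and does not run anything. The helper names, paths, table, and JDBC URL are hypothetical; `hadoop fs -put` and the `sqoop import` flags `--connect`, `--table`, `--username`, `--target-dir`, and `-P` are standard, but the executable names on a given HDInsight node may differ (an assumption to check locally).

```python
from typing import List


def hdfs_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build a `hadoop fs -put` command that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def sqoop_import_command(
    jdbc_url: str, table: str, target_dir: str, username: str
) -> List[str]:
    """Build a `sqoop import` command that pulls one RDBMS table into HDFS.

    `-P` asks Sqoop to prompt for the password instead of taking it on the
    command line.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--username", username,
        "--target-dir", target_dir,
        "-P",
    ]


if __name__ == "__main__":
    # Hypothetical file, directory, and database values, for illustration only.
    print(" ".join(hdfs_put_command("sales.csv", "/data/raw")))
    print(" ".join(sqoop_import_command(
        "jdbc:sqlserver://dbhost:1433;databaseName=sales",
        "orders", "/data/raw/orders", "etl_user",
    )))
```

On a machine with the Hadoop and Sqoop clients installed, either list could be passed to `subprocess.run(cmd, check=True)` to execute it.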
32. Hive or Pig?
Raw Data in HDFS
• Distributed Storage
• Reliable
Data Processing via Pig
• Pipelines
• Iterative Processing
• Research
Data Warehouse via Hive
• BI Tools
• Analysis
(Diagram labels: Data Warehouse, HDFS)