1. Building big data applications on Azure
Pranav Rastogi/ Bharath Sreenivas
Microsoft
pranav.rastogi@microsoft.com
@rustd/ @bharathbs
2.
3. Security and privacy
Flexibility of choice
Reason over any data, anywhere
Data warehouses
Data lakes
Operational databases
Hybrid
Data warehouses
Data lakes
Operational databases
Social, LOB, Graph, IoT, Image, CRM
6. Solution scenarios
Three scenarios that take optimal advantage of Big Data
Modern DW
“We want to incorporate all
of our data, including ‘big
data’, with our data
warehouse.”
Advanced Analytics
“We are trying to predict
when our customers churn.”
Internet of Things (IoT)
“We are trying to get insights
from our devices in real-time,
etc.”
7. Governance and
Master Data Management
Azure SQL Data Warehouse
Data Quality and
Lineage
ERP, CRM,
and other
LOB Data
OLTP and
other
RDBMS
Clickstream
Logs and
Events
Sensors,
Social,
Weather, and
other unstructured
data
ETL
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
Azure Analysis
Services
BI Models
Power BI
Reports and
Dashboards
Polybase
Analyst
Power User
Data Engineer
Data Scientist
Big Data Warehouse
8. OLTP and
other
RDBMS
Clickstream
Logs and
Events
Sensors,
Social,
Weather, and
other unstructured
data
REPL and
Machine
Learning Tools
Data
Wrangling
Tools
Data Engineer
Data Scientist
Deep Learning
& Cognitive
Services
Azure
Cosmos DB
Apps
Automated
Systems
People
Web
Mobile
Bots
ML Models
and Scoring
APIs
Advanced Analytics and AI
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
9. Azure Stream Analytics / Spark Streaming
Clean,
Curate,
Aggregate
Combine
reference
data
Perform
Scoring from
ML models
IoT Sensors
and/or
User
activity
streams
Social,
Trends,
Weather
etc.
Clickstream,
Batch Files,
server logs,
Images,
videos, and
other
unstructured
data
Azure Event Hubs,
Apache Kafka
Event
Broker/Buffer
Queue
Event
Broker
Power BI
Realtime
Dashboards
Analyst
Data Engineer
Data Scientist
Azure ML / R
Trained Machine
Learning Models
Azure SQL DB /
Cosmos DB
Reference Data
Automated
Systems
Realtime Processing with Lambda Architecture
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
10. Advanced analytics and big data impact all verticals
Heartland Bank prevents fraud
and boosts profits
The UK NHS transforms healthcare
with faster access to information.
City of Barcelona boosts citizen
engagement with intelligent app
Jet.com transforms customer engagement
with truly personalized experience
Rolls Royce decreases costs with
Predictive Maintenance
Manufacturing
Eliminate downtime and
increase efficiency by enabling
better predictive maintenance
for your capital assets.
Banking
Minimize losses with more
accurate fraud detection and
assess exposure to asset,
credit and market risk using a
holistic approach
Healthcare
Boost operational efficiency
and improve patient care
experience with intelligent
detection and in-time service.
Government
Empower citizens and
improve their engagement
with relevant information and
personalized citizen services.
Retail
Turn individual customer
interactions into contextual
engagements and increase
customer satisfaction with highly
personalized offers and content
11.
12. Managed Open Source Analytics for the
cloud with a 99.9% SLA.
100% Open Source
Clusters up and running in minutes
63% lower TCO than deploying your own Hadoop on-premises
Separation of compute and storage allows you to scale
clusters independently of your data to reduce costs
Open Source Analytics for the Enterprise
13. Big data is hard
Buy
Servers
Install
OSS
Secure
Configure
Optimize
Debug
Success
Scale up
14. HDInsight makes it easy
Provide
Cluster
details
HDInsight
Cluster
100% open source
Optimized
Highly available
Secure
Scalable
Dedicated
Managed
Certified ISVs
Customizable
Browse to
Azure Portal
15. Multi Region Availability
Available in >25 regions world-wide
Launched most recently in US West 2, and UK regions
Available in China, Europe and US Government clouds
Deploy Globally Within Minutes
16. Perimeter Level Security
Virtual Networks
Network Security Groups (firewalls)
Authentication
Azure Active Directory
Kerberos authentication
Authorization
Apache Ranger
RBAC for Admin
POSIX ACLs for Data Plane
Data Security
Server-Side encryption at rest
HTTPS/TLS In-transit
Security and Compliance to Enable OSS for Enterprises
17. Plugins for HDI available for most popular IDEs for agile
development and debugging
Rich support for powerful notebooks used by data
scientists
Develop in C#, deploy on Linux in Java via the
SCP.NET technology developed by HDInsight
Remote Debugging for Spark jobs
Rich Developer Ecosystem
18. Recognized by
Top Analysts
Forrester Wave for Big Data
Hadoop Cloud
• Named industry leader by
Forrester with the most
comprehensive, scalable, and
integrated platforms*
• Recognized for its cloud-first
strategy that is paying off*
*The Forrester Wave™: Big Data Hadoop Cloud Solutions, Q2 2016.
19. Products and Services Organization Size Industry Country Business Need
Simplified pricing process
now takes minutes instead
of days
Competitive pricing, product demand, the costs of materials, gas and
labor, and the thousands of other market variables affect product cost
and customer demand for products or services around the world. It’s
why accurate and profitable pricing represents one of the most
difficult business challenges for many companies. Manufacturing,
distribution, services, and airline companies look to the science and
technology provided by PROS to keep their pricing accurate,
competitive, and profitable. The PROS Guidance product runs
enormously complex pricing calculations based on variables that
comprise multiple terabytes of data. To handle this calculation
complexity and data volume, and then deliver specific results to its
clients quickly, PROS built its services on top of Azure HDInsight.
Business Need: Pricing Software-as-a-Service
Country: United States
Industry: Other - unsegmented
Organization Size: 1,000
Products and Services: Microsoft Azure, Azure HDInsight, Apache Spark for Azure HDInsight
20. HDInsight architecture
Hive meta store
Azure SQL database
Azure Storage or
Data Lake Store
Client
machines
HDInsight cluster
Gateway
nodes
Head
nodes
Worker
nodes
Edge
nodes
Zookeeper nodes
21. Scale compute & storage independently
Gateway
nodes
Head
nodes
Worker
nodes
Edge
nodes
Zookeeper nodes
Azure Blob Storage
or
Azure Data Lake
Store
22. Persist & Reuse your data
Your data is outside the
HDInsight cluster.
Hence data is persisted
even if you drop and
recreate the cluster.
Create multiple clusters
and point to same storage.
Azure Blob Storage
or
Azure Data Lake
Store
HDInsight
cluster
HDInsight
cluster
HDInsight
cluster
HDInsight
cluster
27. Azure
Blob
Storage
HDInsight Spark cluster
Azure SQL
Data Warehouse
Azure SQL
Database
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
Azure
Blob
Storage
Azure SQL
Data Warehouse
Azure Data Lake
Store
Azure Cosmos
DB
jobs
35. HDInsight Spark cluster
streaming jobs
Web app
Mobile
Azure
Blob
Storage
Kafka
Event Hub
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
HBase
push pull
Azure Redis Cache
Bot
41. Reads from
HDFS
Writes to
HDFS
Reads from
HDFS
Writes to
HDFS
Step 1
“mapper”
Step 2
“reducer”
Step 1
Reads and writes
from HDFS
Read 1 MB sequentially from disk: 20,000,000 ns
Read 1 MB sequentially from SSD: 1,000,000 ns
Read 1 MB sequentially from memory: 250,000 ns
44.
// Load the log file from Azure Blob Storage; sc is the SparkContext
val file = sc.textFile("wasb://...")
// Keep only the lines that contain "ERROR" (a lazy transformation)
val errors = file.filter(line => line.contains("ERROR"))
// Cache errors so later actions reuse the in-memory copy
errors.cache()
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
55. Azure
Blob
Storage
HDInsight Spark cluster
Azure SQL
Data Warehouse
Azure SQL
Database
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
Azure
Blob
Storage
Azure SQL
Data Warehouse
Azure Data Lake
Store
Azure Cosmos
DB
jobs
57. HDInsight Spark cluster
streaming jobs
Web app
Mobile
Azure
Blob
Storage
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
HBase
push pull
Azure Redis Cache
Bot
Power BI
real-time
dashboard
Kafka
Event Hub
68. Phone Tracking Across Cell Sites
Connected Car - Remote
Management & Diagnostics
Asset Tracking
Fleet Management
Facilities Management
Personnel Tracking & Crowd
Control
Ride Sharing
Geofencing
Racecar Telemetry
Connected Manufacturing
and many more…
69. Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption, BI/visualization)
Consume
(Alerts, Operational Stats,
Insights)
Big Data Architecture
Data Consumption
(Ingestion)
Data Processing
Presentation/Serving
Layer
70. Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption, BI/visualization)
Consume
(Alerts, Operational Stats,
Insights)
Big Data Architecture
Data Processing
REALTIME ANALYTICS
INTERACTIVE ANALYTICS
BATCH ANALYTICS
Machine Learning
(Spark + Azure ML)
(Failure and RCA
Predictions)
HDI + ISVs
OLAP for Data
Warehousing
HDI Custom ETL
Aggregate /Partition
PowerBI
dashboard
(Shared with field
Ops, customers,
MIS, and Engineers)
Realtime Machine Learning
(Anomaly Detection)
CosmosDB
Interactive HDInsight clusters
BIG DATA STORAGE ANALYTICS
Big Data Storage
Azure Data
Lake Store
CosmosDB Azure Blob
Storage
Data Scientists,
BI Analysts
Big Data Applications
80. Microsoft Databus (Siphon) Usage
8 million events per second peak ingress
800 TB (10 GB per sec) ingress per day
1,800 production Kafka brokers; 450 topics
15 sec 99th percentile latency
KEY CUSTOMER SCENARIOS
Ads Monetization (Fast BI)
O365 Customer Fabric NRT – Tenant & User insights
Bing NRT Operational Intelligence
Presto (Fast SQL) interactive analysis
Delve Analytics
[Chart: Siphon Data Volume (Ingress and Egress). Y axis: throughput in GBps, 0 to 45. X axis: Jan-15 to Dec-16. Series: Volume published (GBps), Volume subscribed (GBps).]
[Chart: Siphon Events per second (Ingress and Egress). Y axis: throughput in millions of events per sec, 0 to 25. X axis: Jan-15 to Dec-16. Series: EPS In, EPS Out.]
81. Asia DC
Zookeeper Canary
Kafka
Collector
Agent
Services Data Pull (Agent)
Services Data Push
Device Proxy Services
Consumer
API (Push/
Pull)
Europe DC
Zookeeper Canary
Kafka
US DC
Zookeeper Canary
Kafka
Streaming
Batch
Audit Trail
Open Source
Microsoft Internal
Siphon
82.
83.
84.
85.
86. Tool: Purpose
Ambari: Dashboard for monitoring health and status of the Hadoop cluster
YARN UI: Monitor YARN applications and logs
Tez View: Track and debug the execution of jobs
Grafana: Workload-specific JMX metrics
Spark History Server: Displays both completed and incomplete Spark jobs
HMaster UI: Web-based user interface that you can use to monitor your HBase cluster
Visual Studio / VS Code: Monitor job status in Visual Studio with Data Lake Tools; remote debugging of Spark jobs
87.
88.
89. OMS Agent for
Linux
HDInsight nodes (Head, Worker, Zookeeper)
FluentD
HDInsight
plugin
1. Plugin for ‘in_tail’ for all Logs, allows
regexp to create JSON object
2. Filter for WARN and above for each
Log Type. `grep` filter plugin
3. Output to out_oms_api Type
4. Exec plugin for Metrics
HBase
Config
omsconfig
Spark
Hive
Storm
Kafka
Config
Config
Config
Config
Log Analytics(OMS) Service
112. Transparent Server Side Encryption
Azure Data Lake Storage
ALWAYS ON transparent encryption
All reads/writes are encrypted/decrypted
Service managed keys as well as Customer
managed keys
Encryption @ Rest and Encryption in Transit
Microsoft Azure Storage Blob
ALWAYS ON transparent encryption
All reads/writes are encrypted/decrypted
Service managed keys as well as Customer managed keys
Encryption @ Rest and Encryption in Transit
All kinds of data are being generated
Stored on-premises and in the cloud, but the vast majority in hybrid environments
Reason over all this data without having to move it
They want a choice of platform and languages, privacy and security
<Transition> Microsoft’s offering
Objective: This slide describes how Apache Spark's architecture differs from Hadoop MapReduce, allowing it to share data between operations faster.
Table Source: https://gist.github.com/jboner/2841832
Talking points:
Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly, much more quickly than disk-based systems.
Spark integrates into the Scala programming language to let you manipulate distributed data sets like local collections. No need to structure everything as map and reduce operations.
Data sharing between operations is faster, since data is in-memory.
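Hadoop MapReduce shares data between steps through HDFS, an expensive option because every step writes to and reads from disk. HDFS also maintains three replicas of each block by default.
Spark keeps data in memory without replication by default.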
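To make the in-memory argument concrete, here is a small sketch of the arithmetic behind the latency table on this slide. It uses only the three figures quoted there; the object and method names are illustrative and not part of any Spark API.

```scala
// Latency of a 1 MB sequential read, in nanoseconds, as quoted on the slide.
object ReadLatency {
  val diskNs: Long   = 20000000L
  val ssdNs: Long    = 1000000L
  val memoryNs: Long = 250000L

  // How many times slower a medium is than main memory.
  // Integer division is exact for the three figures above.
  def slowdownVsMemory(ns: Long): Long = ns / memoryNs

  def main(args: Array[String]): Unit = {
    println(s"disk vs memory: ${slowdownVsMemory(diskNs)}x") // 80x
    println(s"SSD vs memory:  ${slowdownVsMemory(ssdNs)}x")  // 4x
  }
}
```

On these numbers, a step that re-reads its input from disk pays roughly 80 times the cost of reading the same data from memory, which is the gap the in-memory design targets.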
Objective: This slide explains the two types of operations that RDDs support: transformation and actions.
Talking points:
Transformations create a new data set from an existing data set.
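Transformations do not compute their results right away. They are only computed when an action requires a result to be returned to the driver program. A persisted (cached) RDD is the exception: once computed, it is kept in memory and reused by later actions instead of being recomputed.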
Examples include: map, filter, sample, union, and more.
Actions return a value to the driver program after running a computation on the data set.
Examples include: reduce, collect, count, first, foreach, and more.
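The lazy behaviour described above can be sketched without a cluster. The example below is an analogy in plain Scala, not the Spark API: a collection `view` stands in for an RDD, `filter` on the view plays the role of a transformation, and `size` plays the role of an action. The object name, the `run` helper, and the sample lines are all hypothetical.

```scala
// Plain-Scala analogy for lazy transformations; no Spark dependency.
object LazySketch {
  /** Returns (predicate runs before the action, error count, predicate runs after the action). */
  def run(lines: List[String]): (Int, Int, Int) = {
    var evaluations = 0 // counts how often the filter predicate actually executes

    // "Transformation": only describes the pipeline, nothing is evaluated yet.
    val errors = lines.view.filter { line =>
      evaluations += 1
      line.contains("ERROR")
    }
    val before = evaluations

    // "Action": forces the pipeline and returns a value to the caller.
    val count = errors.size
    (before, count, evaluations)
  }

  def main(args: Array[String]): Unit = {
    val sample = List("INFO start", "ERROR disk full", "WARN slow", "ERROR timeout")
    val (before, count, after) = run(sample)
    println(s"predicate runs before action: $before")
    println(s"errors found: $count, predicate runs after action: $after")
  }
}
```

The predicate has not run at all when the filter is declared; work happens only once a result is requested, which mirrors how Spark defers transformations until an action needs them.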
Objective: This slide shows an example of how transformations and actions are combined to search through error messages.
Talking points:
Cache errors – cache() is not an action; it marks the errors RDD to be kept in memory the first time an action computes it, so later actions reuse it
Count all errors – This action counts all the errors in the data
Count errors mentioning MySQL – This code filters the errors for "MySQL" and counts the matches
Fetch the MySQL errors as an array of strings – This code filters the errors for "MySQL" and returns the matches to the driver as an array of strings
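Why caching matters in this example can be shown with a rough plain-Scala analogy (again, not the Spark API). A lazy `view` stands in for an uncached RDD, and materializing it with `toVector` stands in for `cache()`. The names `CacheSketch` and `run`, and the sample log lines, are hypothetical.

```scala
// Plain-Scala analogy for caching an intermediate result; no Spark dependency.
object CacheSketch {
  /** Returns (source scans without caching, source scans with caching) for the same two actions. */
  def run(lines: List[String]): (Int, Int) = {
    var scans = 0 // counts how many source lines the "ERROR" filter inspects
    def errorsView = lines.view.filter { line =>
      scans += 1
      line.contains("ERROR")
    }

    // Uncached: every action walks the source lines again.
    val uncached = errorsView
    val total1 = uncached.size
    val mysql1 = uncached.count(_.contains("MySQL"))
    val withoutCache = scans

    // "Cached": materialize once, then run both actions on the in-memory copy.
    scans = 0
    val cached = errorsView.toVector
    val total2 = cached.size
    val mysql2 = cached.count(_.contains("MySQL"))
    require(total1 == total2 && mysql1 == mysql2) // same answers either way
    (withoutCache, scans)
  }

  def main(args: Array[String]): Unit = {
    val sample = List("INFO start", "ERROR MySQL timeout", "ERROR disk full", "ERROR MySQL refused")
    val (withoutCache, withCache) = run(sample)
    println(s"source scans without cache: $withoutCache, with cache: $withCache")
  }
}
```

Both variants return the same counts; the cached variant simply inspects the source fewer times, which is the saving `errors.cache()` buys when several actions reuse the same filtered data.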
Event Detection in Realtime
FINANCIAL ENGINES
CONNECTED CAR – SENSORS FIRE
Data Landing for Learning
Use cases
Connected Car Insurance companies for Connected Driving
What are the three Big components that You need to stand up when you
ASK:
Who knows what Lambda architecture is?
Who has helped implement one?
Walk through
VERTICALS
Ingest
Prep + Analyze
Serve
Consume
Horizontals
Drive by speed – realtime vs Batch
Let’s Walk through an example of this
We will demo this soon
TODO – add logos for Bing Ads, Office365, Delve Analytics
How to monitor all of our resources across subscriptions with a single pane of glass?
How to analyze Hadoop logs and metrics easily?
How to set up alerting?