A Lap around Azure Data Factory
Martin Abbott
http://www.twitter.com/martinabbott
https://au.linkedin.com/in/mjabbott
About me
10+ years' experience
Integration, messaging and cloud person
Organiser of Perth Microsoft Cloud User Group
Member of GlobalAzure Bootcamp admin team
BizTalk developer and architect
Identity management maven
IoT enthusiast
Soon to be Australian Citizen
Agenda
Overview
Data movement
Data transformation
Development
Monitoring
Demonstration
General information
Overview of an Azure Data Factory
• Cloud based data integration
• Orchestration and transformation
• Automation
• Large volumes of data
• Part of Cortana Analytics Suite Information Management
• Fully managed service, scalable, reliable
Anatomy of an Azure Data Factory
An Azure Data Factory is made up of:
Linked services
• Represents either
• a data store
• File system
• On-premises SQL Server
• Azure storage
• Azure DocumentDB
• Azure Data Lake Store
• etc.
• a compute resource
• HDInsight (own or on demand)
• Azure Machine Learning Endpoint
• Azure Batch
• Azure SQL Database
• Azure Data Lake Analytics
Data sets
• Named references to data
• Used for both input and output
• Identifies structure
• Files, tables, folders, documents
• Internal or external
• Use SliceStart and SliceEnd system variables to create distinct slices on output data sets, e.g., a unique folder based on date
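The idea of a distinct output location per slice can be sketched in Python. This is only an illustration of the concept: `slice_folder` and the `year/month/day/hour` folder layout are hypothetical and are not part of Azure Data Factory.

```python
from datetime import datetime


def slice_folder(base: str, slice_start: datetime) -> str:
    """Return a folder path that is unique to the slice starting at slice_start.

    The base/yyyy/MM/dd/HH layout is an assumption made for this sketch.
    """
    return f"{base}/{slice_start:%Y/%m/%d/%H}"


# Three consecutive hourly slices land in three different folders.
for hour in (8, 9, 10):
    print(slice_folder("output", datetime(2015, 1, 1, hour)))
```

Because each slice writes to its own folder, rerunning one slice does not overwrite the output of another.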
Activities
• Define actions to perform on data
• Zero or more input data sets
• One or more output data sets
• Unit of orchestration of a pipeline
• Activities for
• data movement
• data transformation
• data analysis
• Use WindowStart and WindowEnd system variables to select relevant data using a tumbling window
Pipelines
• Logical grouping of activities
• Provides a unit of work that performs a task
• Can set active period to run in the past to back fill data slices
• Back filling can be performed in parallel
Scheduling
• Data sets have an availability
"availability": { "frequency": "Hour", "interval": 1 }
• Activities have a schedule (tumbling window)
"scheduler": { "frequency": "Hour", "interval": 1 }
• Pipelines have an active period
"start": "2015-01-01T08:00:00Z"
"end": "2015-01-01T11:00:00Z" OR
"end" = "start" + 48 hours if not specified OR
"end": "9999-09-09" for indefinite
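A rough Python sketch of how an active period breaks down into tumbling windows, using the rules on this slide. The function names `resolve_end` and `tumbling_windows` are hypothetical. The code only mimics the scheduling arithmetic and does not call any Data Factory API.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def resolve_end(start: datetime, end: Optional[datetime] = None) -> datetime:
    """Apply the default from the slide: end = start + 48 hours if not specified."""
    return end if end is not None else start + timedelta(hours=48)


def tumbling_windows(start: datetime, end: datetime,
                     interval_hours: int = 1) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into back-to-back, non-overlapping windows."""
    step = timedelta(hours=interval_hours)
    windows = []
    current = start
    while current + step <= end:
        windows.append((current, current + step))
        current += step
    return windows


start = datetime(2015, 1, 1, 8)
end = datetime(2015, 1, 1, 11)
for window_start, window_end in tumbling_windows(start, end):
    print(window_start.isoformat(), "->", window_end.isoformat())
```

With the start and end values from the slide and an hourly schedule, the active period covers three windows: 08:00 to 09:00, 09:00 to 10:00 and 10:00 to 11:00.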
Data Lineage / Dependencies
• How does Azure Data Factory know how to link pipelines?
• Uses input and output data sets
• On the Diagram in the portal, data lineage can be toggled on and off
• The external property is required (and the externalData policy is optional) for data sets created outside Azure Data Factory
• How does Azure Data Factory know how to link data sets that have different schedules?
• Uses startTime, endTime and the dependency model
Functions
• Rich set of functions to
• Specify data selection queries
• Specify input data set dependencies
• [startTime, endTime] – data set slice
• [f(startTime, endTime), g(startTime, endTime)] – dependency period
• Use system variables as parameters
• Functions for text formatting and date/time selection
• Text.Format('{0:yyyy}',WindowStart)
• Date.AddDays(SliceStart, -7 - Date.DayOfWeek(SliceStart))
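To make the two example expressions concrete, here is a Python approximation of what they compute. It assumes that Date.DayOfWeek follows the .NET convention (Sunday = 0 through Saturday = 6). The helper names are hypothetical and the code is not the Data Factory expression engine.

```python
from datetime import datetime, timedelta


def dotnet_day_of_week(value: datetime) -> int:
    """Map Python's Monday=0 numbering to the assumed .NET Sunday=0 numbering."""
    return (value.weekday() + 1) % 7


def previous_week_start(slice_start: datetime) -> datetime:
    """Mirror Date.AddDays(SliceStart, -7 - Date.DayOfWeek(SliceStart))."""
    return slice_start + timedelta(days=-7 - dotnet_day_of_week(slice_start))


def format_year(window_start: datetime) -> str:
    """Mirror Text.Format('{0:yyyy}', WindowStart)."""
    return f"{window_start:%Y}"


slice_start = datetime(2015, 1, 1)  # a Thursday
print(previous_week_start(slice_start).date())
print(format_year(slice_start))
```

Under that assumption, the date expression steps back to the Sunday that starts the week before the slice, which is useful when a slice depends on the previous week's data.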
Data movement
SOURCE | SINK
Azure Blob | Azure Blob
Azure Table | Azure Table
Azure SQL Database | Azure SQL Database
Azure SQL Data Warehouse | Azure SQL Data Warehouse
Azure DocumentDB | Azure DocumentDB
Azure Data Lake Store | Azure Data Lake Store
SQL Server on-premises / Azure IaaS | SQL Server on-premises / Azure IaaS
File System on-premises / Azure IaaS | File System on-premises / Azure IaaS
Oracle Database on-premises / Azure IaaS | -
MySQL Database on-premises / Azure IaaS | -
DB2 Database on-premises / Azure IaaS | -
Teradata Database on-premises / Azure IaaS | -
Sybase Database on-premises / Azure IaaS | -
PostgreSQL Database on-premises / Azure IaaS | -
Data movement
• Uses the Copy activity with either the Data Movement Service or, for on-premises or Azure IaaS, the Data Management Gateway
• Globally available service for data movement (except Australia)
• Executes at the sink location, unless the source is on-premises (or IaaS), in which case it uses the Data Management Gateway
• Exactly one input and exactly one output
• Support for securely moving between on-premises and the cloud
• Automatic type conversions from source to sink data types
• File based copy supports binary, text and Avro formats, and allows for conversion between formats
• Data Management Gateway supports multiple data sources but only a single Azure Data Factory
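As a sketch of the "exactly one input and exactly one output" rule, the Python below builds a Copy activity definition as a plain dictionary and checks its shape. The JSON property layout is an assumption based on the fragments shown in this deck. The data set names and the `BlobSource` / `SqlSink` type names are placeholders for illustration.

```python
import json


def make_copy_activity(name, input_dataset, output_dataset, source_type, sink_type):
    """Build a Copy activity definition as a dict (assumed JSON layout)."""
    return {
        "name": name,
        "type": "Copy",
        "inputs": [{"name": input_dataset}],
        "outputs": [{"name": output_dataset}],
        "typeProperties": {
            "source": {"type": source_type},
            "sink": {"type": sink_type},
        },
        "scheduler": {"frequency": "Hour", "interval": 1},
    }


def validate_copy_activity(activity):
    """Return a list of problems; an empty list means the basic shape is fine."""
    problems = []
    if len(activity.get("inputs", [])) != 1:
        problems.append("Copy activity needs exactly one input")
    if len(activity.get("outputs", [])) != 1:
        problems.append("Copy activity needs exactly one output")
    return problems


# Hypothetical data set names and source/sink types, for illustration only.
activity = make_copy_activity(
    "CopyBlobToSql", "InputBlob", "OutputSqlTable", "BlobSource", "SqlSink"
)
print(json.dumps(activity, indent=2))
print(validate_copy_activity(activity))
```

Keeping the check separate from the builder means it can also be run over JSON artefacts loaded from source control before they are published.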
[Diagram: Source → WAN → Data Movement Service (serialisation-deserialisation, compression, column mapping, …) → WAN → Sink]
[Diagram: Source → LAN/WAN → Data Management Gateway (serialisation-deserialisation, compression, column mapping, …) → LAN/WAN → Sink]
Data analysis and transformation
TRANSFORMATION ACTIVITY | COMPUTE ENVIRONMENT
Hive | HDInsight [Hadoop]
Pig | HDInsight [Hadoop]
MapReduce | HDInsight [Hadoop]
Hadoop Streaming | HDInsight [Hadoop]
Machine Learning activities: Batch Execution and Update Resource | Azure VM
Stored Procedure | Azure SQL Database
Data Lake Analytics U-SQL | Azure Data Lake Analytics
DotNet | HDInsight [Hadoop] or Azure Batch
Data analysis and transformation
• Two types of compute environment
• On-demand: Data Factory fully manages environment, currently HDInsight only
• Set timeToLive to set allowed idle time once job finishes
• Set osType for Windows or Linux
• Set clusterSize to determine number of nodes
• Provisioning an HDInsight cluster on-demand can take some time
• Bring your own: Register own computing environment for use as a linked service
• HDInsight Linked Service
• clusterUri, username, password and location
• Azure Batch Linked Service
• accountName, accessKey and poolName
• Machine Learning Linked Service
• mlEndpoint and apiKey
• Data Lake Analytics Linked Service
• accountName, dataLakeAnalyticsUri and authorization
• Azure SQL Database Linked Service
• connectionString
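The property lists above lend themselves to a simple pre-deployment check. The Python sketch below records the properties named on this slide for each bring-your-own linked service and reports any that are missing. The dictionary keys used as service type names are assumptions, and `missing_properties` is a hypothetical helper.

```python
# Properties per linked service, as listed on the slide.
# The keys (service type names) are assumptions made for this sketch.
REQUIRED_PROPERTIES = {
    "HDInsight": ["clusterUri", "username", "password", "location"],
    "AzureBatch": ["accountName", "accessKey", "poolName"],
    "AzureML": ["mlEndpoint", "apiKey"],
    "AzureDataLakeAnalytics": ["accountName", "dataLakeAnalyticsUri", "authorization"],
    "AzureSqlDatabase": ["connectionString"],
}


def missing_properties(service_type, type_properties):
    """Return the listed properties that are absent from type_properties."""
    return [
        name
        for name in REQUIRED_PROPERTIES[service_type]
        if name not in type_properties
    ]


# Placeholder values only; a real definition would carry real credentials.
batch = {"accountName": "mybatchaccount", "accessKey": "<key>"}
print(missing_properties("AzureBatch", batch))
```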
Development
• JSON for all artefacts
• Ease of management by source control
• Can be developed using:
• Data Factory Editor
• In Azure Portal
• Create and deploy artefacts
• PowerShell
• Cmdlets for each main function in Azure Resource Manager (ARM) PowerShell
• Visual Studio
• Azure Data Factory Templates
• .NET SDK
Visual Studio
• Rich set of templates including
• Sample applications
• Data analysis and transformation using Hive and Pig
• Data movement between typical environments
• Can include sample data
• Can create Azure Data Factory, storage and compute resources
• Can Publish to Azure Data Factory
• No toolbox, mostly hand crafting JSON
Tips and Tricks with Visual Studio Templates
• Something usually fails
• Issues with sample data
• Run once to create Data Factory and storage accounts
• Usually first run will also create a folder containing Sample Data but NO JSON artifacts
• May need to manually edit PowerShell scripts or perform manual upload
• Once corrected, deselect Sample Data and run again creating new solution
• Ensure Publish to Data Factory is deselected and JSON artifacts are created
• Issues with Data Factory deployment
• Go to portal and check what failed
• May need to manually create the item by deleting the published item and recreating it with JSON from the project
• When deploying, may need to deselect the item that is failing
• You cannot delete from the project
• Need to Exclude From Project
• Once excluded can delete from disk
Deployment
• Add Config files to your Visual Studio project
• Deployment files contain, for instance, connection strings to resources that are replaced at Publish time
• Add deployment files for each environment you are deploying to, e.g., Dev, UAT, Prod
• When publishing to Azure Data Factory, choose the appropriate Config file to ensure the correct settings are applied
• Publish only the artefacts required
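The replacement step can be pictured with a small Python sketch. The override format used here, a list of `name` / `value` pairs where `name` is a dotted path starting with `$.`, is a simplification chosen for illustration and may differ from what the Visual Studio tooling actually uses. `apply_overrides` and the connection strings are hypothetical.

```python
import copy


def apply_overrides(artefact, overrides):
    """Return a copy of a JSON artefact with per-environment values applied.

    Each override is {"name": "$.a.b.c", "value": ...}; the path syntax is an
    assumption made for this sketch.
    """
    result = copy.deepcopy(artefact)
    for override in overrides:
        path = override["name"]
        if path.startswith("$."):
            path = path[2:]
        keys = path.split(".")
        target = result
        for key in keys[:-1]:
            target = target[key]
        target[keys[-1]] = override["value"]
    return result


linked_service = {
    "name": "StorageLinkedService",
    "properties": {"typeProperties": {"connectionString": "dev-connection"}},
}
prod_overrides = [
    {"name": "$.properties.typeProperties.connectionString", "value": "prod-connection"}
]
print(apply_overrides(linked_service, prod_overrides))
```

Working on a copy keeps the artefact in source control unchanged, so the same JSON can be published to Dev, UAT and Prod with different settings.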
Monitoring
• Data slices may fail
• Drill in to errors, diagnose, fix and rerun
• Failed data slices can be rerun and all dependencies are managed by Azure Data Factory
• Upstream slices that are Ready stay available
• Downstream slices that are dependent stay Pending
• Enable diagnostics to produce logs, disabled by default
• Add Alerts for Failed or Successful Runs to receive email notification
Demonstration
General information
Pricing – Low frequency ( <= 1 / day )
ENVIRONMENT | USAGE | PRICE
Cloud | First 5 activities/month | Free
Cloud | 6 – 100 activities/month | $0.60 per activity
Cloud | >100 activities/month | $0.48 per activity
On-Premises | First 5 activities/month | Free
On-Premises | 6 – 100 activities/month | $1.50 per activity
On-Premises | >100 activities/month | $1.20 per activity
* Pricing in USD correct at 4 December 2015
Pricing – High frequency ( > 1 / day )
ENVIRONMENT | USAGE | PRICE
Cloud | <= 100 activities/month | $0.80 per activity
Cloud | >100 activities/month | $0.64 per activity
On-Premises | <= 100 activities/month | $2.50 per activity
On-Premises | >100 activities/month | $2.00 per activity
* Pricing in USD correct at 4 December 2015
Pricing – Data movement
Cloud | $0.25 per hour
On-Premises | $0.10 per hour
Pricing – Inactive pipeline
$0.80/month
* Pricing in USD correct at 4 December 2015
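For a rough feel of what these activity prices mean per month, here is a small Python estimator. It assumes the tiers are graduated, so each band's price applies only to the activities that fall inside that band. The slides do not state this, so treat the results as indicative only. The prices are the December 2015 figures quoted above, and `monthly_activity_cost` is a hypothetical helper.

```python
from decimal import Decimal

# (upper bound of the band, price per activity); None means no upper bound.
# Prices in USD as quoted on the slides (4 December 2015).
TIERS = {
    ("low", "cloud"): [(5, Decimal("0")), (100, Decimal("0.60")), (None, Decimal("0.48"))],
    ("low", "on-premises"): [(5, Decimal("0")), (100, Decimal("1.50")), (None, Decimal("1.20"))],
    ("high", "cloud"): [(100, Decimal("0.80")), (None, Decimal("0.64"))],
    ("high", "on-premises"): [(100, Decimal("2.50")), (None, Decimal("2.00"))],
}


def monthly_activity_cost(activities, frequency, location):
    """Estimate the monthly activity charge, assuming graduated bands."""
    cost = Decimal("0")
    remaining = activities
    lower = 0
    for upper, price in TIERS[(frequency, location)]:
        if remaining <= 0:
            break
        # Number of activities that fall inside this band.
        in_band = remaining if upper is None else min(remaining, upper - lower)
        cost += in_band * price
        remaining -= in_band
        if upper is not None:
            lower = upper
    return cost


print(monthly_activity_cost(10, "low", "cloud"))
print(monthly_activity_cost(120, "high", "on-premises"))
```

Data movement hours and inactive pipeline charges are billed separately and are not included in this estimate.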
Summary
• Use Azure Data Factory if:
• Dealing with Big Data
• Source or destination is in the cloud
• Cut down environment cost
• Cut down administration cost
• Azure is on one side of the movement / transformation
• Consider hybrid scenarios with other data management tools, for example
SQL Server Integration Services
More Information
• Documentation portal
• https://azure.microsoft.com/en-us/documentation/services/data-factory/
• Learning map
• https://azure.microsoft.com/en-us/documentation/articles/data-factory-learning-map/
• Samples on GitHub
• https://github.com/Azure/Azure-DataFactory
Thank you!

A lap around Azure Data Factory

  • 1. Sponsored & Brought to you by A Lap around Azure Data Factory Martin Abbott http://www.twitter.com/martinabbott https://au.linkedin.com/in/mjabbott
  • 2. A Lap around Azure Data Factory Martin Abbott @martinabbott
  • 3. About me 10+ years experience Integration, messaging and cloud person Organiser of Perth Microsoft Cloud User Group Member of GlobalAzure Bootcamp admin team BizTalk developer and architect Identity management maven IoT enthusiast Soon to be Australian Citizen
  • 5. Overview of an Azure Data Factory
  • 6. Overview of an Azure Data Factory • Cloud based data integration • Orchestration and transformation • Automation • Large volumes of data • Part of Cortana Analytics Suite Information Management • Fully managed service, scalable, reliable
  • 7. Anatomy of an Azure Data Factory An Azure Data Factory is made up of:
  • 8. Linked services • Represents either • a data store • File system • On-premises SQL Server • Azure storage • Azure DocumentDB • Azure Data Lake Store • etc. • a compute resource • HDInsight (own or on demand) • Azure Machine Learning Endpoint • Azure Batch • Azure SQL Database • Azure Data Lake Analytics
  • 9. Data sets • Named references to data • Used for both input and output • Identifies structure • Files, tables, folders, documents • Internal or external • Use SliceStart and SliceEnd system variables to create distinct slices on output data sets, e.g., unique folder based on date
  • 10. Activities • Define actions to perform on data • Zero or more input data sets • One or more output data sets • Unit of orchestration of a pipeline • Activities for • data movement • data transformation • data analysis • Use WindowStart and WindowEnd system variables to select relevant data using a tumbling window
  • 11. Pipelines • Logical grouping of activities • Provides a unit of work that performs a task • Can set active period to run in the past to back fill data slices • Back filling can be performed in parallel
  • 12. Scheduling • Data sets have an availability: "availability": { "frequency": "Hour", "interval": 1 } • Activities have a schedule (tumbling window): "scheduler": { "frequency": "Hour", "interval": 1 } • Pipelines have an active period: "start": "2015-01-01T08:00:00Z", "end": "2015-01-01T11:00:00Z" OR "end" = "start" + 48 hours if not specified OR "end": "9999-09-09" for indefinite
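The scheduling rules on this slide can be sketched in a few lines of Python. This is only an illustration of how a tumbling window divides a pipeline's active period into slices, not how the service implements it; the function name `tumbling_windows` and the set of supported frequencies are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Step sizes for the frequencies this sketch supports (an assumption; the
# service may accept other frequencies that are not fixed-length).
_STEP = {
    "Minute": timedelta(minutes=1),
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
}


def tumbling_windows(start, end=None, frequency="Hour", interval=1):
    """Return consecutive, non-overlapping (window_start, window_end) pairs."""
    if end is None:
        # The slide states that "end" defaults to "start" + 48 hours.
        end = start + timedelta(hours=48)
    step = _STEP[frequency] * interval
    windows = []
    cursor = start
    while cursor + step <= end:
        windows.append((cursor, cursor + step))
        cursor += step
    return windows


# The active period from the slide: 08:00 to 11:00 UTC on 1 January 2015.
start = datetime(2015, 1, 1, 8, tzinfo=timezone.utc)
end = datetime(2015, 1, 1, 11, tzinfo=timezone.utc)
slices = tumbling_windows(start, end)
for window_start, window_end in slices:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

With an hourly schedule, the three-hour active period yields three slices; omitting `end` falls back to the 48-hour default.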
  • 13. Data Lineage / Dependencies • How does Azure Data Factory know how to link Pipelines? • Uses Input and Output data sets • On the Diagram in portal, can toggle data lineage on and off • external required (and externalData policy optional) for data sets created outside Azure Data Factory • How does Azure Data Factory know how to link data sets that have different schedules? • Uses startTime, endTime and dependency model
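The idea of linking pipelines through their input and output data sets can be illustrated with a small Python sketch. The pipeline and data set names below are hypothetical, and the functions only mimic the matching rule described on the slide; they are not part of any Azure SDK.

```python
# Hypothetical pipelines, each reduced to the data sets it reads and writes.
pipelines = {
    "IngestLogs": {"inputs": ["RawLogs"], "outputs": ["StagedLogs"]},
    "AggregateLogs": {"inputs": ["StagedLogs"], "outputs": ["DailySummary"]},
    "PublishReport": {"inputs": ["DailySummary"], "outputs": ["ReportTable"]},
}


def lineage_edges(pipelines):
    """Link producer to consumer wherever one pipeline's output is another's input."""
    producers = {}
    for name, spec in pipelines.items():
        for dataset in spec["outputs"]:
            producers[dataset] = name
    edges = []
    for name, spec in pipelines.items():
        for dataset in spec["inputs"]:
            if dataset in producers:
                edges.append((producers[dataset], dataset, name))
    return edges


def external_datasets(pipelines):
    """Data sets that are consumed but never produced, so they must be external."""
    produced = {d for spec in pipelines.values() for d in spec["outputs"]}
    consumed = {d for spec in pipelines.values() for d in spec["inputs"]}
    return sorted(consumed - produced)


edges = lineage_edges(pipelines)
external = external_datasets(pipelines)
print(edges)
print(external)
```

Here `RawLogs` is read but never written by any pipeline, which corresponds to a data set that would need to be marked external.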
  • 14. Functions • Rich set of functions to • Specify data selection queries • Specify input data set dependencies • [startTime, endTime] – data set slice • [f(startTime, endTime), g(startTime, endTime)] – dependency period • Use system variables as parameters • Functions for text formatting and date/time selection • Text.Format('{0:yyyy}',WindowStart) • Date.AddDays(SliceStart, -7 - Date.DayOfWeek(SliceStart))
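To make the two example functions concrete, the following Python sketch computes roughly what they would evaluate to for a given slice. It assumes that Date.DayOfWeek follows the .NET convention of Sunday = 0, which the slide does not state; the helper names are invented for the illustration.

```python
from datetime import datetime, timedelta


def text_format_year(window_start):
    """Rough equivalent of Text.Format('{0:yyyy}', WindowStart)."""
    return window_start.strftime("%Y")


def dotnet_day_of_week(moment):
    """Day of week with Sunday = 0 (assumed .NET convention); Python uses Monday = 0."""
    return (moment.weekday() + 1) % 7


def previous_week_start(slice_start):
    """Rough equivalent of Date.AddDays(SliceStart, -7 - Date.DayOfWeek(SliceStart))."""
    return slice_start + timedelta(days=-7 - dotnet_day_of_week(slice_start))


slice_start = datetime(2015, 1, 7)  # a Wednesday
year = text_format_year(slice_start)
week_start = previous_week_start(slice_start)
print(year, week_start.date())
```

Under that assumption, the second expression steps back to the Sunday that starts the week before the slice, which is how a dependency period on the previous week's data could be selected.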
  • 16. Data movement • Supported as both SOURCE and SINK: Azure Blob, Azure Table, Azure SQL Database, Azure SQL Data Warehouse, Azure DocumentDB, Azure Data Lake Store, SQL Server on-premises / Azure IaaS, File System on-premises / Azure IaaS • Supported as SOURCE only: Oracle Database on-premises / Azure IaaS, MySQL Database on-premises / Azure IaaS, DB2 Database on-premises / Azure IaaS, Teradata Database on-premises / Azure IaaS, Sybase Database on-premises / Azure IaaS, PostgreSQL Database on-premises / Azure IaaS
  • 17. Data movement • Uses the Copy activity and Data Movement Service or Data Management Gateway (for on-premises or Azure IaaS) • Globally available service for data movement (except Australia) • Executes at sink location, unless source is on-premises (or IaaS) then uses Data Management Gateway • Exactly one input and exactly one output • Support for securely moving between on-premises and the cloud • Automatic type conversions from source to sink data types • File based copy supports binary, text and Avro formats, and allows for conversion between formats • Data Management Gateway supports multiple data sources but only a single Azure Data Factory • [Diagram: Source -> WAN -> Data Movement Service (serialisation-deserialisation, compression, column mapping, …) -> WAN -> Sink; Source -> LAN/WAN -> Data Management Gateway (serialisation-deserialisation, compression, column mapping, …) -> LAN/WAN -> Sink]
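As a rough illustration of the "exactly one input and exactly one output" rule, the Python sketch below builds a Copy activity as a dictionary and checks its shape. The property layout and the `BlobSource` / `SqlSink` type names follow the JSON format used by Data Factory at the time as best recalled, so treat them as assumptions; the activity and data set names are hypothetical.

```python
import json


def make_copy_activity(name, input_dataset, output_dataset, source_type, sink_type):
    """Build a dictionary shaped like a Copy activity definition (assumed layout)."""
    return {
        "name": name,
        "type": "Copy",
        "inputs": [{"name": input_dataset}],
        "outputs": [{"name": output_dataset}],
        "typeProperties": {
            "source": {"type": source_type},
            "sink": {"type": sink_type},
        },
        "scheduler": {"frequency": "Hour", "interval": 1},
    }


def validate_copy_activity(activity):
    """Return a list of problems; an empty list means the shape looks valid."""
    problems = []
    if activity.get("type") != "Copy":
        problems.append("type must be Copy")
    for key in ("inputs", "outputs"):
        count = len(activity.get(key, []))
        if count != 1:
            problems.append(f"{key} must contain exactly one data set, found {count}")
    return problems


activity = make_copy_activity(
    "BlobToSql", "InputBlob", "OutputSqlTable", "BlobSource", "SqlSink"
)
print(json.dumps(activity, indent=2))
print(validate_copy_activity(activity))
```

A check like this could run before publishing, so that a Copy activity with two inputs is caught locally rather than at deployment.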
  • 18. Data analysis and transformation
  • 19. Data analysis and transformation • TRANSFORMATION ACTIVITY: COMPUTE ENVIRONMENT • Hive: HDInsight [Hadoop] • Pig: HDInsight [Hadoop] • MapReduce: HDInsight [Hadoop] • Hadoop Streaming: HDInsight [Hadoop] • Machine Learning activities (Batch Execution and Update Resource): Azure VM • Stored Procedure: Azure SQL Database • Data Lake Analytics U-SQL: Azure Data Lake Analytics • DotNet: HDInsight [Hadoop] or Azure Batch
  • 20. Data analysis and transformation • Two types of compute environment • On-demand: Data Factory fully manages environment, currently HDInsight only • Set timeToLive to set allowed idle time once job finishes • Set osType for windows or linux • Set clusterSize to determine number of nodes • Provisioning an HDInsight cluster on-demand can take some time • Bring your own: Register own computing environment for use as a linked service • HDInsight Linked Service • clusterUri, username, password and location • Azure Batch Linked Service • accountName, accessKey and poolName • Machine Learning Linked Service • mlEndpoint and apiKey • Data Lake Analytics Linked Service • accountName, dataLakeAnalyticsUri and authorization • Azure SQL Database Linked Service • connectionString
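The per-service property lists above lend themselves to a simple pre-deployment check. The Python sketch below records the properties named on the slide and reports any that are missing from a linked service definition; the dictionary keys used to identify each service kind are labels invented for this example, not official type names.

```python
# Properties listed on the slide for each "bring your own" linked service.
# The keys are hypothetical labels chosen for this sketch.
REQUIRED_PROPERTIES = {
    "HDInsight": ["clusterUri", "username", "password", "location"],
    "AzureBatch": ["accountName", "accessKey", "poolName"],
    "MachineLearning": ["mlEndpoint", "apiKey"],
    "DataLakeAnalytics": ["accountName", "dataLakeAnalyticsUri", "authorization"],
    "AzureSqlDatabase": ["connectionString"],
}


def missing_properties(kind, type_properties):
    """Return the required property names that are absent or empty."""
    return [
        name
        for name in REQUIRED_PROPERTIES[kind]
        if not type_properties.get(name)
    ]


# Placeholder values only; no real account details are implied.
batch = {"accountName": "<account>", "accessKey": "<key>"}
ml = {"mlEndpoint": "<endpoint>", "apiKey": "<key>"}

print(missing_properties("AzureBatch", batch))
print(missing_properties("MachineLearning", ml))
```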
  • 22. Development • JSON for all artefacts • Ease of management by source control • Can be developed using: • Data Factory Editor • In Azure Portal • Create and deploy artefacts • PowerShell • Cmdlets for each main function in PS ARM • Visual Studio • Azure Data Factory Templates • .NET SDK
  • 23. Visual Studio • Rich set of templates including • Sample applications • Data analysis and transformation using Hive and Pig • Data movement between typical environments • Can include sample data • Can create Azure Data Factory, storage and compute resources • Can Publish to Azure Data Factory • No toolbox, mostly hand crafting JSON
  • 24. Tips and Tricks with Visual Studio Templates • Something usually fails • Issues with sample data • Run once to create Data Factory and storage accounts • Usually first run will also create a folder containing Sample Data but NO JSON artifacts • May need to manually edit PowerShell scripts or perform manual upload • Once corrected, deselect Sample Data and run again creating new solution • Ensure Publish to Data Factory is deselected and JSON artifacts are created • Issues with Data Factory deployment • Go to portal and check what failed • May need to manually create the item by deleting the published item and recreating it with JSON from the project • When deploying, may need to deselect the item that is failing • You cannot delete from the project • Need to Exclude From Project • Once excluded can delete from disk
  • 25. Deployment • Add Config files to your Visual Studio project • Deployment files contain, for instance, connection strings to resources that are replaced at Publish time • Add deployment files for each environment you are deploying to, e.g., Dev, UAT, Prod • When publishing to Azure Data Factory choose appropriate Config file to ensure correct settings are applied • Publish only artefacts required
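The per-environment replacement described on this slide can be pictured with a short Python sketch. It is not the Visual Studio config file format; the dotted-path override scheme, the function name `apply_config` and the placeholder connection strings are all assumptions chosen to show the idea of swapping settings at publish time.

```python
import copy


def apply_config(artefact, overrides):
    """Return a copy of the artefact with each dotted path set to its override."""
    result = copy.deepcopy(artefact)
    for path, value in overrides.items():
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node[key]  # raises KeyError if the path does not exist
        node[leaf] = value
    return result


# A linked service artefact as it might sit in source control (placeholder values).
linked_service = {
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {"connectionString": "<dev connection string>"},
    },
}

# One set of overrides per environment, e.g. Dev, UAT, Prod.
configs = {
    "UAT": {"properties.typeProperties.connectionString": "<uat connection string>"},
    "Prod": {"properties.typeProperties.connectionString": "<prod connection string>"},
}

published = apply_config(linked_service, configs["Prod"])
print(published["properties"]["typeProperties"]["connectionString"])
```

Working on a deep copy keeps the source-controlled artefact unchanged, so the same JSON can be published to each environment with a different config.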
  • 27. Monitoring • Data slices may fail • Drill in to errors, diagnose, fix and rerun • Failed data slices can be rerun and all dependencies are managed by Azure Data Factory • Upstream slices that are Ready stay available • Downstream slices that are dependent stay Pending • Enable diagnostics to produce logs, disabled by default • Add Alerts for Failed or Successful Runs to receive email notification
  • 30. Pricing – Low frequency ( <= 1 / day ) • Cloud: first 5 activities/month free; 6 – 100 activities/month $0.60 per activity; >100 activities/month $0.48 per activity • On-Premises: first 5 activities/month free; 6 – 100 activities/month $1.50 per activity; >100 activities/month $1.20 per activity • * Pricing in USD correct at 4 December 2015
  • 31. Pricing – High frequency ( > 1 / day ) • Cloud: <= 100 activities/month $0.80 per activity; >100 activities/month $0.64 per activity • On-Premises: <= 100 activities/month $2.50 per activity; >100 activities/month $2.00 per activity • * Pricing in USD correct at 4 December 2015
  • 32. Pricing – Data movement • Cloud: $0.25 per hour • On-Premises: $0.10 per hour • Pricing – Inactive pipeline: $0.80/month • * Pricing in USD correct at 4 December 2015
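The activity pricing tiers can be turned into a small worked calculation. The Python sketch below assumes graduated pricing, meaning each activity is charged at the rate of the tier it falls into; the slides do not say whether the higher-volume rate instead applies to every activity, so treat that as an assumption. Prices are the December 2015 figures from the slides, held in cents to avoid floating-point drift.

```python
# (upper bound of tier, price per activity in US cents); None means no upper bound.
LOW_FREQUENCY = {
    "cloud": [(5, 0), (100, 60), (None, 48)],
    "on_premises": [(5, 0), (100, 150), (None, 120)],
}
HIGH_FREQUENCY = {
    "cloud": [(100, 80), (None, 64)],
    "on_premises": [(100, 250), (None, 200)],
}


def monthly_activity_cost(activities, tiers):
    """Monthly cost in USD, charging each activity at its own tier's rate (assumed)."""
    cents = 0
    lower = 0
    for upper, price in tiers:
        top = activities if upper is None else min(activities, upper)
        cents += max(top - lower, 0) * price
        if upper is None or activities <= upper:
            break
        lower = upper
    return cents / 100


# 120 low-frequency cloud activities: 5 free + 95 at $0.60 + 20 at $0.48.
print(monthly_activity_cost(120, LOW_FREQUENCY["cloud"]))
# 150 high-frequency on-premises activities: 100 at $2.50 + 50 at $2.00.
print(monthly_activity_cost(150, HIGH_FREQUENCY["on_premises"]))
```

Data movement hours and the inactive pipeline charge are billed separately and are not included in this calculation.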
  • 33. Summary • Use Azure Data Factory if: • Dealing with Big Data • Source or destination is in the cloud • Cut down environment cost • Cut down administration cost • Azure is on one side of the movement / transformation • Consider hybrid scenarios with other data management tools, for example SQL Server Integration Services
  • 34. More Information • Documentation portal • https://azure.microsoft.com/en-us/documentation/services/data-factory/ • Learning map • https://azure.microsoft.com/en-us/documentation/articles/data-factory-learning-map/ • Samples on GitHub • https://github.com/Azure/Azure-DataFactory