SSIS & HDInsight
Tillmann Eitelberg
Oliver Engels
Who we are…
Tillmann Eitelberg
• CTO of oh22information services GmbH
• PASS Regional Mentor Germany
• Vice-president PASS Germany
• Chapter Leader CologneBonn, Germany
• Microsoft MVP

Oliver Engels
• CEO of oh22data AG
• PASS Regional Mentor Germany
• President PASS Germany
• Chapter Leader Frankfurt, Germany
• Microsoft MVP
• Microsoft vTSP
Agenda
• Traditional ETL Process
• Challenges of Big Data and unstructured data
• Useful Apache Hadoop Components for ETL
• Some statements to be clarified...
• Using Apache Hadoop within the ETL process
• SSIS – not just a simple ETL Tool
• Tools to work with HDInsight
• Get started using Windows Azure HDInsight
• Use SQL Server Integration Services to …
Traditional ETL Process
• Extracts data from different sources
  • different source systems
  • different data organization and/or format
  • (non-)relational databases, flat files
• Transforms it to fit operational needs
  • Translating coded values
  • Encoding free-form values
  • Deriving a new calculated value
  • Aggregation
  • Data profiling, data quality
• Loads it into the end target
  • database, data mart, data warehouse
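
The three stages listed above can be sketched in a few lines of C#. This is only an illustration and not taken from the talk: the sample rows, the country-code lookup table and the in-memory dictionary that stands in for the data warehouse are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class EtlSketch
{
    // Hypothetical flat-file rows: "customer;countryCode;quantity;unitPrice"
    static readonly string[] SourceLines =
    {
        "Alice;1;2;10.00",
        "Bob;2;1;99.50",
        "Carol;1;5;3.20",
    };

    // Lookup table for translating coded values (assumed for this sketch)
    static readonly Dictionary<string, string> CountryByCode =
        new Dictionary<string, string> { { "1", "DE" }, { "2", "FR" } };

    // Extract: split each flat-file row into its fields
    public static IEnumerable<string[]> Extract(IEnumerable<string> lines)
    {
        return lines.Select(line => line.Split(';'));
    }

    // Transform: translate the coded value, derive revenue, aggregate per country
    public static Dictionary<string, decimal> Transform(IEnumerable<string[]> rows)
    {
        return rows
            .Select(f => new
            {
                Country = CountryByCode[f[1]],
                Revenue = int.Parse(f[2], CultureInfo.InvariantCulture)
                          * decimal.Parse(f[3], CultureInfo.InvariantCulture)
            })
            .GroupBy(r => r.Country)
            .ToDictionary(g => g.Key, g => g.Sum(r => r.Revenue));
    }

    // Load: write the aggregated rows into the target (here just a dictionary)
    public static void Load(Dictionary<string, decimal> rows,
                            IDictionary<string, decimal> target)
    {
        foreach (var row in rows) target[row.Key] = row.Value;
    }

    public static void Main()
    {
        var warehouse = new Dictionary<string, decimal>();
        Load(Transform(Extract(SourceLines)), warehouse);
        foreach (var row in warehouse)
            Console.WriteLine(row.Key + ": " +
                row.Value.ToString(CultureInfo.InvariantCulture));
    }
}
```

In SSIS the same three roles are played by the sources, transformations and destinations of a Data Flow rather than by hand-written methods.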
Traditional ETL Process
[Diagram: data from CRM, ERP and Web Site Traffic is extracted, transformed and loaded into a Data Warehouse, which serves OLAP Analysis, Data Mining and Reporting]
Traditional ETL Process
[Diagram: data from CRM, ERP and Web Site Traffic passes through repeated E-T-L steps into a DBMS, a Staging Area, a Data Warehouse and Data Marts, which serve OLAP Analysis, Data Mining and Reporting]
Traditional ETL Process (Microsoft Glasses)
• Control Flow
  • Implementing repeating workflows
  • Connecting containers and tasks into an ordered control flow by using precedence constraints
  • Controlling external processes
  • Loading meta objects and data containers
  • Preparing data files
Traditional ETL Process (Microsoft Glasses)
• Data Flow
  • Adding one or more sources to extract data from files and databases
  • Adding the transformations that meet the business requirements
  • Adding one or more destinations to load data into data stores such as files and databases
  • Configuring error outputs on components to handle problems
Microsoft Big Data Solution
Challenges of Big Data
• Large amounts of data from multiple sources
• Volumes reach terabytes, petabytes and exabytes
• Classic relational database systems, as well as statistical and visualization programs, often cannot handle such large amounts of data
• According to calculations from 2011, the global volume of data doubles every 2 years
Challenges of unstructured data
• Does not have a predefined data model or is not organized in a predefined manner
• Typically text-heavy, but may also contain data such as dates, numbers and facts
• Structure, while not formally defined, can still be implied
• Aggregates cannot be accessed by computer programs through a single interface
• Examples: emails, audio and video files without tags, and contributions in different media such as online forums or social media platforms
Objectives of Big Data
Real-time tweets visualized on a map
HDInsight/Hadoop Eco-System
Red = Core Hadoop
Blue = Data processing
Purple = Microsoft integration points and value adds
Orange = Data Movement
Green = Packages
Useful Apache Hadoop Components (for ETL)
Apache Flume
• Streams data from multiple sources into Hadoop for analysis
• A large-scale log aggregation framework
• Collects high-volume Web logs in real time
• Insulates against transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
• Guarantees data delivery
• Scales horizontally to handle additional data volume

Apache Sqoop
• Allows data imports from external datastores and enterprise data warehouses into Hadoop
• Parallelizes data transfer for fast performance and optimal system utilization
• Copies data quickly from external systems to Hadoop
• Makes data analysis more efficient
• Mitigates excessive loads on external systems
Useful Apache Hadoop Components (for ETL)
Apache Hive
• Data warehouse infrastructure built on top of Hadoop
• Supports analysis of large datasets stored in Hadoop's HDFS
• SQL-like language called HiveQL
• Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs

Apache Pig
• Platform for creating MapReduce programs
• Language is called Pig Latin
• Abstracts Java MapReduce jobs into something similar to SQL
• Can use User Defined Functions written in Java, Python, JavaScript, Ruby or Groovy
• Pig is commonly used for ETL
Useful Apache Hadoop Components (for ETL)
ODBC/JDBC Connectors
• Microsoft® Hive ODBC Driver
http://www.microsoft.com/en-us/download/details.aspx?id=40886
• Original: Apache Hive ODBC Driver provided by Simba
• Transforms an application's SQL query into the equivalent form in HiveQL
• Supports all major on-premise and cloud Hadoop / Hive distributions
• Supports data types: TinyInt, SmallInt, Int, BigInt, Float, Double, Boolean, String, Decimal and TimeStamp

Apache Storm
• Distributed real-time computation system for processing fast, large streams of data
• Processes one million 100-byte messages per second per node
• Scalable, with parallel calculations that run across a cluster of machines
• Fault-tolerant: when workers die, Storm automatically restarts them. If a node dies, the worker is restarted on another node
• Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once
Some statements to be clarified...
• Hadoop will steal work from ETL solutions
• ETL is running faster on Hadoop
• Hadoop is not a data integration tool
• Hadoop is a batch processing system and Hadoop jobs tend to have high latency
• Data integration solutions do not run natively in Hadoop
• Elephants do not live isolated
• Hadoop is not a solution for data quality (and other specialized transformations)
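Using Apache Hadoop within the ETL process
[Diagram: the traditional ETL architecture (CRM, ERP and Web Site Traffic sources; E-T-L steps; DBMS, Staging Area, Data Warehouse and Data Marts; OLAP Analysis, Data Mining and Reporting) extended with Social Media and Sensor Logs as additional sources, Sqoop, Flume and Storm for data movement, Hive and Pig for processing, ODBC, JDBC and Sqoop as connectors, and Data Science as an additional consumer]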
SSIS – not just a simple ETL Tool
Use SQL Server Integration Services to…
• build complex workflows
• manage Windows Azure and HDInsight clusters
• load data into HDInsight/HDFS
• control jobs on HDInsight
• get data from Hive, Pig, …
• combine Hadoop with "traditional" ETL
Tools to work with HDInsight
• SSIS Tasks for HDInsight
http://www.youtube.com/watch?v=2Aj9_w3y9Xo&feature=player_embedded&list=PLoGAcXKPcRvbTr23ujEN953pLP_nDyZJC#t=2184
  • Announced at PASS Summit 2013
  • Experimental release on Codeplex
  • No timeline yet
Tools to work with HDInsight
• Azure Storage Explorer

http://azurestorageexplorer.codeplex.com/

• CloudBerry Explorer for Azure Cloud Storage
http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx
• Cerebrata Azure Management Studio
http://www.cerebrata.com/

• Red Gate HDFS Explorer (beta)
http://bigdata.red-gate.com/
Tools to work with HDInsight
• Microsoft .NET SDK For Hadoop (NuGet packages)
  • Windows Azure HDInsight
    Provides a .NET API for cluster management and job submission on Windows Azure HDInsight service.
  • Microsoft .NET Map Reduce API For Hadoop
    Provides a .NET API for the Map/Reduce functionality of Hadoop Streaming.
  • Microsoft .NET API For Hadoop WebClient
    Provides a .NET API for WebClient
  • Microsoft .NET API for Hadoop
    Provides a .NET API for working with Hadoop clusters over HTTP
Tools to work with HDInsight
• Some APIs require .NET 4.5
• By default, SSIS 2012 uses .NET 4.0
• Use SSDT 2012 BI Edition (or higher) to work with .NET 4.5 in Script Tasks and Script Components
Tools to work with HDInsight
• NuGet Package Manager is not fully compatible with the SQL Server Integration Services Script Task
• NuGet packages (assemblies) must be installed in the global assembly cache
gacutil -i <assembly.dll>
• NuGet packages/assemblies must be installed on all servers that run the packages
• All assemblies need a strong name
Tools to work with HDInsight
• Adding a strong name to an existing assembly
sn -k keyPair.snk
ildasm AssemblyName.dll /out:AssemblyName.il
ilasm AssemblyName.il /dll /key=keyPair.snk
Get started using Windows Azure HDInsight
• Create a Storage Account
  • Define name/URL of the storage account
  • Define location/affinity group; best setting currently "North Europe"
  • Set replication; to keep costs down, use "Locally Redundant"
• Create a container in the newly created storage account
• Manage Access Keys
  • Get Storage Account Name
  • Get Primary Access Key
Get started using Windows Azure HDInsight
• Create a Certificate
makecert -sky exchange -r -n "CN=SQLKonferenz" -pe -a sha1 -len 2048 -ss My "SQLKonferenz.cer"
• Upload Certificate to Windows Azure
  • Get SubscriptionId
  • Get Thumbprint
Demo
Get started using Windows Azure HDInsight
Manage Your HDInsight Cluster
• Create a container in your Windows Azure Storage account
• Create HDInsight Cluster
  • Storage Container
  • Authentication (Username/Password)
  • Cluster Size
• Delete HDInsight Cluster
  • (Delete corresponding container)
Manage Your HDInsight Cluster
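using System;
using System.Linq;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Management.HDInsight;

// Get the certificate object from the certificate store using its thumbprint
// (new X509Store() opens the current user's personal "My" store)
var store = new X509Store();
store.Open(OpenFlags.ReadOnly);
var cert = store.Certificates.Cast<X509Certificate2>().First(
    // Compare case-insensitively: copied thumbprints often differ in letter case
    item => string.Equals(item.Thumbprint, thumbprint, StringComparison.OrdinalIgnoreCase)
);
store.Close();

// Create the HDInsightClient object using the factory method
var creds = new HDInsightCertificateCredential(new Guid(subscriptionId), cert);
var client = HDInsightClient.Connect(creds);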
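
With the client created above, the create and delete steps from the previous slide might look roughly like the following. This is a sketch under assumptions, not code from the talk: the cluster name, storage account, container and credentials are placeholders, and the type and property names (`ClusterCreateParameters`, `ClusterSizeInNodes`, `CreateCluster`, `DeleteCluster`) follow the HDInsight management SDK of that period and should be checked against the package version you install.

```csharp
using System;
using Microsoft.WindowsAzure.Management.HDInsight;

public static class ClusterLifecycleSketch
{
    // 'client' is the object returned by HDInsightClient.Connect(creds)
    public static void CreateAndDelete(IHDInsightClient client)
    {
        // All values below are placeholders for illustration only
        var clusterInfo = new ClusterCreateParameters
        {
            Name = "mycluster",
            Location = "North Europe",
            DefaultStorageAccountName = "mystorage.blob.core.windows.net",
            DefaultStorageAccountKey = "<primary access key>",
            DefaultStorageContainer = "mycontainer",
            UserName = "admin",
            Password = "<password>",
            ClusterSizeInNodes = 4
        };

        // Provisioning blocks until the cluster is ready, which can take a while
        ClusterDetails cluster = client.CreateCluster(clusterInfo);
        Console.WriteLine("Created cluster " + cluster.Name);

        // Delete the cluster again once the jobs are done; the storage
        // container is kept and has to be removed separately if desired
        client.DeleteCluster(clusterInfo.Name);
    }
}
```

Because the data lives in the storage account rather than in the cluster, a package can create the cluster, run its jobs and delete it again without losing results.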
Demo
Upload data to HDInsight
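using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;

// Connect to the storage account that backs the cluster (true = use HTTPS)
var storageCredentials = new StorageCredentials(
    defaultStorageAccountName,
    defaultStorageAccountKey
);
var storageAccount = new CloudStorageAccount(storageCredentials, true);
var cloudBlobClient = storageAccount.CreateCloudBlobClient();
var cloudBlobContainer = cloudBlobClient.GetContainerReference(defaultStorageCont);

// The blob name needs the file name as well as the folder path
var blockBlob = cloudBlobContainer.GetBlockBlobReference(
    "example/data/gutenberg/" + Path.GetFileName(filename)
);
using (var fileStream = File.OpenRead(filename))
{
    blockBlob.UploadFromStream(fileStream);
}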
Upload data to HDInsight
• About 300 MB uploads in roughly 45 seconds
• Measured from an Azure VM in the same region
Run a MapReduce Program
using System;
using Microsoft.Hadoop.Client;

// Create the job submission client object
var creds = new JobSubmissionCertificateCredential(
    new Guid(subscriptionId),
    cert,
    clusterName);
var jobClient = JobSubmissionClientFactory.Connect(creds);

// Create the job object that captures the details of the job
var mrJobDefinition = new MapReduceJobCreateParameters()
{
    JarFile = "wasb:///example/jars/hadoop-examples.jar",
    ClassName = "wordcount"
};
// First argument: input file, second argument: output folder
mrJobDefinition.Arguments.Add("wasb:///example/data/gutenberg/davinci.txt");
mrJobDefinition.Arguments.Add("wasb:///example/data/WordCountOutput");

// Submit the job to the cluster
var jobResults = jobClient.CreateMapReduceJob(mrJobDefinition);
Demo
Run a Hive Query
• Hive Query via .NET Hadoop SDK
• Download result from Hive query
• Load result from Hive query directly in the data flow
• Microsoft® Hive ODBC Driver (available for x86 and x64)
http://www.microsoft.com/en-us/download/confirmation.aspx?id=40886
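
Once the Hive ODBC driver is installed, a Hive result set can be read from .NET through the standard `System.Data.Odbc` classes, for example inside an SSIS Script Component. The following is a minimal sketch, not code from the talk. It assumes a DSN named `HiveDsn` (hypothetical) that points at the cluster, placeholder credentials, and the `hivesampletable` sample table with a `devicemake` column; adjust all of these to your environment.

```csharp
using System;
using System.Data.Odbc;

public static class HiveOdbcSketch
{
    // Builds the ODBC connection string; DSN name and credentials are placeholders
    public static string BuildConnectionString(string dsn, string user, string password)
    {
        return "DSN=" + dsn + ";UID=" + user + ";PWD=" + password;
    }

    public static void Main()
    {
        // Hypothetical DSN configured in the ODBC Data Source Administrator
        string connectionString = BuildConnectionString("HiveDsn", "admin", "<password>");

        // The driver translates this SQL into HiveQL; table and column are assumed
        const string query =
            "SELECT devicemake, COUNT(*) AS cnt FROM hivesampletable GROUP BY devicemake";

        using (var connection = new OdbcConnection(connectionString))
        using (var command = new OdbcCommand(query, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                // Each row could instead be added to a Script Component output buffer
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetValue(0), reader.GetValue(1));
            }
        }
    }
}
```

The DSN must be created with the driver that matches the bitness of the process running the package, which is why the x86 and x64 downloads both matter.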
Demo
Complete HDInsight Package
Thank you!
Tillmann Eitelberg
t.eitelberg@oh22.net
Oliver Engels
o.engels@oh22.net
SQL Server Konferenz 2014 - SSIS & HDInsight

Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

SQL Server Konferenz 2014 - SSIS & HDInsight

  • 1. SSIS & HDInsight Tillmann Eitelberg Oliver Engels
  • 2. Who we are… Tillmann Eitelberg Oliver Engels • CTO of oh22information services GmbH • CEO of oh22data AG • PASS Regional Mentor Germany • PASS Regional Mentor Germany • Vice-president PASS Germany • President PASS Germany • Chapter Leader CologneBonn, Germany • Chapter Leader Frankfurt, Germany • Microsoft MVP • Microsoft MVP • Microsoft vTSP
  • 3. Agenda • Traditional ETL Process • Challenges of Big Data and unstructured data • Useful Apache Hadoop Components for ETL • Some statements to be clarified... • Using Apache Hadoop within the ETL process • SSIS – not just a simple ETL Tool • Tools to work with HDInsight • Get started using Windows Azure HDInsight • Use SQL Server Integration Services to …
  • 4. Traditional ETL Process • Extract data from different sources • different source systems • different data organization and/or format • (non-)relational databases, flat files • Transform it to fit operational needs • Translating coded values • Encoding free-form values • Deriving a new calculated value • Aggregation • data profiling, data quality • Load it into the end target • database, data mart, data warehouse
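The three stages on this slide can be sketched in a few lines of C#. This is only an illustration under stated assumptions: the `OrderRow` type, the country lookup table and the sample rows are hypothetical and do not come from the deck, and in-memory lists stand in for the real sources and targets.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical source row; the field names are illustrative only.
public sealed class OrderRow
{
    public string CountryCode;   // coded value, e.g. "DE"
    public int Quantity;
    public decimal UnitPrice;
}

public static class EtlSketch
{
    // Hypothetical lookup table for translating coded values.
    private static readonly Dictionary<string, string> CountryNames =
        new Dictionary<string, string>
        {
            { "DE", "Germany" },
            { "US", "United States" }
        };

    // Translate a coded value; unknown codes map to "Unknown".
    public static string TranslateCountry(string code)
    {
        string name;
        return CountryNames.TryGetValue(code, out name) ? name : "Unknown";
    }

    // Transform: translate the code, derive a line total, aggregate per country.
    public static Dictionary<string, decimal> Transform(IEnumerable<OrderRow> rows)
    {
        return rows
            .Select(r => new
            {
                Country = TranslateCountry(r.CountryCode),
                LineTotal = r.Quantity * r.UnitPrice   // derived value
            })
            .GroupBy(x => x.Country)
            .ToDictionary(g => g.Key, g => g.Sum(x => x.LineTotal));
    }
}

public static class Program
{
    public static void Main()
    {
        // Extract: stands in for reading from a database or flat file.
        var extracted = new List<OrderRow>
        {
            new OrderRow { CountryCode = "DE", Quantity = 2, UnitPrice = 10m },
            new OrderRow { CountryCode = "US", Quantity = 1, UnitPrice = 5m }
        };

        // Load: stands in for writing to a data warehouse table.
        foreach (var entry in EtlSketch.Transform(extracted))
        {
            Console.WriteLine(entry.Key + ": " + entry.Value);
        }
    }
}
```

In SSIS the same steps would typically be a source component, Lookup, Derived Column and Aggregate transformations, and a destination component in the data flow.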
  • 5. Traditional ETL Process [Diagram: sources CRM, ERP and Web Site Traffic; Extract, Transform, Load into the Data Warehouse; consumers OLAP Analysis, Data Mining and Reporting]
  • 6. Traditional ETL Process [Diagram: sources CRM, ERP and Web Site Traffic; ETL steps between Staging Area (DBMS), Data Warehouse and Data Marts; consumers OLAP Analysis, Data Mining and Reporting]
  • 7. Traditional ETL Process (Microsoft Glasses) • Control Flow • implement repeating workflows • Connecting containers and tasks into an ordered control flow by using precedence constraints • controlling external processes • load meta objects and data container • prepare data files
  • 8. Traditional ETL Process (Microsoft Glasses) • Data Flow • Adding one or more sources to extract data from files and databases • Adding the transformations that meet the business requirements • Adding one or more destinations to load data into data stores such as files and databases • Configuring error outputs on components to handle problems
  • 10. Challenges of Big Data • large amounts of data from multiple sources • data volumes reach into the terabytes, petabytes and exabytes • Classic relational database systems as well as statistical and visualization programs are often not able to handle such large amounts of data • according to calculations from the year 2011, the global volume of data doubles every 2 years
  • 11. Challenges of unstructured data • does not have a pre-defined data model or is not organized in a predefined manner • typically text-heavy, but may contain data such as dates, numbers, and facts as well • structure, while not formally defined, can still be implied • aggregates cannot be accessed by computer programs through a single interface • Examples: emails, audio and video files without tags, and contributions in different media such as online forums or social-media platforms
  • 13. Objectives of Big Data • Real-time tweets visualized on a map
  • 14. HDInsight/Hadoop Eco-System [Diagram legend: Red = Core Hadoop • Blue = Data processing • Purple = Microsoft integration points and value adds • Orange = Data Movement • Green = Packages]
  • 15. Useful Apache Hadoop Components (for ETL)
      Apache Flume • Stream data from multiple sources into Hadoop for analysis • a large scale log aggregation framework • Collect high-volume Web logs in real time • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination • Guarantee data delivery • Scale horizontally to handle additional data volume
      Apache Sqoop • Allows data imports from external datastores and enterprise data warehouses into Hadoop • Parallelizes data transfer for fast performance and optimal system utilization • Copies data quickly from external systems to Hadoop • Makes data analysis more efficient • Mitigates excessive loads to external systems
  • 16. Useful Apache Hadoop Components (for ETL)
      Apache Hive • data warehouse infrastructure built on top of Hadoop • supports analysis of large datasets stored in Hadoop's HDFS • SQL-like language called HiveQL • Internally a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs
      Apache Pig • Platform for creating MapReduce programs • Language is called Pig Latin • abstracts Java MapReduce Job to something similar to SQL • Can use User Defined Functions written in Java, Python, JavaScript, Ruby or Groovy • Pig uses ETL
  • 17. Useful Apache Hadoop Components (for ETL)
      ODBC/JDBC Connectors • Microsoft® Hive ODBC Driver http://www.microsoft.com/en-us/download/details.aspx?id=40886 • Original: Apache Hive ODBC Driver provided by Simba • transforms an application’s SQL query into the equivalent form in HiveQL • Supports all major on-premise and cloud Hadoop / Hive distributions • Supports data types: TinyInt, SmallInt, Int, BigInt, Float, Double, Boolean, String, Decimal and TimeStamp
      Apache Storm • distributed real-time computation system for processing fast, large streams of data • processing one million 100 byte messages per second per node • Scalable with parallel calculations that run across a cluster of machines • Fault-tolerant – when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node • Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once.
  • 18. Some statements to be clarified... • Hadoop will steal work from ETL solutions • ETL is running faster on Hadoop • Hadoop is not a data integration tool • Hadoop is a batch processing system and Hadoop jobs tend to have high latency • Data integration solutions do not run natively in Hadoop • Elephants do not live isolated • Hadoop is not a solution for data quality (and other specialized Transformations)
  • 19. Using Apache Hadoop within the ETL process [Diagram: the traditional ETL pipeline from slide 6 (CRM, ERP, Web Site Traffic; Staging Area, Data Warehouse, Data Marts; OLAP Analysis, Data Mining, Reporting) extended with Social Media and Sensor Logs as sources, Sqoop, Flume, Storm, Hive and Pig as Hadoop components, ODBC, JDBC and Sqoop as connections, and Data Science as a consumer]
  • 20. SSIS – not just a simple ETL Tool
  • 21. Use SQL Server Integration Services to… • build complex workflows • manage Windows Azure and HDInsight clusters • load data into HDInsight/HDFS • control jobs on HDInsight • get data from Hive, Pig, … • combine Hadoop with "traditional" ETL
  • 22. Tools to work with HDInsight • SSIS Tasks for HDInsight http://www.youtube.com/watch?v=2Aj9_w3y9Xo&feature=player_embedded&list=PLoGAcXKPcRvbTr23ujEN953pLP_nDyZJC#t=2184 • Announced at PASS Summit 2013 • Experimental Release on Codeplex • No timeline yet
  • 23. Tools to work with HDInsight
  • 24. Tools to work with HDInsight
  • 25. Tools to work with HDInsight • Azure Storage Explorer http://azurestorageexplorer.codeplex.com/ • CloudBerry Explorer for Azure Cloud Storage http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx • Cerebrata Azure Management Studio http://www.cerebrata.com/ • Red Gate HDFS Explorer (beta) http://bigdata.red-gate.com/
  • 26. Tools to work with HDInsight • Microsoft .NET SDK For Hadoop (NuGet packages) • Windows Azure HDInsight: provides a .NET API for cluster management and job submission on Windows Azure HDInsight service. • Microsoft .NET Map Reduce API For Hadoop: provides a .NET API for the Map/Reduce functionality of Hadoop Streaming. • Microsoft .NET API For Hadoop WebClient: provides a .NET API for WebClient. • Microsoft .NET API for Hadoop: provides a .NET API for working with Hadoop clusters over HTTP.
  • 27. Tools to work with HDInsight • some APIs require .NET 4.5 • By default SSIS 2012 uses .NET 4.0 • Use SSDT 2012 BI Edition (or higher) to work with .NET 4.5 in scripting tasks and components
  • 28. Tools to work with HDInsight • NuGet Package Manager is not fully compatible with the SQL Server Integration Services Script Task • NuGet packages (assemblies) must be installed in the global assembly cache: gacutil -i <assembly.dll> • NuGet packages/assemblies must be installed on all servers that are running the packages • all assemblies need a strong name
  • 29. Tools to work with HDInsight • Adding a Strong Name to an existing Assembly
      sn -k keyPair.snk
      ildasm AssemblyName.dll /out:AssemblyName.il
      ilasm AssemblyName.il /dll /key=keyPair.snk
  • 30. Get started using Windows Azure HDInsight • Create a Storage Account • Define Name/URL of the storage account • Define location/affinity group, best setting currently „North Europe“ • Set replication, to avoid costs use „Locally Redundant“ • Create a container in the newly created storage account • Manage Access Keys • Get Storage Account Name • Get Primary Access Key
  • 31. Get started using Windows Azure HDInsight • Create a Certificate
      makecert -sky exchange -r -n "CN=SQLKonferenz" -pe -a sha1 -len 2048 -ss My "SQLKonferenz.cer"
    • Upload Certificate to Windows Azure • Get SubscriptionId • Get Thumbprint
  • 32. Get started using Windows Azure HDInsight
  • 33. Demo Get started using Windows Azure HDInsight
  • 34. Manage Your HDInsight Cluster • Create a container in your Windows Azure Storage account • Create HDInsight Cluster • Storage Container • Authentication (Username/Password) • Cluster Size • Delete HDInsight Cluster • (Delete corresponding container)
  • 35. Manage Your HDInsight Cluster
      // Get the certificate object from certificate store using thumbprint
      var store = new X509Store();
      store.Open(OpenFlags.ReadOnly);
      var cert = store.Certificates.Cast<X509Certificate2>().First(
          item => item.Thumbprint == thumbprint);

      // Create HDInsightClient object using factory method
      var creds = new HDInsightCertificateCredential(
          new Guid(subscriptionId), cert);
      var client = HDInsightClient.Connect(creds);
  • 36. Demo
  • 37. Upload data to HDInsight
      var storageCredentials = new StorageCredentials(
          defaultStorageAccountName, defaultStorageAccountKey);
      var storageAccount = new CloudStorageAccount(storageCredentials, true);
      var cloudBlobClient = storageAccount.CreateCloudBlobClient();
      var cloudBlobContainer =
          cloudBlobClient.GetContainerReference(defaultStorageCont);

      // Blob name = target folder plus the local file name
      // (the slide shows only the folder prefix; appending the file name is an assumption)
      var blockBlob = cloudBlobContainer.GetBlockBlobReference(
          @"example/data/gutenberg/" + System.IO.Path.GetFileName(filename));

      using (var fileStream = System.IO.File.OpenRead(filename))
      {
          blockBlob.UploadFromStream(fileStream);
      }
  • 38. Upload data to HDInsight • ~300 MB in about 45 seconds (roughly 6.7 MB/s) • measured from an Azure VM in the same region
  • 39. Run a MapReduce Program
      // Create Job Submission Client object
      var creds = new JobSubmissionCertificateCredential(
          new Guid(subscriptionId), cert, clusterName);
      var jobClient = JobSubmissionClientFactory.Connect(creds);

      // Create job object that captures details of the job
      var mrJobDefinition = new MapReduceJobCreateParameters()
      {
          JarFile = "wasb:///example/jars/hadoop-examples.jar",
          ClassName = "wordcount"
      };
      mrJobDefinition.Arguments.Add("wasb:///example/data/gutenberg/davinci.txt");
      mrJobDefinition.Arguments.Add("wasb:///example/data/WordCountOutput");

      // Submit job to the cluster
      var jobResults = jobClient.CreateMapReduceJob(mrJobDefinition);
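For readers unfamiliar with what a word-count job computes, the following is a minimal in-memory sketch of the map, shuffle and reduce steps in plain C#. It is not the implementation inside hadoop-examples.jar and does not use the HDInsight SDK; the class and method names (`WordCountSketch`, `Map`, `Reduce`, `Run`) are hypothetical, and splitting on spaces and tabs is an assumption about tokenization.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountSketch
{
    // Map phase: emit a (word, 1) pair for every word in one input line.
    public static IEnumerable<KeyValuePair<string, int>> Map(string line)
    {
        var words = line.Split(new[] { ' ', '\t' },
                               StringSplitOptions.RemoveEmptyEntries);
        return words.Select(w => new KeyValuePair<string, int>(w, 1));
    }

    // Reduce phase: sum all counts that were emitted for the same word.
    public static int Reduce(string word, IEnumerable<int> counts)
    {
        return counts.Sum();
    }

    // Stand-in for the framework: map every line, group the pairs by key
    // (the "shuffle"), then reduce each group to a single count.
    public static Dictionary<string, int> Run(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => Map(line))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .ToDictionary(g => g.Key, g => Reduce(g.Key, g));
    }
}

public static class Program
{
    public static void Main()
    {
        var counts = WordCountSketch.Run(new[] { "to be or", "not to be" });
        foreach (var entry in counts)
        {
            Console.WriteLine(entry.Key + "\t" + entry.Value);
        }
    }
}
```

On a cluster, the map and reduce functions run in parallel on different nodes and the grouping step moves data between them; the sketch only mirrors the data flow, not the distribution.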
  • 40. Demo
  • 41. Run a Hive Query • Hive Query via .NET Hadoop SDK • Download result from Hive query • Load result from Hive query direct in the data flow • Microsoft® Hive ODBC Driver http://www.microsoft.com/en-us/download/confirmation.aspx?id=40886 (available for x86 and x64)
  • 42. Demo