Big Data in Azure: Demo and Hands-On Labs
Pre-requisites
You will need an Azure subscription with available HDInsight cores.
Power BI / Excel 2013
o Download the Power Query add-in; choose 32-bit or 64-bit to match your Office installation:
http://www.microsoft.com/en-us/download/details.aspx?id=39379&CorrelationId=d8002172-0438-4ef5-b0fa-e635f8f17251
o Enable Power Pivot and Power View in your Excel options (COM add-ins).
Download the HOL labs from https://github.com/Azure-Readiness/CloudDataCamp. For April 30 only, use
https://github.com/cindygross/CloudDataCamp instead. If you already have GitHub installed, choose “Clone
in Desktop”. Otherwise choose “Download ZIP” and unzip the files. Save the location to a Notepad file.
Data movement (one or both)
o GUI: Install CloudXplorer from http://clumsyleaf.com/products/downloads. I will be using v3; you can
download the v3 trial or the free v1 (with fewer features).
o Cmd line: Install AzCopy from http://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/.
Save the install location as you will need it later; it defaults to
C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy (without the “(x86)” on a 32-bit OS).
Install SQL 2014 SSMS: http://www.microsoft.com/en-gb/download/details.aspx?id=42299
Today’s slides: http://tinyurl.com/lxutdd4
Goal
Understand how to use some of the common pieces of an Azure-hosted Big Data and Analytics solution. These
components are often part of an Internet of Things solution, which is a common Big Data and Analytics scenario.
At the end of this hands-on lab you will have:
o Created an Azure storage account and container, then loaded data to it. You will also use this account for
storage of data generated in other steps.
o Created a Hadoop on Azure instance (HDInsight), added structure (tables) stored in HCatalog, and queried
the data on the storage account using Hive.
o Connected an AzureML experiment to Hive. Hadoop is “just another data source”.
o Created and run an Azure Stream Analytics job that reads data generated on the fly from your laptop via a
Service Bus Event Hub and outputs aggregated data to a SQL Azure database.
o Used Power BI to visualize and present the data.
Labs
We’re going to use a modified version of the CloudDataCamp hands-on labs. Those labs have screenshots and more
detailed instructions than what I have below; please refer to the original docs if you need more detailed steps.
Guidelines
Many names within Azure have to be globally unique; try prefixing services with your initials or company name.
Some service names must be all lowercase, so it’s easier to make all names lowercase. For this lab, prefix all names
with the same identifier. Open Notepad and type in the prefix you will use.
Let’s pick a single data center and use it for all our work (though some services are not yet available in all
regions). For Montreal let’s choose East US. Note that this is NOT the same as East US 2.
I suggest you start a single file in a simple editor like Notepad and keep all the links, names, and passwords/keys
we use in that central location for the duration of the labs.
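The naming guideline can be sketched in a few lines of Python. This is only an illustration, not part of the lab: the function name lab_names is hypothetical, the suffixes are the ones this handout uses (storage, rg, sql, eh, eh-ns, stream, hdi), and the only Azure rule it checks is that a storage account name is 3 to 24 lowercase letters and digits.

```python
import re

# Suffixes follow the "prefix + something" pattern used throughout this lab.
SUFFIXES = {
    "storage_account": "storage",
    "resource_group": "rg",
    "sql_server": "sql",
    "event_hub": "eh",
    "event_hub_namespace": "eh-ns",
    "stream_job": "stream",
    "hdinsight_cluster": "hdi",
}


def lab_names(prefix):
    """Return a dict of lowercase resource names built from one shared prefix."""
    prefix = prefix.strip().lower()
    if not re.fullmatch(r"[a-z][a-z0-9]*", prefix):
        raise ValueError("use letters and digits only, starting with a letter")
    names = {kind: prefix + suffix for kind, suffix in SUFFIXES.items()}
    # Storage account names must be 3-24 characters, lowercase letters and digits.
    if not re.fullmatch(r"[a-z0-9]{3,24}", names["storage_account"]):
        raise ValueError("storage account name too long: shorten the prefix")
    return names


if __name__ == "__main__":
    # Print the names once and paste them into your Notepad file.
    for kind, name in lab_names("BDDragon").items():
        print(f"{kind}: {name}")
```

Generating every name from one prefix up front makes it harder to mistype a name in a later step.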
HOL1: Intro to the Azure Portal
The detailed lab file is in the CloudDataCamp download under docs, or you can get it here:
https://github.com/Azure-Readiness/CloudDataCamp/blob/master/HOL/HOL1-IntroductionToAzure.md
In Lab 1 we’ll create a storage account and load data with AzCopy and/or CloudXplorer. Then we’ll create a SQL
Database, open the firewall to our client machine, and create some SQL tables for structured data. Next we’ll generate
some loosely structured data, simulating a “thing” or device that generates small chunks of data.
Portals
Production management portal: https://manage.windowsazure.com/ - log in and choose subscription
Preview portal: https://portal.azure.com/ - log in and choose subscription
Storage Account (creation takes 2-3 minutes)
In the Preview portal https://portal.azure.com/ (resource groupings are not available in the management portal)
choose to create a new storage account: New -> Data + Storage -> Storage.
o Name: Your prefix + storage. Mine is bddragonstorage.
o Pricing: Locally Redundant. <select>
o Resource Group: New -> Your prefix + rg. Mine is bddragonrg.
o Subscription: use one subscription for all steps!
o Location: East US
o Diagnostics: Not configured
o Pin to Startboard: Yes
o <Create>
Still in the preview portal, add a container to the storage account.
o Name: data (this name is required due to the way the lab is set up)
o Access type: Private
Click on Settings -> Keys in the storage account and copy the name and primary key to your Notepad file.
Ingest data
Either AzCopy
Open a command prompt and change directories:
cd C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy (without the “(x86)” on a 32-bit OS)
Use your actual local directory, storage account name, and storage account key:
azcopy /Source:"{yourpath}\CloudDataCamp\data" /Dest:https://[storage account name].blob.core.windows.net/data/input /DestKey:[storage account key] /S
If you installed CloudXplorer you can add the storage account and key via the “accounts” button, then view the
files there.
Note that you can also drag/drop small files from your local File Explorer to CloudXplorer, but AzCopy is better
for larger files or automated processes.
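For the automated case, the command can be assembled from the values in your Notepad file. The sketch below only builds and prints the argument list; it assumes the legacy AzCopy switches used in this lab (/Source, /Dest, /DestKey, /S), the helper name azcopy_upload_args is hypothetical, and it has not been run against a real storage account.

```python
def azcopy_upload_args(local_dir, account, key, container="data", folder="input"):
    """Build the AzCopy argument list for uploading a local folder to blob storage."""
    dest = f"https://{account}.blob.core.windows.net/{container}/{folder}"
    return [
        "AzCopy",
        f"/Source:{local_dir}",
        f"/Dest:{dest}",
        f"/DestKey:{key}",
        "/S",  # recurse into subfolders of the source directory
    ]


if __name__ == "__main__":
    # Placeholder values: substitute your own path, account name, and key.
    args = azcopy_upload_args(r"C:\labs\CloudDataCamp\data", "bddragonstorage", "<key>")
    print(" ".join(args))
```

The container and folder default to data and input because the later Hive scripts expect the files under data/input.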
Or CloudXplorer
Add your storage account.
Choose to add a “folder” called input to the data container.
Drag the file from {yourpath}\CloudDataCamp\data to the input “directory” under the data container on your
account.
Extra Credit
Try both AzCopy and CloudXplorer.
Load the data from Bill’s talk yesterday to a DIFFERENT FOLDER. Create tables to refer to them. Query the
tables. Since Hive points to directories and not to single files, each type of data must be in its own folder!
Azure SQL DB
Create a new SQL database
In the preview portal https://portal.azure.com/ choose New -> Data + Storage -> SQL Database.
Name: cdcasa (this is unique within your server and is hardcoded for the demo)
Server: “Create a new server”
o Name: Your prefix + sql. Mine is bddragonsql.
o Server Admin Login: Something you will remember; put it in your Notepad file.
o Password: Something you will remember; put it in your Notepad file. If you are going to use the same
password for other services, make it 10+ characters with upper/lower case, a number, and a special character.
o Location: same as the rest (East US for Montreal)
o Allow Azure Services to Access Server: Yes, check the box! (Very important!)
o OK
Select Source: Blank Database
Pricing Tier: Standard (cheapest is fine for the demo)
Optional Configuration: leave at defaults
Resource Group: the one we created above
Subscription: the same one we’ve been using
Choose to add it to the Startboard.
<Create> (wait 3-4 minutes)
Configure the firewall
Open the non-preview management portal https://manage.windowsazure.com/.
Click on SQL Databases in the left pane.
Highlight cdcasa and then choose Servers from the upper menu (not the database, the server).
Click on the server you created earlier (bddragonsql is mine) and go to Configure.
Where it says “Current Client IP Address” choose “add to the allowed IP addresses”.
Double-check that “Windows Azure Services” is set to Yes.
Choose Save in the bottom bar.
Create SQL schemas for ASA
Open SQL Server Management Studio (SSMS). Note that this can optionally be done from Visual Studio 2013
with Update 4 or later.
o Server Type: Database Engine
o Server Name: {yourSQLserver}.database.windows.net. For example, mine is
bddragonsql.database.windows.net.
o Authentication: SQL Server Authentication (note: in the real world never log in with your sysadmin
account for dbo activities)
Login: the one you created earlier
Password: the one you created earlier
Choose the cdcasa database from the left menu (Object Explorer).
Ctrl-O to open 1_CreateSQLTable.sql from C:\{yourdirectory}\CloudDataCamp\scripts\ASA.
Verify you are in the cdcasa database (there’s a dropdown box over Object Explorer).
Hit F5 or the Execute button to run it.
Note: The table will be populated later by ASA.
Create Event Hub for Data Ingestion
Open the non-preview management portal https://manage.windowsazure.com/.
Click on Service Bus in the left menu.
Choose New -> App Services -> Service Bus -> Event Hub -> Custom Create.
o Event Hub Name: Your prefix + eh. Mine is bddragoneh.
o Region: The same one we’ve been using
o Namespace: Create a new namespace
o Namespace Name: Your prefix + eh + -ns (it will default to this)
o Choose next using the arrow on the bottom right
o Partition Count: 8
o Message Retention: 2
o Choose the checkmark to finish
Configure shared access
o Click on the new Service Bus namespace
o Choose Event Hubs from the top menu
o Click on the Event Hub
o Choose Configure from the top menu
o In the “shared access policies” section add a policy
Name: mypolicy
Permissions: send, listen
Choose Save at the bottom
o Copy the policy name and its primary key to your Notepad file.
Generate Data (Device Sender)
Open a command prompt.
cd {yourdirectory}\CloudDataCamp\tools\DeviceSender
Substitute your actual values in the command below:
DeviceSender GenerateDataToEventHub -n <eventHubNamespace> -e <eventHubName> -p <policyName> -k <policyKey>
Paste the edited command into the command prompt and hit Enter to execute it. You should see a series of
“Messages fired onto the eventhub!” messages indicating data is being sent from your machine to Azure.
Do NOT close the window. This data will be used later.
HOL9: Azure Stream Analytics
Create Streaming Job
Open http://manage.windowsazure.com
Click on New -> Data Services -> Stream Analytics -> Quick Create
o Job Name: prefix + stream
o Region: (East US isn’t available yet; use East US 2)
o Regional Monitoring Storage Account: Create new
o New Storage Account Name: prefix + streammonitor
Configure Streaming Job
Inputs
Click on the job you just created, choose Inputs from the top ribbon, and click “Add Input”.
Choose “Data stream” then “Event Hub”.
Event Hub Settings:
o Input Alias: MyEventHubStream (must be exactly this)
o Subscription: Current
o Namespace: The one you created in the Event Hub step (prefix + eh + -ns)
o Event Hub Name: The one you created
o Policy: mypolicy
o Consumer Group: $Default
Serialization settings
o Format: JSON
o Encoding: UTF8
Output
In the streaming job, choose Outputs from the upper ribbon and “Add Output”.
Choose SQL Database.
SQL Database Settings
o Output alias: output
o Subscription: Current
o SQL Database: cdcasa
o Server Name: the one you created earlier, prefix + sql
o Username/Password: The SQL admin account you created
o Table: AvgReadings
Query
Choose Query from the upper ribbon.
Paste in and then SAVE:
SELECT DateAdd(minute,-1,System.TimeStamp) as WinStartTime, system.TimeStamp as
WinEndTime, Type = 'Temperature', RoomNumber, Avg(Temperature) as AvgReading,
Count(*) as EventCount
FROM MyEventHubStream
Where Temperature IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
UNION
SELECT DateAdd(minute,-1,System.TimeStamp) as WinStartTime, system.TimeStamp as
WinEndTime, Type = 'Humidity', RoomNumber, Avg(Humidity) as AvgReading, Count(*) as
EventCount
FROM MyEventHubStream
Where Humidity IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
UNION
SELECT DateAdd(minute,-1,System.TimeStamp) as WinStartTime, system.TimeStamp as
WinEndTime, Type = 'Energy', RoomNumber, Avg(Kwh) as AvgReading, Count(*) as
EventCount
FROM MyEventHubStream
Where Kwh IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
UNION
SELECT DateAdd(minute,-1,System.TimeStamp) as WinStartTime, system.TimeStamp as
WinEndTime, Type = 'Light', RoomNumber, Avg(Lumens) as AvgReading, Count(*) as
EventCount
FROM MyEventHubStream
Where Lumens IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
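To see what each SELECT in the query computes, here is a small Python sketch of a one-minute tumbling-window average per room. It is a simplified model, not Stream Analytics itself: the function name, the "time" field, and the sample events are made up for illustration, and the exact handling of events that fall on a window boundary is not modeled.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def tumbling_averages(events, field, type_label, window=timedelta(minutes=1)):
    """Average `field` per room over fixed, non-overlapping time windows."""
    buckets = defaultdict(list)
    for event in events:
        value = event.get(field)
        if value is None:  # same filter as "WHERE <field> IS NOT NULL"
            continue
        # Floor the timestamp to the start of the window it falls into.
        start = event["time"] - (event["time"] - datetime.min) % window
        buckets[(start, event["RoomNumber"])].append(value)
    rows = []
    for (start, room), values in sorted(buckets.items()):
        rows.append({
            "WinStartTime": start,
            "WinEndTime": start + window,
            "Type": type_label,
            "RoomNumber": room,
            "AvgReading": sum(values) / len(values),
            "EventCount": len(values),
        })
    return rows


if __name__ == "__main__":
    # Made-up sample events; the real DeviceSender payload may differ.
    sample = [
        {"time": datetime(2015, 4, 30, 12, 0, 10), "RoomNumber": 1, "Temperature": 20.0},
        {"time": datetime(2015, 4, 30, 12, 0, 50), "RoomNumber": 1, "Temperature": 22.0},
        {"time": datetime(2015, 4, 30, 12, 1, 5), "RoomNumber": 1, "Temperature": 30.0},
        {"time": datetime(2015, 4, 30, 12, 0, 30), "RoomNumber": 2, "Humidity": 40.0},
    ]
    for row in tumbling_averages(sample, "Temperature", "Temperature"):
        print(row)
```

Calling the function once per measure (Temperature, Humidity, Kwh, Lumens) and concatenating the results corresponds to the four SELECT statements joined by UNION.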
Start Streaming Job
Click on Start in the bottom ribbon, choose the default (Job Start Time).
Verify DeviceSender is running (or restart it).
View Data in SQL
After a few minutes you can query the SQL database from SSMS and see the data in AvgReadings.
Stop the DeviceSender app if it’s still running.
You have successfully ingested data from a “thing” (your laptop) to Azure! You pushed that data through a query
(streaming) and sent the aggregated output to a destination in the cloud: Azure SQL Database.
---- Back to SLIDES -----
HOL2: Intro to HDInsight
In Lab 2 we create a Hadoop cluster in Azure using the HDInsight service. Then we RDP to the head node and see that it’s
truly Apache open source Hadoop running on Windows. HDInsight is also available on Linux but we are using Windows
for the lab.
Create an HDInsight Hadoop cluster
Log in to https://manage.windowsazure.com/
Choose HDInsight (the elephant) from the left menu.
Choose New -> Data Services -> HDInsight -> Custom Create.
Page 1 / Cluster Details
o Cluster Name: Your prefix + hdi
o Cluster Type: Hadoop
o Operating System: Windows
o Version: default
Page 2 / Configure Cluster
o Data Nodes: 1
o Region: the same region you’ve been using; the storage account must be in the same region
o Head Node Size: default A3
o Data Node Size: default A3
Page 3 / Configure Cluster User
o Name: Your prefix + admin (you can use the same as the SQL db for the demo but don’t do that in
production)
o Password: (you can use the same as the SQL db for the demo but don’t do that in production)
o Enable the remote desktop for cluster: Yes (you will generally choose No)
RDP User Name: cluster name + 1 (don’t do this in production)
RDP Password: (you can use the same as the SQL db for the demo but don’t do that in
production)
Expires On: tomorrow
o Enter the Hive/Oozie Metastore: No (you will generally choose Yes for production)
Page 4 / Storage Account
o Storage Account: Use existing storage
o Account Name: the storage account we created earlier
o Default Container: data
o Additional Storage Accounts: 0
Page 5 / Script Actions
o Click the arrow to create the cluster, wait about 15 minutes
Use the Hadoop Distributed File System (HDFS)
RDP to the head node.
Get a listing of files:
o hadoop fs -ls /
o hadoop fs -ls /example/data
---- Back to SLIDES -----
HOL3: HDI Batch Analysis and Power BI
We’ll do some batch analysis and create aggregations. Then we will view the data in Power BI.
Hive
Navigate to CloudDataCamp\scripts\Hive in your file explorer.
In the Azure management portal click on your HDInsight instance. Click on Query Console at the bottom of the
screen to open a query window. Log in with the cluster credentials (not the RDP credentials).
Choose the Hive editor.
Create an External Table DeviceReadings
Open CloudDataCamp\scripts\Hive\1_CreateDeviceReadings.txt in a text editor like Notepad.
Update the location: replace <storage account name> with the storage account you created in Hands-on Lab 1
(remove the brackets). Paste the edited query into the Hive editor and hit Submit to create a Hive table.
LOCATION 'wasb://data@<storage account name>.blob.core.windows.net/input';
View the job output; it opens in a new window. For a create schema statement you want to verify there are no
errors (the messages about logging are not errors). It will show the time taken.
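The wasb:// location in the Hive script and the https:// destination used when loading the data refer to the same container and folder; only the way the address is written differs. A minimal Python sketch of the two forms follows, with hypothetical helper names and the container and folder names (data, input) used in this lab as defaults:

```python
def wasb_location(account, container="data", path="input"):
    """Address as Hive/HDInsight writes it: container@account, then the folder."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path}"


def https_location(account, container="data", path="input"):
    """Address as AzCopy or a browser writes it: account host, then container/folder."""
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


if __name__ == "__main__":
    # bddragonstorage is the example account name from this handout.
    print(wasb_location("bddragonstorage"))
    print(https_location("bddragonstorage"))
```

If a Hive table comes back empty, comparing these two strings is a quick way to check that the table points at the folder the files were loaded into.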
Query the table
Copy the query below and run it from the Hive editor:
SELECT deviceId FROM DeviceReadings LIMIT 100;
View the job output.
Create External Tables for Averages
Create and populate tables that store aggregates.
Open CloudDataCamp\scripts\Hive\2_CreateAverageReadingByType.txt.
Edit the location and run from the Hive editor.
Repeat, changing the location and executing the remaining create/insert scripts:
CloudDataCamp\scripts\Hive\3_CreateAverageReadingByMinute.txt
CloudDataCamp\scripts\Hive\4_CreateMaximumReading.txt
CloudDataCamp\scripts\Hive\5_CreateMinimumReading.txt
File Browser
The location of the data was specified in the table creation statements using LOCATION. The browser shows data on the
default storage account for the cluster.
View the original and the aggregated data in the File Browser tab of the console.
If you have CloudXplorer, view the data in CloudXplorer (hit refresh).
Extra Credit
Write SELECT statements to view each table’s data set.
Write more complex queries.
SHOW TABLES;
DESCRIBE FORMATTED AverageReadingByType;
Connect to Hive from Power Pivot using the Microsoft Hive ODBC driver and a DSN.
AzureML
Connect to Hadoop from AzureML. Note that this is not in the CloudDataCamp; the HOL10 in that series points to a flat
file and here we use a Hive query.
From manage.windowsazure.com, click on AzureML and choose to sign in to your AzureML studio.
Choose a new blank experiment.
Drag a Reader from the left to the designer.
Highlight the Reader and view the options you have for connecting.
o Data source: Hive Query
o Hive database query: SELECT * FROM AverageReadingByType
o HCatalog server URI: http://{yourhdicluster}.azurehdinsight.net
o Hadoop user account name: your cluster admin (not RDP) account
o Hadoop user account password: your password
o Location of output data: Azure
o Azure storage account name: {yourstorage account}
o Azure storage key: {yourkey}
o Azure container name: data
Choose Save and Run from the bottom ribbon.
When it completes, view the results data set by right-clicking on the circle and choosing either Visualize or Download.
Reference: https://andersspur.wordpress.com/2014/10/10/use-hive-to-read-data-into-azure-ml/
Cluster Cleanup
At this point we have new data sets created based on aggregates of our first, static data file. We could either leave the
cluster up and query it directly from tools like Power BI using Hive, or drop the cluster and directly access the data in the
flat files. We’ll use the latter: flat files. This emphasizes that these are on-demand clusters; you don’t need to pay to
keep them up all the time.
Drop the HDInsight cluster.
Power Query
Open a new workbook in Excel 2013. Verify you installed and enabled Power Query.
Click on Power Query.
Choose From Azure -> From Microsoft Azure HDInsight.
Enter the storage account you created earlier and the key you saved in Notepad.
In Navigator expand your storage account and double-click on the container named data to open the query
editor.
Find the “Folder Path” column on the far right and choose the dropdown arrow.
Enter output in the search box and you’ll see the ‘directories’ and files we have created today.
If you chose OK, in “Applied Steps” on the far right click the red X next to “Filtered Rows” to remove this filter.
Create a new filter for averageReadingByMinute; this will show a single row (because we had a small amount of
data and only ran the insert once, we only have one file in that directory). Choose OK.
Scroll back to the left and in the “Content” column click on “Binary” to import the file.
Name the columns: DeviceType, ReadingDateTime, RoomNumber, Reading
Choose “Close & Load” from the upper left to create a new sheet called AverageReadingByMinute.
Save the workbook to your desktop.
Power View
Go to the workbook created in the last step.
Choose the Insert tab at the top, then choose Power View in the middle of the top.
It is populated with the table from the worksheet; you can see the columns in “Power View Fields” on the right.
Note that the numeric fields have a sum figure next to them. We don’t want to summarize room number, so go
to the bottom of the “Power View Fields” in the “Fields” section and choose “Do Not Summarize” for
RoomNumber.
Click inside the table in the report designer pane (left). In the Design menu item in the ribbon to the right of
“Power View” choose “Other Chart” -> “Line”.
In the Filters section choose Chart.
Expand DeviceType and put a check next to energy.
Edit the title to “EnergyReadingByMinute”.
Save the workbook and close it.
You have now done distributed processing with Hadoop on Azure (HDInsight), utilizing the power of WASB to access that
same data outside of Hadoop. You then used Power BI to discover and visualize that data, opening up the possibilities
for new data-driven insights.
Cleanup
Verify you have dropped your HDInsight cluster; you are charged for its existence whether you are running
anything or not.
Stop the DeviceSender app if it’s still running.
Drop the other resources we’ve created; they have minimal costs if you aren’t actively using them.
o Streaming Job
o Event Hub (under Service Bus)
o Service Bus namespace
o Storage
o SQL Azure Database cdcasa (and optionally the hosting SQL Server)
o AzureML Experiments
o Resource Group
Optionally delete the Excel workbook.
Optionally remove some or all files and tools from this workshop:
o CloudDataCamp folder and all files
o CloudXplorer
o AzCopy
o DeviceSender