In this session we use a practical scenario to show how concrete tasks can be solved with HDInsight in practice:
- Basics of HDInsight for Windows Server and Windows Azure
- Working with Windows Azure HDInsight
- Implementing MapReduce jobs with JavaScript and .NET code
7. A brief history of Hadoop
2002: The Apache Nutch open source search engine is started by Doug Cutting
2003: Google publishes a paper on GFS (Google File System)
2004: The Nutch Distributed File System (NDFS) is developed
2004: Google publishes a paper on MapReduce
2005: MapReduce is implemented on NDFS
2006: Doug Cutting joins Yahoo! & starts the Apache Hadoop subproject
2008: Hadoop is made an Apache top-level project.
…Yahoo!'s search index runs on a 10,000-node cluster
…Hadoop breaks the record on the 1 TB sort: 209 s on 910 nodes
…The New York Times converts 4 TB of archives to PDFs in 24 h on 100 nodes
http://labs.google.com/papers/mapreduce.htm
Today: Hadoop has become a synonym for Big Data processing
9. RDBMS & Hadoop Comparison
             Traditional RDBMS          MapReduce
Data Volume  Terabytes                  Petabytes / Exabytes
Access       Interactive & batch        Batch
Updates      Read / write many times    Write once, read many times
Structure    Static schema              Dynamic schema
Integrity    High (ACID)                Low (BASE*)
Scaling      Non-linear                 Linear
DBA Ratio    1:40                       1:3000
Source: Tom White's Hadoop: The Definitive Guide
*Basically Available, Soft state, Eventual consistency
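The batch, write-once/read-many model in the table is the MapReduce programming model itself. As a minimal, framework-free sketch (plain Python, no Hadoop involved; all function names are mine), a word count can be expressed as map, shuffle, and reduce phases:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

records = ["Hadoop stores data", "Hadoop processes data in batch"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

In a real cluster, the map and reduce phases run in parallel on many nodes and the shuffle moves data between them; the programming model, however, is just these three steps.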
11. The Hadoop Ecosystem (simplified)
Source: Tom White's Hadoop: The Definitive Guide
12. The Hadoop Ecosystem (parts of it…)
HBase (column DB), Hive, Mahout, Oozie, Sqoop, Cassandra/CouchDB/MongoDB, Avro, ZooKeeper, Pig, Karmasphere, Flume, Cascading, R, Ambari, HCatalog, Datameer, Hortonworks, Cloudera, Splunk, HStreaming, MapR, Hadapt
Hadoop = MapReduce + HDFS
13. There's even more: Mahout for machine learning
Scalable machine learning library that leverages the Hadoop infrastructure
Key use cases:
- Recommendation mining
- Clustering
- Classification
Algorithms: k-means clustering, Naïve Bayes, decision trees, neural networks, hierarchical clustering, positive matrix factorization, and more…
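Mahout runs these algorithms as distributed jobs on top of Hadoop. To show the idea behind one of the listed algorithms, here is a deliberately tiny, single-machine k-means sketch in plain Python (no Mahout involved; function and variable names are mine):

```python
def kmeans(points, k, iterations=10):
    """Plain 2-D k-means; centroids start at the first k points."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(x for x, _ in cluster) / len(cluster),
                                sum(y for _, y in cluster) / len(cluster))
    return centroids

# Two obvious groups, around (0, 0) and (10, 10).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(points, k=2))
```

Mahout's value is that the same assignment/update iteration is expressed as MapReduce jobs, so it scales to data that does not fit on one machine.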
14. R for statistical computing
An open and extensible statistical computing environment, based on the S language
Used by data scientists to explore data and generate graphical output
A well-developed programming language
Many "packages" are available to extend R
17. Big Data in the Enterprise should…
- fit into the existing IT infrastructure
- be easy to manage
- rely on existing skill sets
- be cost-optimized
18. Why Apache Hadoop on Windows?
According to IDC, Windows Server held 73% market share in 2012
Hadoop was traditionally built for Linux servers, so there is a large number of underserved organizations
According to the 2012 Barclays CIO study, big data outranks virtualization as the #1 trend driving spending initiatives
Unstructured data growth exceeds 80% year over year in most enterprises
Apache Hadoop is the de facto big data platform for processing massive amounts of unstructured data
Complementary to existing Microsoft technologies
There is a huge untapped community of Windows developers and ecosystem partners
A strong Microsoft-Hortonworks partnership and 18 months of development make this a natural next step
19. Enterprise Hadoop Distribution: Hortonworks Data Platform (HDP)
Deployment options: OS, cloud, VM, appliance
- Hadoop designed for enterprises
- The "really complete" open source distribution
- An ecosystem designed for interoperability
HDP layers:
- Operational Services: management of the Hadoop environment
- Data Services: store, process & connect
- Platform Services: enterprise availability
- Hadoop Core: distributed data storage & processing
20. Leadership that Starts at the Core
Driving next generation Hadoop
YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
420k+ lines authored since 2006
More than twice nearest contributor
Deeply integrating with the ecosystem
Enabling new deployment platforms (e.g. Windows & Azure, Linux & VMware HA)
Creating deeply engineered solutions (e.g. the Teradata big data appliance)
All Apache, NO holdbacks
100% of code contributed to Apache
21. HDInsight: Windows-optimized Hadoop
Big Data @ Microsoft:
- Microsoft HDInsight Server on Windows Server
- Windows Azure HDInsight Service (cloud)
Enterprise-ready Hadoop
Simplicity & manageability of Windows:
- AD integration
- Monitoring (System Center)
Integrated with Microsoft Business Intelligence
JavaScript, Hive ODBC, .NET
…
Up and running in minutes with HDInsight Service
26. Hadoop on Azure
[Architecture diagram: an HDFS cluster of one Name Node and several Data Nodes, backed by Azure Blob Storage, with SQL Azure and an application endpoint alongside.]
Data sources feeding the cluster:
- On-premise enterprise content: transactional DBs, on-prem logs, internal sensors
- Cloud enterprise content: generated in Azure
- 3rd-party content: Azure DataMarket, generated/stored elsewhere, public content, delivered online
27. Using Blob Storage From HDInsight
The HDInsight cluster is bound to one "default" blob storage account & container at cluster creation time
Using the "default" container requires no special addressing ("/" == root folder, etc.)
Additional blob storage accounts or containers are addressed as:
asv[s]://<container>@<account>.blob.core.windows.net/<path>
Their storage account keys need to be registered in core-site.xml:
<property>
  <name>fs.azure.account.key.accountname</name>
  <value>enterthekeyvaluehere</value>
</property>
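Only the asv[s]:// URI format above comes from the slide; a small, hypothetical helper to assemble such addresses (the function name and parameters are mine) might look like:

```python
def asv_uri(container, account, path, secure=False):
    """Build an asv:// URI in the addressing scheme shown above.

    secure=True selects the asvs (HTTPS) variant of the scheme.
    """
    scheme = "asvs" if secure else "asv"
    return (f"{scheme}://{container}@{account}"
            f".blob.core.windows.net/{path.lstrip('/')}")

print(asv_uri("mycontainer", "myaccount", "/data/input.txt"))
# asv://mycontainer@myaccount.blob.core.windows.net/data/input.txt
```

Paths under the default container, by contrast, need no account or container prefix at all.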
28. Transporting Data with AzCopy
Utility for moving data to/from Azure Blob Storage
(like robocopy)
~50 MB/s transfer rate within the data center
Container     Blob Name
mycontainer   a.txt
mycontainer   b.txt
mycontainer   dir1/c.txt
mycontainer   dir1/dir2/d.txt
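The dir1/c.txt-style names in the table reflect that blob storage has a flat namespace: "folders" are just shared name prefixes. A tiny sketch of emulating a directory listing over such flat blob names (the helper name and sample data are mine):

```python
# Flat blob names within one container, as in the table above.
blobs = ["a.txt", "b.txt", "dir1/c.txt", "dir1/dir2/d.txt"]

def list_folder(blobs, prefix):
    """Emulate a directory listing: a 'folder' is just a name prefix."""
    return [b for b in blobs if b.startswith(prefix)]

print(list_folder(blobs, "dir1/"))  # ['dir1/c.txt', 'dir1/dir2/d.txt']
```

Tools like AzCopy and the HDFS-over-blob layer rely on exactly this prefix convention to present a file-system view of the flat store.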
In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more. For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo!, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: both in the choice of deployment platforms such as Windows, Azure and more, and in creating deeply engineered solutions with key partners such as Teradata. And consistent with our approach, all of this is done in 100% open source.