In this session we use a practical scenario to show how concrete tasks can be solved with HDInsight in practice:
- Basics of HDInsight for Windows Server and Windows Azure
- Working with Windows Azure HDInsight
- Implementing MapReduce jobs with JavaScript and .NET code
7. A brief history of Hadoop
2002: The Apache Nutch open source search engine is started by Doug Cutting
2003: Google publishes a paper on GFS (Google File System)
2004: The Nutch Distributed File System (NDFS) is developed
2004: Google publishes a paper on MapReduce
2005: MapReduce is implemented on NDFS
2006: Doug Cutting joins Yahoo! & starts the Apache Hadoop subproject
2008: Hadoop is made an Apache top-level project.
…Yahoo!'s search index runs on a 10,000-node cluster
…Hadoop breaks the record on the 1 TB sort: 209 s on 910 nodes
…The New York Times converts 4 TB of archives to PDFs in 24 h on 100 nodes
http://labs.google.com/papers/mapreduce.htm
Today: Hadoop has become a synonym for Big Data processing
9. RDBMS & Hadoop Comparison
             Traditional RDBMS          MapReduce
Data Volume  Terabytes                  Petabytes / Exabytes
Access       Interactive & batch        Batch
Updates      Read / write many times    Write once, read many times
Structure    Static schema              Dynamic schema
Integrity    High (ACID)                Low (BASE*)
Scaling      Non-linear                 Linear
DBA Ratio    1:40                       1:3000
Source: Tom White's Hadoop: The Definitive Guide
*Basically Available, Soft state, Eventual consistency
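The batch, write-once/read-many model in the table is the MapReduce programming model itself. As a minimal, framework-free sketch (plain Python, no Hadoop involved; all function names are mine), a word count can be expressed as map, shuffle, and reduce phases:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

records = ["Hadoop stores data", "Hadoop processes data in batch"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

In a real cluster, the map and reduce phases run in parallel on many nodes and the shuffle moves data between them; the programming model, however, is just these three steps.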
11. The Hadoop Ecosystem (simplified)
Source: Tom White's Hadoop: The Definitive Guide
12. The Hadoop Ecosystem (parts of it…)
HBase (column DB), Hive, Mahout, Oozie, Sqoop, Cassandra/CouchDB/MongoDB, Avro, ZooKeeper, Pig, Karmasphere, Flume, Cascading, R, Ambari, HCatalog, Datameer, Hortonworks, Cloudera, Splunk, HStreaming, MapR, Hadapt
Hadoop = MapReduce + HDFS
13. There's even more: Mahout for machine learning
Scalable machine learning library that leverages the Hadoop infrastructure
Key use cases:
- Recommendation mining
- Clustering
- Classification
Algorithms: k-means clustering, Naïve Bayes, decision trees, neural networks, hierarchical clustering, positive matrix factorization, and more…
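Mahout runs these algorithms as distributed jobs on top of Hadoop. To show the idea behind one of the listed algorithms, here is a deliberately tiny, single-machine k-means sketch in plain Python (no Mahout involved; function and variable names are mine):

```python
def kmeans(points, k, iterations=10):
    """Plain 2-D k-means; centroids start at the first k points."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(x for x, _ in cluster) / len(cluster),
                                sum(y for _, y in cluster) / len(cluster))
    return centroids

# Two obvious groups, around (0, 0) and (10, 10).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(points, k=2))
```

Mahout's value is that the same assignment/update iteration is expressed as MapReduce jobs, so it scales to data that does not fit on one machine.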
14. R for statistical computing
An open and extensible statistical computing environment, based on the S language
Used by data scientists to explore data and generate graphical output
A well-developed programming language
Many "packages" are available to extend R
17. Big Data in the Enterprise should…
- fit into the existing IT infrastructure
- be easy to manage
- rely on existing skill sets
- be cost-optimized
18. Why Apache Hadoop on Windows?
According to IDC, Windows Server held 73% market share in 2012
Hadoop was traditionally built for Linux servers, so there is a large number of underserved organizations
According to the 2012 Barclays CIO study, big data outranks virtualization as the #1 trend driving spending initiatives
Unstructured data growth exceeds 80% year over year in most enterprises
Apache Hadoop is the de facto big data platform for processing massive amounts of unstructured data
Complementary to existing Microsoft technologies
There is a huge untapped community of Windows developers and ecosystem partners
A strong Microsoft-Hortonworks partnership and 18 months of development make this a natural next step
19. Enterprise Hadoop Distribution: Hortonworks Data Platform (HDP)
Deployment options: OS, cloud, VM, appliance
- Hadoop designed for enterprises
- The "really complete" open source distribution
- An ecosystem designed for interoperability
HDP layers:
- Operational Services: management of the Hadoop environment
- Data Services: store, process & connect
- Platform Services: enterprise availability
- Hadoop Core: distributed data storage & processing
20. Leadership that Starts at the Core
Driving next generation Hadoop
YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
420k+ lines authored since 2006
More than twice nearest contributor
Deeply integrating with the ecosystem
Enabling new deployment platforms (e.g. Windows & Azure, Linux & VMware HA)
Creating deeply engineered solutions (e.g. the Teradata big data appliance)
All Apache, NO holdbacks
100% of code contributed to Apache
21. HDInsight: Windows-optimized Hadoop
Big Data @ Microsoft:
- Microsoft HDInsight Server on Windows Server
- Windows Azure HDInsight Service (cloud)
Enterprise-ready Hadoop
Simplicity & manageability of Windows:
- AD integration
- Monitoring (System Center)
Integrated with Microsoft Business Intelligence
JavaScript, Hive ODBC, .NET
…
Up and running in minutes with HDInsight Service
26. Hadoop on Azure
[Architecture diagram: an HDFS cluster of one Name Node and several Data Nodes, backed by Azure Blob Storage, with SQL Azure and an application endpoint alongside.]
Data sources feeding the cluster:
- On-premise enterprise content: transactional DBs, on-prem logs, internal sensors
- Cloud enterprise content: generated in Azure
- 3rd-party content: Azure DataMarket, generated/stored elsewhere, public content, delivered online
27. Using Blob Storage From HDInsight
The HDInsight cluster is bound to one "default" blob storage account & container at cluster creation time
Using the "default" container requires no special addressing ("/" == root folder, etc.)
Additional blob storage accounts or containers are addressed as:
asv[s]://<container>@<account>.blob.core.windows.net/<path>
Their storage account keys need to be registered in core-site.xml:
<property>
  <name>fs.azure.account.key.accountname</name>
  <value>enterthekeyvaluehere</value>
</property>
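Only the asv[s]:// URI format above comes from the slide; a small, hypothetical helper to assemble such addresses (the function name and parameters are mine) might look like:

```python
def asv_uri(container, account, path, secure=False):
    """Build an asv:// URI in the addressing scheme shown above.

    secure=True selects the asvs (HTTPS) variant of the scheme.
    """
    scheme = "asvs" if secure else "asv"
    return (f"{scheme}://{container}@{account}"
            f".blob.core.windows.net/{path.lstrip('/')}")

print(asv_uri("mycontainer", "myaccount", "/data/input.txt"))
# asv://mycontainer@myaccount.blob.core.windows.net/data/input.txt
```

Paths under the default container, by contrast, need no account or container prefix at all.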
28. Transporting Data with AzCopy
Utility for moving data to/from Azure Blob Storage
(like robocopy)
~50 MB/s transfer rate within the data center
Container     Blob Name
mycontainer   a.txt
mycontainer   b.txt
mycontainer   dir1/c.txt
mycontainer   dir1/dir2/d.txt
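The dir1/c.txt-style names in the table reflect that blob storage has a flat namespace: "folders" are just shared name prefixes. A tiny sketch of emulating a directory listing over such flat blob names (the helper name and sample data are mine):

```python
# Flat blob names within one container, as in the table above.
blobs = ["a.txt", "b.txt", "dir1/c.txt", "dir1/dir2/d.txt"]

def list_folder(blobs, prefix):
    """Emulate a directory listing: a 'folder' is just a name prefix."""
    return [b for b in blobs if b.startswith(prefix)]

print(list_folder(blobs, "dir1/"))  # ['dir1/c.txt', 'dir1/dir2/d.txt']
```

Tools like AzCopy and the HDFS-over-blob layer rely on exactly this prefix convention to present a file-system view of the flat store.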
In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more. For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo!, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: both in the choice of deployment platforms such as Windows, Azure and more, and in creating deeply engineered solutions with key partners such as Teradata. And consistent with our approach, all of this is done in 100% open source.