This document discusses the rapid growth of digital data and the challenges of analyzing large, unstructured datasets. It notes that in just one week in 2000, the Sloan Digital Sky Survey collected more data than had been collected in all of astronomy previously. Today, the Large Hadron Collider generates 40 terabytes per second and Twitter generates over 1 terabyte of tweets daily. By 2013, annual internet traffic was predicted to reach 667 exabytes. Hadoop provides a framework to analyze these vast and diverse datasets by distributing processing across commodity clusters close to where the data is stored.
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
Galaxy of bits
1. Galaxy of bits
Surviving the flood of information
Michał Żyliński, Microsoft
(michal.zylinski@microsoft.com)
2. In 2000 the Sloan Digital Sky Survey collected more data in its 1st
week than was collected in the entire history of Astronomy
By 2016 the New Large Synoptic Survey Telescope in Chile will
acquire 140 terabytes in 5 days - more than Sloan acquired in 10
years
The Large Hadron Collider at CERN generates 40 terabytes of data
every second
2
Sources: The Economist, Feb ‘10; IDC
3. Bing ingests > 7 petabyte a month
The Twitter community generates over 1 terabyte of tweets every day
Cisco predicts that by 2013 annual internet traffic flowing will reach 667
exabytes
3
Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
4. 1,800,000,00
1,8 0,000,000,00
0,000 bytes
The size of Digital Universe in
ZB 2011
9
8
7
6
5
Within 24 months #
of intelligent devices >
traditional IT devices
4
3
2 In 2015 nearly 20%
1
0 of the information will
2010 2011 2012 2015 be touched by cloud
Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
10. So how does it work?
SECOND, TAKE THE PROCESSING TO THE DATA
// Map Reduce function in
JavaScript
var map = function
(key, value, context) {
var words =
value.split(/[^a-zA-Z]/);
for (var i = 0; i <
words.length; i++) {
if
(words[i] !== "")
{context.write(words[i].to
LowerCase(), 1);}
}};
var reduce = function
(key, values, context) {
var sum = 0;
while (values.hasNext()) {
sum +=
parseInt(values.next());
}
context.write(key, sum);
};
11. Hadoop in detail
Analysis of semi and unstructured data distributed across a commodity cluster
Based on Google’s MapReduce paper
and Google File system (GFS)
Programs = Sequence of “map” and
“reduce” tasks.
Simplify writing distributed applications
Highly fault tolerant – multiple copies
Move computation close to data
Implemented in Java and optimized for
Linux
14. Traditional RDBMS MapReduce
Data Size Gigabytes (Terabytes) Petabytes (Hexabytes)
Access Interactive and Batch Batch
Updates Read / Write many times Write once, Read many times
Structure Static Schema Dynamic Schema
Integrity High (ACID) Low
Scaling Nonlinear Linear
DBA Ratio 1:40 1:3000
16. Hadoop + Microsoft
Our own • Submit changes back to
distribution of Apache Foundation
Hadoop • Download for free
• AD & Systems Center
Optimized for integration
Windows & Azure • Hadoop-as-a-service-on-
Azure
Focus on .NET • Integration with Visual Studio
Developers • Support for C#
• Performance and Scale
• High Availability
• Ease of use
17. Why Hadoop as a Service?
• Task based billing
• Easy admin
• Zero install
• Support a wide variety of job types
– Machine Learning (mahout), Graph Mining
(Pegasus), HIVE, Pig, Java, JS, etc.
• Greatly simplified UI
cheap fast
27. Benefits
Some other fancy stuff...
Models augmented with
publicly available data
from social media sites
Key Features
Microsoft
Codename
"Social Analytics"
29. Reality check A.D. 2012
ANALYTICS
SELF-SERVICE MOBILE
OPERATIONAL REAL-TIME
PREDICTIVE COLLABORATIVE
MARKETPLACE
DATA ENRICHMENT
External Data
and Services
DISCOVER TRANSFORM SHARE
AND RECOMMEND AND CLEAN AND GOVERN
DATA MANAGEMENT
1
011
01
RELATIONAL NON RELATIONAL MULTIDIMENSIONAL STREAMING
30. Use Case:
• Extremely large volume of
Microsoft unstructured web log
BI Tools analysis
• Ad hoc analysis of
unstructured web logs to
prototype patterns
• Hadoop data feeds large
24TB Cube
24 TB Cube
Hadoop Distribution
Share and collaborate via Windows Azure Marketplace:The Microsoft Big Data solution enables customers to share data and insights through Windows Azure Marketplace, which exposes hundreds of applications and data mining algorithms from Microsoft and third parties to help unlock unprecedented insights for customers. Microsoft’s Hadoop based service for Windows Azure offers seamless connection to Azure Marketplace through the Open Data (ODATA) Protocol.
Integrate with social media:Microsoft’s Big Data solution enables customers to augment their analysis with publicly available data from social media sites (such as Twitter and Facebook) and hundreds of trusted data providers on Windows Azure Marketplace. Microsoft Codename "Social Analytics" allows for integration of social information with business applications.