This document summarizes a presentation on tier-1 business intelligence (BI) in the world of big data. It covers Microsoft's BI capabilities at large scale, big data workloads from Yahoo! and investment banks, Hadoop and the MapReduce framework, and getting data out of big data systems into BI tools. It also shares a case study on Yahoo!'s advertising analytics platform, which processes billions of rows daily from terabytes of data.
1. October 11-14, Seattle, WA
Tier-1 BI in the World of
Big Data
SQLCAT
Thomas Kejser, Denny Lee – Microsoft
w/ special guest Kenneth Lieu – Yahoo!
2. Questions?
• Are you interested in how to build, deploy, and
maintain multi-terabyte cubes?
• This is 400+ level information
• “Does Hadoop give you a case of hives?”
• If you understand the pun, definitely stay
• What does Tier-1 or enterprise mean to you?
• You might not like this presentation if you answered
in gigabytes
BIA-408-A | SQLCAT: Tier-1 BI in the world of Big Data 2
3. Agenda
Microsoft BI Today
Two different workloads, same challenge
• Ad Analytics
• Investment Banks
HADOOP: The mother of all stovepipes
The Big Shuffle
Getting data OUT of BigData
5. Microsoft BI Today
Two Data Models
• Dimensional (UDM)
• Tabular (The model formerly known as BISM)
UDM is the current large scale engine
• Yahoo!’s 24TB cube
• Multi-terabyte cubes are quickly becoming the
norm
6. UDM Scale Themes
• Get that hardware balance right (yes, we have to
talk about IOPS)
• Repeat after me: partitioning, partitioning,
partitioning!
• Multi-user query concurrency – how to handle it
• Keeping it simple
• Locking – how it works – and how to work around it
• What? Did you say ROLAP?
7. UDM Guidance
sqlcat.com
• Analysis Services 2008R2 Performance Guide
• Analysis Services 2008R2 Operations Guide
• SSAS Maestro Course (Tech Level 500, 5-day course)
9. Yahoo! TAO Technical Requirements
Visitors to Yahoo! branded sites: 680,000,000
Ad Impressions: 3,500,000,000 (per day)
Refresh Frequency: Hourly
Rows Loaded: 464,000,000,000 (per qtr)
Average Query Time: <10 seconds
10. Investment Banks – Technical Requirements
Risk Vectors reloaded / 30 min: 5,000,000,000
Total Vectors Loaded: 600,000,000,000
Refresh Frequency: Seconds
Total Concurrent, Active Queries: Thousands
Average Query Time: Seconds
11. Workload Scale Themes
Old Themes:
• Getting I/O right (solved!)
• Getting configuration right (solved: Maestro and Fast Track)
• Getting Data Models right (done, but spread the word)
• SMP User concurrency (done!)
• SMP Scaled ETL (done, World Record)
New Themes:
• Cheap storage at scale
• Massive query scale (both size and concurrency)
• Scaling ETL another order of magnitude
• Scaled and Integrated Reporting/BI
....What did we learn?
12. DW/BI Scale is getting expensive…
Component | Current Max | Example Hardware
Cores | 128 (256) | SGI Altix UV 100
Memory | 2TB | IBM x5 Series, HP DL980
Attached Storage Capacity (at reasonable speed) | 200-400 TB? | Custom-built DAS, HP P9500, EMC Symmetrix, Hitachi HDS
Max Table Scan Speed | 36GB/sec | HP DL980
Max IOPS | 1M IOPS | FusionIO Octals, 2 x dedicated enterprise-grade SAN
Max Bulk Speed | 16M rows/sec | Unisys 7600R
Max Extract Speed | 41M rows/sec | 4 x 10Gbit Ethernet, 64-core dedicated server
Biggest Cube | 24TB |
Largest Single DB | 75TB | HP Superdome, 128 cores, dedicated SAN
13. Compression to the Rescue?
Example Compression Rates with Column
Store/VertiPaq:
• Web Logs 30:1
• Trade Risk Vectors 9:1
Good news: Columnar compression can shave off an
order of magnitude
Bad news: you still have a lot more data than you
can comfortably handle in a single box
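The order-of-magnitude shave above comes from columnar tricks such as dictionary and run-length encoding. A minimal Python sketch of those two tricks (illustrative only – not the VertiPaq implementation; the column values are made up):

```python
from itertools import groupby

def dictionary_encode(column):
    """Replace repeated values with small integer ids plus a lookup dictionary."""
    ids = {}
    encoded = []
    for v in column:
        encoded.append(ids.setdefault(v, len(ids)))
    return encoded, ids

def run_length_encode(encoded):
    """Collapse runs of identical ids into (id, run_length) pairs."""
    return [(k, sum(1 for _ in g)) for k, g in groupby(encoded)]

# A web-log-like column: few distinct values, long runs once sorted
column = ["BR"] * 500 + ["US"] * 300 + ["DE"] * 200
encoded, ids = dictionary_encode(column)
rle = run_length_encode(encoded)
print(len(column), "->", len(rle))  # 1000 values collapse to 3 (id, count) pairs
```

Low-cardinality, repetitive columns (web logs, risk vector keys) are exactly where these encodings pay off, which is why the ratios above differ so much by workload.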
14. What we saw and see…
Murphy’s Law | Risk | Countermeasure
Programmers use more CPU than you have | Can’t add beyond max cores | You try to scale out
You scaled out | You get WORSE scale | Poor you!
You bought too little hardware | System is unresponsive | Buy too much hardware
You bought too much hardware | You wasted money | Poor you!
Programmer “forgot” to write multi-threaded code | You buy more hardware, the system scales WORSE! | Rework code
You reworked code | You “forgot” how hard it is to write multi-threaded code | Poor you!
You capacity planned disks wrong | You run out of disk space; system is down | You bought a big SAN to compensate
You bought a big SAN to compensate | You wasted money | Poor you!
15. ....IT JUST WASN’T ENOUGH!
HADOOP is the Mother of all Stovepipes
• Statistics, they catch you
• SATA drives? Really?
• Serialization
• It is always day zero
• Code is free; the IP is in the data
• Map/Reduce
• Name/Value pairs?
• MTTI
• Cheap, Fast, Quality: choose two, not three
16. Scale - What are we trying to achieve?
[Chart: Throughput vs. some hardware resource – a “Good” system scales linearly, a “So so” one flattens out, a “Bad” one degrades; we want to live in the linear region]
18. Statistics catch up with you
In a large system, something is ALWAYS broken
Mirrors are no longer enough
• The clone breaks before its master can be
re-established
• Example: Azure uses three copies of data
User queries run wild, get killed, racks overheat,
network switches die etc…
= Design for failure!
19. NoSQL ecosystem | open source, commodity
[Diagram: per-vendor stacks – Cassandra, Hive, Scribe, Hadoop | Hadoop, Oozie, Pig (-latin) | BackType: Hadoop, Pig / HBase, Cassandra | MR/GFS, Bigtable, Dremel, … | SimpleDB, Dynamo, EC2 / S3, …]
Internal [ Dryad | Cosmos] and External [ Isotope | Azure | Excel | BI | SQL DW | LTH ]
Mahout | Scalable machine learning and data mining
MongoDB | Document-oriented database (C++)
Couchbase | CouchDB (doc DB) + Membase (memcache protocol)
HBase | Hadoop column-store database
R | Statistical computing and graphics
Pegasus | Peta-scale graph mining system
Lucene | full-featured text search engine library
20. Comparing RDBMS and MapReduce
 | Traditional RDBMS | MapReduce
Data Size | Gigabytes (Terabytes) | Petabytes (Exabytes)
Access | Interactive and Batch | Batch
Updates | Read / Write many times | Write once, Read many times
Structure | Static Schema | Dynamic Schema
Integrity | High (ACID) | Low (BASE)
Scaling | Nonlinear | Linear
DBA Ratio | 1:40 | 1:3000
Reference: Tom White’s Hadoop: The Definitive Guide
21. Traditional RDBMS: Move Data to Compute
As you process more and more data, and you want interactive response
• Typically need more expensive hardware
• Failures at the points of disk and network can be quite problematic
It’s all about ACID: atomicity, consistency, isolation, durability
Can work around this with more expensive HW and systems
• Though the distribution problem only gets harder
22. Hadoop / NoSQL: Move Compute to the Data
Hadoop (and NoSQL in general) follows the MapReduce framework
• Developed initially by Google -> MapReduce and Google File System
• Embraced by the community to develop MapReduce algorithms that are very robust
• Built Hadoop Distributed File System (HDFS) to auto-replicate data to multiple nodes
• And execute a single MR task on all/many nodes available on HDFS
Use commodity HW: no need for specialized, expensive network and disk
Not so much ACID, but BASE (Basically Available, Soft state, Eventually consistent)
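The MapReduce flow described above can be sketched in a few lines of Python. This is an illustrative single-process model, not Hadoop: `map_fn` emits name/value pairs, the framework groups them by key, and `reduce_fn` folds each group (the impression records are made up):

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Each mapper emits (key, value) pairs; pairs are then grouped by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Each reducer folds the values for one key into a single result."""
    return {key: reduce_fn(values) for key, values in groups.items()}

# Map: one (country, 1) pair per impression; Reduce: sum counts per country
impressions = [{"country": "BR"}, {"country": "US"}, {"country": "BR"}]
groups = map_phase(impressions, lambda r: [(r["country"], 1)])
counts = reduce_phase(groups, sum)
print(counts)  # {'BR': 2, 'US': 1}
```

In Hadoop the same shape applies, but mappers run next to the HDFS blocks holding their input, which is the “move compute to the data” point.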
23. Query a web log using HiveQL
// Sample generated log
588.891.552.388,-,08/05/2011,11:00:02,W3SVC1,CTSSVR14,-,-,0,-
,200,-,GET,/c.gif,Mozilla/5.0 (Windows NT 6.1; rv:5.0)
Gecko/20100101 Firefox/5.0,http://foo.bar.com/cid-
4985109174710/blah?fdkjafdf,[GUID],-,-
,&Page=blah&Hierarchy=2&region=Z1&IsoCy=BR&Lang=1046&bxr=…
select
  parse_url(concat("http://www.blah.com?", parameters), 'QUERY', 'IsoCy'),
  parse_url(concat("http://www.blah.com?", parameters), 'QUERY', 'Lang'),
  count(distinct GUID)
from ctslog_sample
group by
  parse_url(concat("http://www.blah.com?", parameters), 'QUERY', 'IsoCy'),
  parse_url(concat("http://www.blah.com?", parameters), 'QUERY', 'Lang')
HiveQL: SQL-like language
• Write a SQL-like query which becomes MapReduce functions
• Includes functions like parse_url and concat so one can perform parsing in HiveQL
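Hive’s `parse_url(url, 'QUERY', key)` extracts one query-string parameter from a URL. A rough Python equivalent shows what the HiveQL above computes per row (the dummy `http://www.blah.com?` prefix mirrors the `concat` trick, which turns the raw parameter string into a parseable URL; parameter names come from the sample log):

```python
from urllib.parse import urlsplit, parse_qs

def parse_url_query(url, key):
    """Mimic Hive's parse_url(url, 'QUERY', key): one query parameter or None."""
    params = parse_qs(urlsplit(url).query)
    return params[key][0] if key in params else None

# Raw parameter string from the log, made URL-shaped via the dummy prefix
parameters = "&Page=blah&Hierarchy=2&region=Z1&IsoCy=BR&Lang=1046"
url = "http://www.blah.com?" + parameters
print(parse_url_query(url, "IsoCy"), parse_url_query(url, "Lang"))  # BR 1046
```

The HiveQL query then groups rows on these two extracted values and counts distinct GUIDs per group.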
24. But how FAST are we, when we achieve it?
The precarious balance between scale and
performance is going to get even more important.
What do you want?
1. Guaranteed response, but get it slow
2. Fast response, but not always
25. ETL: THE BIG SHUFFLE
26. Our Ideal, scalable world
1-1000
“Logical” Table
1001-2000
2001-3000
3001-4000
Nice
and friendly
Source
28. More Reality… Sources Are Not Nice…
1-1000
“Logical” Table
1001-2000
2001-3000
3001-4000
1,1001,2001
3,1003,2003..
4,1004,2004..
2,1002,2002..
Etc…
29. Investment Bank Architecture – First stab
BigData
Cluster
Batches
Batches
Batches
“Golden”
Source
AS Cube
1:1
1:1
1:1
1-3M rows/sec
30. Sort/Merge Buffer
Zooming in on the Merge Problem
AS Cube
Batch 1
Batch n
Give me Book X!
X
X
Batch 2
Batch 3
X
X
1:1
1:1
1:1
1:1
31. The big shuffle!
[Diagram: several ETL units, each calculating a hash and distributing rows into buckets 0–3]
• Each unit operates on a subset of the data
• Computation is distributed
• Database does the minimum work; focus on an optimized user model!
• Equal-sized partitions after the merge (the merge is still there)
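The hash/distribute step can be sketched as follows. This is an illustrative single-process model with made-up row data; a real ETL unit would stream each bucket to a remote partition. A stable hash is used so every unit routes the same key to the same bucket:

```python
from collections import defaultdict
import zlib

def stable_hash(key, buckets):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(str(key).encode()) % buckets

def shuffle(rows, key_fn, buckets):
    """Route each row to the partition owning its hash bucket."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[stable_hash(key_fn(row), buckets)].append(row)
    return partitions

# Hypothetical rows keyed by "book", as in the merge-problem slide
rows = [{"book": i, "value": i * 10} for i in range(1000)]
parts = shuffle(rows, lambda r: r["book"], 4)
print(sorted(len(p) for p in parts.values()))  # four roughly equal partitions
```

Because every ETL unit agrees on the bucket for a given key, the downstream database receives pre-partitioned, roughly equal-sized batches and no longer has to do the merge itself.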
32. Investment Bank architecture – Better!
BigData
Cluster
Batches
Batches
Batches
“Golden”
Source
AS Cube
Hash 3
Hash 2
Hash 1
3M rows/sec
(Current)
20x throughput
33. Shuffle Speed Tests
BULK Inbound Speed to SQL Server SMP
• >3GB/sec
Outbound from SQL Server: 40M rows/sec
• ... Or saturating 4 x 10Gbit NIC one way
When you have shuffled:
Using standard relational / MDX functionality to ad-hoc query a subset of BigData
High concurrency access at low CPU cost
34. Network as the New Barrier?
36. Hive Connector: First Step in Integration
with our BI Platform
New Hive ODBC driver
Leverage Hadoop for Map Reduce, text mining, statistical analysis, etc.
Get Hadoop data into AS, RS, PowerPivot using HiveQL
HDFS
Map Reduce
Hive
AS Tabular AS Multidimensional
Crescent Excel
PowerPivot
Analytical Apps
SQL Engine
PDW
RS
37. Summary: The Challenge Ahead
[Diagram: the Cube (“this...”), the “Mart” / EDW (“and this...”), and the feeds between them (“this...”) are what we need to get good at now!]
39. Yahoo! manages a
powerful scalable
advertising exchange
that includes publishers
and advertisers
Yahoo! TAO Business Challenge
40. Advertisers want to get
the best bang for their
buck by reaching their
targeted audiences
effectively and efficiently
Yahoo! TAO Business Challenge
41. Yahoo! needs visibility into how consumers
are responding to ads along many
dimensions: web sites, creatives, time of
day, gender, age, location to make the
exchange work as efficiently and
effectively as possible
Yahoo! TAO Business Challenge
42. Yahoo! TAO Technical Requirements
Visitors to Yahoo! branded sites: 680,000,000
Ad Impressions: 3,500,000,000 (per day)
Refresh Frequency: Hourly
Rows Loaded: 464,000,000,000 (per qtr)
Average Query Time: <10 seconds
43. Yahoo! TAO Platform Architecture
How did we load so much so quickly?
Data Archive & Staging
Oracle 11G RAC
File 1
File 2
File N
Partition 1
Partition 2
Partition N
Partition 1
Partition 2
Partition N
24TB
Cube
/qtr
1.2TB
/day
135GB/day
compressed
2PB
cluster
Data Aggregation & ETL
Hadoop
BI Server
SQL Server Analysis
Services 2008 R2
44. Yahoo Example – “Fast” Oracle Load
• Data is streamed into Oracle via files
• To get max processing, 30 threads are fired because all T (temp) partitions are
processed concurrently
• Super fast data loads
• Problem: it requires constant merging of partitions
Files are streamed in
as they become
available
10/10/10 T360772
10/10/10 T360773
…
10/10/10 T361645
10/10/10 T360772
Oracle 10g
10/10/10 T360773
10/10/10 T361645
…
10/10/10 T360772
10/10/10 T360773
10/10/10 T361645
…
SSAS
10/10/10
Merge
45. Partitions – Directly Merging
Partitions
10/10/10 00:00
Oracle 10g
10/10/10 01:00
10/10/10 23:00
…
• New model allows for set hourly partitions
• No more streaming data but with hourly partitions, cannot have as many threads for
fast data loads, unless…
• Process multiple cubes or measure groups in parallel
Partitions
10/10/10 00:00
10/10/10 01:00
10/10/10 23:00
…
SSAS
Segments
10/10/10 00:00
10/10/10 01:00
10/10/10 23:00
…
Activities
10/10/10 00:00
10/10/10 01:00
10/10/10 23:00
…
Uniques
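The “process multiple cubes or measure groups in parallel” idea above can be sketched with a thread pool. `process_partition` is a hypothetical stand-in for an SSAS processing call; the measure-group names come from the slide:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(measure_group, hour):
    """Hypothetical stand-in for processing one hourly SSAS partition."""
    return f"{measure_group} 10/10/10 {hour:02d}:00 processed"

measure_groups = ["Segments", "Activities", "Uniques"]  # from the slide above
work = [(mg, h) for mg in measure_groups for h in range(24)]

# Fan out: every (measure group, hour) partition processes concurrently,
# recovering the parallelism lost when moving from streamed to hourly partitions
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda t: process_partition(*t), work))
print(len(results))  # 72
```

With 3 measure groups x 24 hourly partitions there are 72 independent units of work, so the loader can keep many threads busy even though each measure group now has only 24 fixed partitions.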
46. BI Query Servers
SQL Server Analysis
Services 2008 R2
24TB
Cube
/qtr
Adhoc Query/Visualization
Tableau Desktop 6
Optimization Application
Custom J2EE App
Yahoo! TAO Platform Architecture
Queries at the “speed of thought”
464B rows of
event level data
/qtr
• Dimensions: 24
• Attributes: 247
• Measures: 207
Avg Query Time:
6 secs
Avg Query Time:
2 secs
47. Yahoo! TAO Return on Investment
For campaigns
optimized using TAO,
advertisers spent 15%
more with Yahoo! than
before
For campaigns optimized using TAO, eCPMs (revenue) have more than doubled!
48. Yahoo! TAO Return on Investment
Yahoo! TAO exposed customer segment
performance to campaign managers and
advertisers for the first time! No longer
“flying audience blind”
49. Yahoo! TAO Future Direction
Increase daily ad impressions: 2x
Increase consumer segments: 5x
New complexity: Distinct Count; Hadoop to SSAS
New technologies: Denali (Apollo, VertiPaq, and Crescent)
51. Big Data and Analytics
• Later this year
• HiveODBC driver
• Hadoop-to-SQL/PDW connectors
• Hadoop on Windows Azure
• Mid-next year
• Hadoop on Windows Server
52. Complete the Evaluation Form
to Win!
Win a Dell Mini Netbook – every day – just for submitting
your completed form. Each session evaluation form
represents a chance to win.
Pick up your evaluation form:
• In each presentation room
• Online on the PASS Summit website
Drop off your completed form:
• Near the exit of each presentation room
• At the Registration desk
• Online on the PASS Summit website
Sponsored by Dell
53.
Microsoft SQL
Server Clinic
Work through your
technical issues with SQL
Server CSS & get
architectural guidance from
SQLCAT
Microsoft
Product Pavilion
Talk with Microsoft SQL
Server & BI experts to
learn about the next
version of SQL Server and
check out the new
Database Consolidation
Appliance
Expert Pods
Meet Microsoft SQL
Server Engineering team
members &
SQL MVPs
Hands-on Labs
Get experienced through
self-paced & instructor-led
labs on our cloud based lab
platform - bring your laptop
or use HP provided
hardware
Room 611 Expo Hall 6th Floor Lobby Room 618-620
54. October 11-14, Seattle, WA
Thank you
for attending this session and the
2011 PASS Summit in Seattle
Speaker notes
Invitation to leave
The number of ad performance factors (i.e. dimensions) and the number of ad impressions per day is huge
Yahoo! branded sites attract 680 million unique visitors worldwide
3.5B performance display ad impressions served on Yahoo! exchange per day
Large many to many relationships (consumers can be a member of more than one segment)
Each consumer is a member of an average of 10 segments – explodes the data by 10x
161B rows per quarter for impression data
203B rows per quarter for segment data (compressed but # of rows processed is really 10x = 2 trillion)
Given the number of permutations, query performance needs to be speed of thought or the system is useless
Traditional ROLAP is too slow
Hundred of dimensions, attributes and metrics create complexity
Need integration with good visualization tools to find relevant trends and performance improvement opportunities
Data needs to be fresh (from ad impression to query in less than 24 hours) or opportunities are lost
Display ad campaigns have very short timeframes (< 2 weeks)
Who pays for the sorting?
Like the NYSE, the Yahoo! ad network behaves like an exchange for display advertising
Advertisers are the buyers
Publishers (web sites) are the sellers (Yahoo! is one of the publishers)
Yahoo! needs to create the most efficient exchange as possible
Performance display advertising requires that we can:
Identify the target audience for a campaign
Monitor how they behave across a number of different dimensions
Huge opportunity for optimization but difficult given the large number of discrete dimensions
Key design concepts are:
Use standard, off the shelf parts
Loosely coupled components (using a pull architecture)
Centralize data aggregation on grid using Hadoop
Leverage Oracle’s external table feature to make data available to SSAS with minimal latency
One-to-one match of SSAS partitions to Oracle partitions, so no aggregation needed & partition pruning enabled (30+ trillion rows in Oracle tables)
Maximize parallel loading (90+ threads loading in parallel)
Separate cube building from cube querying
Improvements in HW/Design
9h -> 2.5h: Change in HW: IBM x3560 M3 256GB RAM, 48 cores; EMC Clariion SAN
2.5h -> 1.25h: Use of Data Direct / Attunity drivers
Cube is complex due to nature of the ad business
Need to provide an “anything by anything” query environment to find the optimization opportunities
If queries aren’t fast, we lose the value
Need to update the cube continuously given that there’s limited time to optimize a display ad campaign (data needs to be updated 4x day at minimum)
Used SSAS aggregations extensively – cut down on Hadoop aggregations dramatically
Only 8 fact tables loaded (4 areas, 1 detail, 1 aggregate)
As opposed to an existing ROLAP application at Yahoo! that requires 3,600 fact (aggregate) tables
Doubled the eCPM (revenue) by allowing our campaign managers to “tune” campaign targeting and creatives
Drove increase in spend from advertisers since they got better performance by advertising through Yahoo!
Include all Yahoo! network display ads (additional 3.5B ad impressions) – doubles the number of impressions
Branded Display
Performance Display
Increase the number of consumer segments tracked by 5x (from 50 to 256)
Add unique user (distinct count) metrics for anything by anything queries
Load data into cube directly from Hadoop (skip Oracle load)
Leverage SQL Server Denali Vertipaq & Crescent