Introduction to Warehouse-Scale Computers
1. Warehouse-Scale Computers
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture: A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, MK Publishers, and
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines by Luiz André Barroso & Urs Hölzle
7. Warehouse-Scale Computer (WSC)
Provides Internet services
Search, social networking, online maps, video sharing,
online shopping, email, cloud computing, etc.
Differences with HPC clusters
Clusters use higher performance processors & network
Clusters emphasize thread-level parallelism, WSCs
emphasize request/task-level parallelism
Differences with datacenters
Datacenters consolidate different machines & software
into a single location
Datacenters emphasize virtual machines & hardware
heterogeneity to serve varied customers
8. Design Factors for WSC
Cost-performance
Small savings add up
Energy efficiency
Affects power distribution & cooling
Work per joule
Operational costs count
Power consumption is a primary constraint when
designing a system
Dependability via redundancy
Many low-cost components
9. Design Factors (Cont.)
Network I/O
Interactive & batch processing workloads
Web search – interactive
Web indexing – batch
Ample parallelism – finding parallelism isn’t a concern
Most jobs are totally independent, “Request-level
parallelism”
Scale – Its opportunities & problems
Can afford to build customized systems as WSC
require volume purchase
Frequent failures
10. Failure Example
Consider a WSC with 50,000 nodes. MTTF of a node is 5
years. How many failures will there be per day?
MTTF in days = 5 x 365 = 1,825
Failure rate = 1/1,825 per node per day
No of failures per day = 50,000/1,825 ≈ 27.4
Consider a WSC with 50,000 nodes, each node with 4
hard disks. Suppose the annual failure rate of a disk is 4%.
What is the mean time between disk failures?
No of disks = 50,000 x 4 = 200,000
No of failures per year = 200,000 x 0.04 = 8,000
Time between failures = 365 x 24 / 8,000 ≈ 1.095 hours/failure
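The arithmetic on this slide can be checked with a few lines of Python; the constants are the slide's own, and the model assumes independent node/disk failures at a constant rate:

```python
# Back-of-the-envelope failure-rate math from the slide.
NODES = 50_000
NODE_MTTF_YEARS = 5

node_mttf_days = NODE_MTTF_YEARS * 365            # 1,825 days
node_failures_per_day = NODES / node_mttf_days    # ~27.4 failures/day

DISKS_PER_NODE = 4
ANNUAL_DISK_FAILURE_RATE = 0.04

disks = NODES * DISKS_PER_NODE                            # 200,000 disks
disk_failures_per_year = disks * ANNUAL_DISK_FAILURE_RATE # 8,000/year
hours_between_disk_failures = 365 * 24 / disk_failures_per_year

print(round(node_failures_per_day, 1))         # 27.4
print(round(hours_between_disk_failures, 3))   # 1.095
```

At this scale, a disk fails roughly every hour, which is why dependability must come from redundancy rather than from individual components.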
11. Programming Models & Workloads
Batch processing framework
– MapReduce
Map
Applies a programmer-
supplied function to each
logical input record
Runs on thousands of
computers
Provides new set of (key,
value) pairs as intermediate
values
Reduce
Collapses values using
another function
http://www.cbsolution.net/techniques/ontarget/mapreduce_vs_data_warehouse
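The Map and Reduce phases above can be sketched on a single machine; this word-count example is illustrative (the function names are not a real framework API), and a WSC runtime would shard the records across thousands of nodes:

```python
# Minimal single-machine sketch of the MapReduce model (word count).
from collections import defaultdict

def map_fn(record):
    # Apply a programmer-supplied function to one logical input record,
    # emitting intermediate (key, value) pairs.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Collapse all intermediate values for a key using another function.
    return (key, sum(values))

def mapreduce(records):
    intermediate = defaultdict(list)
    for record in records:                      # map phase
        for key, value in map_fn(record):
            intermediate[key].append(value)     # group by key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())  # reduce

print(mapreduce(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```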
14. Programming Models & Workloads
(Cont.)
MapReduce runtime environment schedules
map & reduce tasks to WSC nodes
Availability
Use replicas of data across different servers
Use relaxed consistency
No need for all replicas to always agree
Workload demands
Often vary considerably
15. Computer Architecture of WSC
Often uses a hierarchy of networks for
interconnection
Each 19” rack holds 48 1U servers connected to
a rack switch
Rack switches are uplinked to one or more
switches higher in the hierarchy
Uplink has 48/n times lower bandwidth –
Oversubscription
n – No of uplink ports
Goal is to maximize locality of communication relative
to the rack
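The 48/n oversubscription ratio can be made concrete with assumed figures (48 servers on 1 Gbit/s links; the uplink count is a hypothetical parameter):

```python
# Oversubscription ratio of a rack switch: bandwidth available inside
# the rack vs. bandwidth leaving the rack. Figures are illustrative.
def oversubscription(servers=48, uplinks=8, link_gbps=1.0):
    return (servers * link_gbps) / (uplinks * link_gbps)

print(oversubscription(uplinks=8))   # 6.0 -> uplink is 48/8 = 6x slower
```

With only 8 uplink ports, traffic leaving the rack sees one-sixth of the in-rack bandwidth, which is why software tries to keep communication local to the rack.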
19. Infrastructure & Costs
Location
Proximity to Internet backbones, electricity cost, property tax rates,
low risk from earthquakes, floods, & hurricanes
Power distribution
20. Power Usage
U.S. EPA Report 2007 – 1.5% of total U.S.
power consumption was used by data centers,
which had more than doubled since 2000 &
cost $4.5 billion
21. How Many Nodes can a WSC Support?
Each node
“Nameplate power rating” gives maximum power
consumption
To get actual, measure power under actual workloads
Oversubscribe cumulative nodes power by 40%,
but monitor power closely
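A small sketch of the sizing rule above; the facility budget and per-node nameplate power are assumed figures, not from the slide:

```python
# How many nodes fit under a facility power budget when cumulative
# node power is oversubscribed by 40%? Nodes rarely all draw their
# nameplate maximum at once, so we host 40% more than a naive
# nameplate division allows -- while monitoring power closely.
def max_nodes(facility_watts, nameplate_watts, oversubscribe=1.40):
    return int(facility_watts / nameplate_watts * oversubscribe)

print(max_nodes(8_000_000, 300))   # 37333 nodes for an 8 MW budget
```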
23. Cooling (Cont.)
Cooling system also uses water (evaporation & spills)
e.g. 70,000 to 200,000 gallons per day for an 8 MW facility
24. Efficiency
Power Utilization Effectiveness (PUE)
= Total facility power / IT equipment power
≥ 1
Median PUE in a 2006 study was 1.69
Source: http://hightech.lbl.gov/benchmarking-guides/data-a1.html
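The PUE definition is a one-line calculation; the input powers below are made up to reproduce the study's median value:

```python
# Power Utilization Effectiveness from the slide's definition.
def pue(total_facility_kw, it_equipment_kw):
    # Total facility power includes cooling & distribution losses,
    # so it can never be less than the IT equipment power: PUE >= 1.
    assert total_facility_kw >= it_equipment_kw
    return total_facility_kw / it_equipment_kw

print(pue(1690, 1000))   # 1.69, the median in the 2006 study
```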
25. Performance
Latency is an important metric because it is
seen by users
Bing study
Users will use search less as response time
increases
Service Level Objectives (SLOs) & Service Level
Agreements (SLAs)
Typically given at application level
e.g., 99% of requests complete in under 100 ms
In clouds typically given only for static resources
CPU speed, no of cores, & memory
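An SLO like the one above is easy to check against measured latencies; the sample data here is invented for illustration:

```python
# Check an SLO of the form "99% of requests complete in under 100 ms"
# against a list of measured request latencies.
def meets_slo(latencies_ms, threshold_ms=100.0, fraction=0.99):
    under = sum(1 for t in latencies_ms if t < threshold_ms)
    return under / len(latencies_ms) >= fraction

latencies = [20] * 990 + [250] * 10   # exactly 99.0% under 100 ms
print(meets_slo(latencies))           # True
```

Note that the SLO constrains the tail of the distribution, not the mean: a handful of slow requests can break it even when average latency looks fine.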
26. Cost
Capital expenditures (CAPEX)
Cost to build a WSC
Hardware cost dominates
Operational expenditures (OPEX)
Cost to operate a WSC
Power for nodes & cooling dominates
28. Cloud Computing (Cont.)
WSCs offer economies of scale that can’t be
achieved with a datacenter
5.7 times reduction in storage costs
7.1 times reduction in administrative costs
7.3 times reduction in networking costs
This has given rise to cloud services such as Amazon
Web Services
“Utility Computing”
Based on using open source virtual machine & operating
system software
29. Amazon Web Services
Virtual machines
Xen
Very low cost
$ 0.10 per hour per instance
Primarily relies on open source software
No (initial) service guarantees
No contract required
Amazon S3
Simple Storage Service
Amazon EC2
Elastic Compute Cloud
30. Amazon Web Services – Example
http://www.ryhug.com/free-art-available-on-amazon-amazon-web-services-that-is/
Editor’s notes
1U - A rack unit (abbreviated U or RU) is a unit of measure for rack height, defined as 44.50 mm (1.75 in)
computer room air conditioning (CRAC)
DCiE = 1/PUE
S3 - Simple Storage Service
EC2 - Elastic Compute Cloud