[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
- 2. 2© AppliedMicro Proprietary & Confidential
AppliedMicro X-Gene® Processor Philosophy
• Few workloads are compute bound
– Most are limited by memory capacity, bandwidth, or I/O
– HPC workloads are better served by GPGPU
• Scale-out versus scale-up
– High density
– Performance per Watt
– Performance per $
• Balance
– Strong CPU with an optimized ARMv8 core
– Large memory – adequate memory is not an upsell
– Low power – power efficiency is not an upsell
• Open Source
– Open Source Software
– Open Source Hardware
- 3.
X-Gene® 1 Processor
Fully Integrated Server-on-a-Chip
• 8 Custom ARMv8 64-bit Cores
– Up to 2.4 GHz
– 8MB shared L3 cache
• Integrated Memory Controllers
– 4 channel DDR3-1600
• Integrated Networking
– Dual 10 Gb Ethernet SFP+
– Quad 1 Gb Ethernet (SGMII)
• Integrated Storage
– 6 lanes of Serial-ATA 3
• Integrated I/O Interfaces
– 17 lanes of PCI-Express® gen3
– 5 controllers
• 45 Watt TDP
[Block diagram: eight ARM v8 cores, each with L1 I$ and L1 D$, grouped in pairs into four PMDs with one L2 cache per PMD. A coherent network links the PMDs to the 8MB shared L3 and four 72-bit DDR3 channels. An I/O bridge and I/O network connect Serial ATA 3, PCI-Express® 3 (x17 / 5 controllers), USB 2.0, 10 Gb Ethernet (2), 1 Gb Ethernet (4), GPIO, SPI, UART, and I2C. An ARM Cortex M3 with I/D cache is shown alongside the System Management, Power Management, and PKE Engine blocks.]
In Production
- 4.
X-Gene® 2 Processor
Scale-Out Optimized Server-on-a-Chip
• 8 Custom ARMv8 64-bit Cores
– Up to 2.8 GHz
– Expanded instruction set
– 10% higher performance
• Genome Coherent Clustering
• Integrated Memory Controllers
– 4 channel DDR3-1866
• Integrated Networking
– Dual 10 Gb Ethernet (KR)
– RDMA over Ethernet support
• Integrated Storage
– 6 channel Serial-ATA 3
• Integrated I/O Interfaces
– x8 PCI-Express®
– Full I/O virtualization (SMMU)
• 35 Watt TDP
– 50% higher performance / Watt
[Block diagram: eight ARM v8 cores, each with L1 I$ and L1 D$, grouped in pairs into four PMDs with one L2 cache per PMD. A coherent network links the PMDs to the 8MB shared L3 and four 72-bit DDR3 channels. An I/O bridge and I/O network connect Serial ATA 3, PCI-Express® 3 (x8 / 3 controllers), USB 2.0, 10 Gb Ethernet (2), 1 Gb Ethernet (4), GPIO, SPI, UART, and I2C. An ARM Cortex M3 with I/D cache is shown alongside the System Management, Power Management, and PKE Engine blocks.]
Sampling Now
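As a rough illustration of what the memory specifications on the two processor slides imply, the sketch below computes theoretical peak DRAM bandwidth. It assumes each 72-bit DDR3 channel carries 64 data bits per transfer with the remaining 8 bits used for ECC; the function name is illustrative, and real sustained bandwidth will be lower than these peaks.

```python
# Theoretical peak DRAM bandwidth from the channel counts and DDR3 speeds
# listed on the X-Gene 1 and X-Gene 2 slides.
# ASSUMPTION: a 72-bit channel moves 64 data bits (8 bytes) per transfer;
# the other 8 bits are ECC and do not count toward usable bandwidth.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int) -> float:
    """Return peak bandwidth in decimal GB/s for the given memory setup."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000


xgene1 = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1600)
xgene2 = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1866)

print(f"X-Gene 1 (4 x DDR3-1600): {xgene1:.1f} GB/s")
print(f"X-Gene 2 (4 x DDR3-1866): {xgene2:.1f} GB/s")
```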
- 6.
Primary Workloads
Web Tier
Web Serving / Proxy | Apache, NGINX, HAProxy
Web Apps / Hosting | Drupal, WordPress, Rails
Web Caching | Memcached, Redis, Squid
Database | MySQL, MongoDB, Cassandra
Cold Storage, Big Data, Data Analytics
Cold Storage | CEPH, GlusterFS, OpenStack Swift
Big Data | Hadoop MapReduce, Spark
Data Analytics | Lucene, ElasticSearch, Hive
HPC
HPC | CPU / GPU combination workloads
- 7.
AppliedMicro ARMv8 Core Performance
Frequency-Independent
[Bar chart: Dhrystone MIPS/MHz/Core for Atom C2750, Cortex-A57, Xeon E3 (Haswell, 22nm), X-Gene® 2, and X-Gene® 1. Bar values as extracted, not matched to processors: 3.4, 4.2, 4.6, 5.0, 6.5.]
Core Performance: Must be competitive… but it does not tell the full story
• Up to 40% faster than Intel Atom
• Up to 10% faster than ARM Cortex-A57… but higher frequency
• Up to 80% of the performance of Xeon®… but with large memory and lower power
Processor | Cores | Memory | Power
X-Gene 1 | 8-core | 128 GB | 45 Watts
X-Gene 2 | 8-core | 128 GB | 35 Watts
Xeon E3 | 4-core | 32 GB | 80 Watts
- 8.
Enterprise Workload Performance
Web Server (WRK Benchmark)
Metric | AppliedMicro X-Gene® 2 | Intel Xeon® E5-2630v3
Performance (higher is better) | 1038 KRPS | 771 KRPS
Latency (lower is better) | 2.4 ms | 4.4 ms
Bandwidth (higher is better) | 8.5 Gbps | 6.3 Gbps
X-Gene 2 (8c @ 2.4 GHz)
• 4 node 1U / ½ width sled
• 64GB DDR3-1600
• 4 x 10GbE (integrated)
• Wall power: ~190 Watts
Xeon E5-2630v3 (8c/16t @ 2.4 GHz)
• 2P 1U / ½ width sled
• 64GB DDR4-2133
• 4 x 10GbE (NIC)
• Wall power: ~180 Watts
Up to 35% Higher Performance | Lower TCO
Standard CPU benchmarks do not always translate to delivered
workload performance
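The "up to 35%" headline can be checked against the figures on this slide. The sketch below assumes the first value of each pair in the chart belongs to X-Gene 2 (which is what makes the 35% claim work out) and treats the approximate wall-power numbers as exact; the dictionary layout and variable names are illustrative only.

```python
# Figures read off the WRK benchmark slide.
# ASSUMPTION: in each value pair on the chart, the first number is X-Gene 2
# and the second is the Xeon system.
systems = {
    "X-Gene 2": {"krps": 1038, "latency_ms": 2.4, "gbps": 8.5, "watts": 190},
    "Xeon E5-2630v3": {"krps": 771, "latency_ms": 4.4, "gbps": 6.3, "watts": 180},
}

xg = systems["X-Gene 2"]
xe = systems["Xeon E5-2630v3"]

throughput_gain = xg["krps"] / xe["krps"] - 1        # relative requests/sec
bandwidth_gain = xg["gbps"] / xe["gbps"] - 1         # relative Gbps
latency_ratio = xg["latency_ms"] / xe["latency_ms"]  # below 1 means lower latency

# Thousands of requests per second per wall watt (wall power is approximate).
krps_per_watt = {name: s["krps"] / s["watts"] for name, s in systems.items()}
perf_per_watt_gain = krps_per_watt["X-Gene 2"] / krps_per_watt["Xeon E5-2630v3"] - 1

print(f"Throughput gain:    {throughput_gain:.1%}")
print(f"Bandwidth gain:     {bandwidth_gain:.1%}")
print(f"Latency ratio:      {latency_ratio:.2f}")
print(f"Perf per watt gain: {perf_per_watt_gain:.1%}")
```

Because the X-Gene 2 sled draws slightly more wall power in this setup, the per-watt advantage comes out smaller than the raw throughput advantage.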
- 9.
MongoDB Performance with YCSB
Real World In-Memory Database Workload
Hardware Topology (systems connected through a 24-port 10GbE Netgear™ switch; diagram link labels: 1x10GbE, 1x10GbE, 1x10GbE, 1GbE)
• 1U / 2P Rack Server: Intel™ Xeon® E5-2630v3
– 16C/32T 2.4GHz Turbo/HT
– 64GB DDR4-1866
– 2-port 10GbE Mellanox NIC
– CentOS 7.1
• Client: Intel™ Xeon® 2P E5-2630v2
– 12C/24T 2.6GHz Turbo/HT
– 64GB DDR3-1600
– 2-port 10GbE Mellanox NIC
– CentOS 7.1
• HP Moonshot m400: AppliedMicro X-Gene® CPU
– 8-Core 2.4GHz
– 64GB DDR3-1333
– 10GbE Integrated Ethernet
– RHELSA 7.1
• Server Command
– mongoDB version 2.4.9
– compile options: scons mongodb --usev8=false
– Run options: ./mongod --dbpath /root/mongod/data4 --nojournal --logpath=/root/tuan/mongod4_log --port 27021 --logappend
• Benchmark Command
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27017 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27018 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27019 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27020 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27021 -threads $threads
*YCSB from github at commit 5ab241
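The five benchmark invocations on this slide differ only in the target port, so they can be generated rather than typed out. The sketch below is one way to do that, not the script used for the published runs; the function name and the fixed thread count of 10 are illustrative stand-ins for the slide's `$threads` variable.

```python
# Build the five YCSB invocations from the slide, one per mongod port.
import shlex

HOST = "10.66.12.206"
PORTS = range(27017, 27022)  # 27017 through 27021 inclusive


def ycsb_command(port: int, threads: int) -> list[str]:
    """Return the argument list for one YCSB workload-B run against `port`."""
    return [
        "./bin/ycsb", "run", "mongodb", "-s",
        "-P", "workloads/workloadb",
        "-p", f"mongodb.url=mongodb://{HOST}:{port}",
        "-threads", str(threads),
    ]


# ASSUMPTION: 10 threads is used purely as an example value for $threads.
commands = [ycsb_command(port, threads=10) for port in PORTS]
for cmd in commands:
    print(shlex.join(cmd))
```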
- 10.
MongoDB Performance with YCSB
Real World In-Memory Database Workload
Single Moonshot m400 Cartridge = ~50% performance of a full 1U 2P Intel Xeon® E5 Haswell server
Throughput (thousands of ops/sec)
Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 36 | 55
2 | 48 | 100
5 | 89 | 188
10 | 109 | 252
CPU Utilization (%)
Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 25 | 7
2 | 48 | 14
5 | 90 | 30
10 | 100 | 38
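To see where the "~50%" figure sits relative to the individual data points, the sketch below computes the m400-to-Xeon throughput ratio at each thread count. It assumes the first series in the chart is the HP Moonshot m400 and the second is the 2P Xeon server, following the legend order.

```python
# Throughput in thousands of ops/sec, read off the single-node chart.
# ASSUMPTION: the first data series is the m400, the second the 2P Xeon.
THREADS = [1, 2, 5, 10]
M400_KOPS = [36, 48, 89, 109]
XEON_KOPS = [55, 100, 188, 252]

# Ratio of m400 throughput to Xeon throughput at each thread count.
ratios = {t: m / x for t, m, x in zip(THREADS, M400_KOPS, XEON_KOPS)}
mean_ratio = sum(ratios.values()) / len(ratios)

for t, r in ratios.items():
    print(f"{t:>2} threads: m400 delivers {r:.0%} of the Xeon server")
print(f"Mean across thread counts: {mean_ratio:.0%}")
```

The ratio is highest at one thread and falls as the thread count rises, which is consistent with the CPU utilization chart showing the m400 reaching 100% while the Xeon server still has headroom.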
- 11.
MongoDB Performance with YCSB
Real World In-Memory Database Workload
Rack-level Scalability = 5x the Performance versus full 1U 2P Intel Xeon® E5 Haswell rack
42U Rack: 9 Moonshot m400 chassis/rack; 40 2P E5-2630v3 Haswell 1U servers/rack
Rack Level Throughput (millions of ops/sec)
Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 15 | 2
2 | 19 | 4
5 | 36 | 8
10 | 44 | 10
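The rack-level numbers appear to be the single-node results multiplied by the number of units per rack. The sketch below reproduces that scaling under two assumptions that are not stated on the slide: 45 m400 cartridges per Moonshot chassis, and perfectly linear scaling with no shared bottleneck. The first series of each chart is again taken to be the m400.

```python
# Per-unit throughput in thousands of ops/sec, from the single-node chart.
THREADS = [1, 2, 5, 10]
M400_KOPS = [36, 48, 89, 109]
XEON_KOPS = [55, 100, 188, 252]

# Rack composition from the slide.
CHASSIS_PER_RACK = 9
XEON_SERVERS_PER_RACK = 40
# ASSUMPTION (not on the slide): 45 m400 cartridges per Moonshot chassis.
CARTRIDGES_PER_CHASSIS = 45

m400_units = CHASSIS_PER_RACK * CARTRIDGES_PER_CHASSIS  # 405 cartridges


def rack_mops(per_unit_kops, units):
    """Scale per-unit kops/sec to rack-level Mops/sec, assuming linear scaling."""
    return [k * units / 1000 for k in per_unit_kops]


m400_rack = rack_mops(M400_KOPS, m400_units)
xeon_rack = rack_mops(XEON_KOPS, XEON_SERVERS_PER_RACK)

for t, m, x in zip(THREADS, m400_rack, xeon_rack):
    print(f"{t:>2} threads: m400 rack {m:5.1f} M, Xeon rack {x:5.1f} M, ratio {m / x:.1f}x")
```

Rounded to whole millions, these values match the rack-level chart, which suggests the rack figures are projections from single-node measurements rather than full-rack runs.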
- 12.
Postgres Performance with BenchmarkSQL
Real World In-Memory Database Workload
Hardware Topology (systems connected through a 24-port 10GbE Netgear™ switch; diagram link labels: 1x10GbE, 1x10GbE, 1x10GbE, 1GbE)
• 1U / 2P Rack Server: Intel™ Xeon® E5-2630v3
– 16C/32T 2.4GHz Turbo/HT
– 64GB DDR4-1866
– 2-port 10GbE Mellanox NIC
– CentOS 7.1
• Client: Intel™ Xeon® 2P E5-2630v2
– 12C/24T 2.6GHz Turbo/HT
– 64GB DDR3-1600
– 2-port 10GbE Mellanox NIC
– CentOS 7.1
• HP Moonshot m400: AppliedMicro X-Gene® CPU
– 8-Core 2.4GHz
– 64GB DDR3-1333
– 10GbE Integrated Ethernet
– RHELSA 7.1
• Server Command
– PostgreSQL version 9.4.4
– compile options: ./configure
– Run options: su postgres -c '/usr/local/pgsql/bin/postgres -F -D /home/postgres/data -p 5432 &'
PostgreSQL Database Contents
• 32 Warehouses with 100,000 parts inventory/warehouse
• 10 districts/warehouse
• 3000 customers/district
• 1 terminal/district = 1 operator/district to serve 3000 customers
• Total Customers = 32*10*3000 = 960,000
PostgreSQL Database Operations
• New Order = 45%
• Payment = 43%
• Order Status = 4%
• Delivery = 4%
• Stock Level = 4%
• Benchmark Command
– BenchmarkSQL version 4.1.0
cd ~/benchmarksql-4.1.0/run
./runBenchmark.sh props.pg
props.pg file content
driver=org.postgresql.Driver
conn=jdbc:postgresql://10.76.191.182:5432/postgres
user=benchmarksql
password=amcc1234
warehouses=32
terminals=8
runTxnsPerTerminal=0
runMins=2
limitTxnsPerMin=0
newOrderWeight=45
paymentWeight=43
orderStatusWeight=4
deliveryWeight=4
stockLevelWeight=4
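The database sizing and transaction mix on this slide can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the customer total and confirms that the five transaction weights sum to 100; the variable names are illustrative and simply mirror the slide's wording.

```python
# Database contents as described on the BenchmarkSQL slide.
warehouses = 32
districts_per_warehouse = 10
customers_per_district = 3000

total_districts = warehouses * districts_per_warehouse
total_customers = total_districts * customers_per_district

# Transaction mix in percent, matching the weight settings in the properties file.
weights = {
    "newOrder": 45,
    "payment": 43,
    "orderStatus": 4,
    "delivery": 4,
    "stockLevel": 4,
}
total_weight = sum(weights.values())

print(f"Districts: {total_districts}")
print(f"Customers: {total_customers:,}")
print(f"Transaction weights sum to {total_weight}%")
```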
- 13.
Postgres Performance with BenchmarkSQL
Real World In-Memory Database Workload
Single Moonshot m400 Cartridge = ~90% performance of a full 1U 2P Intel Xeon® E5 Haswell server
Throughput (thousands of transactions per minute)
Terminals | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 8 | 4
2 | 15 | 6
4 | 25 | 10
8 | 38 | 18
16 | 52 | 36
32 | 70 | 76
64 | 69 | 120
CPU Utilization (%)
Terminals | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 8% | 1%
2 | 17% | 2%
4 | 30% | 3%
8 | 49% | 5%
16 | 70% | 10%
32 | 97% | 18%
64 | 100% | 28%
1 terminal = 1 operator serving 3000 customers/district
- 14.
Postgres Performance with BenchmarkSQL
Real World In-Memory Database Workload
Rack Level Scalability = 8X-9X the performance versus full 1U 2P Intel Xeon® E5 Haswell rack
42U Rack: 9 Moonshot m400 chassis/rack; 40 2P E5-2630v3 Haswell 1U servers/rack
Rack Level Throughput (thousands of transactions per minute)
Terminals | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 3170 | 146
2 | 6066 | 222
4 | 10106 | 381
8 | 15255 | 724
16 | 21171 | 1458
32 | 28329 | 3023
64 | 28143 | 4783
1 terminal = 1 operator serving 3000 customers/district
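The rack-level advantage depends strongly on how many terminals are driving each server. The sketch below computes the m400-rack to Xeon-rack throughput ratio at each terminal count, assuming the first series in the chart is the m400 rack and the second is the Xeon rack.

```python
# Rack-level throughput in thousands of transactions per minute,
# read off the rack-level chart.
# ASSUMPTION: the first data series is the m400 rack, the second the Xeon rack.
TERMINALS = [1, 2, 4, 8, 16, 32, 64]
M400_RACK_KTPM = [3170, 6066, 10106, 15255, 21171, 28329, 28143]
XEON_RACK_KTPM = [146, 222, 381, 724, 1458, 3023, 4783]

# Ratio of m400 rack throughput to Xeon rack throughput per terminal count.
ratios = {
    t: m / x for t, m, x in zip(TERMINALS, M400_RACK_KTPM, XEON_RACK_KTPM)
}

for t, r in ratios.items():
    print(f"{t:>2} terminals per server: {r:.1f}x")
```

The ratio shrinks as the terminal count grows, because the m400 cartridges saturate near 32 terminals while the Xeon servers keep scaling to 64.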
- 16.
Moor Insights and Strategy Paper
“The First Enterprise Class 64-Bit ARMv8 Server: HP Moonshot System’s HP ProLiant m400 Server Cartridge”
• HP ProLiant™ m400 “Moonshot” Cartridge
– Production shipments as of 4Q, 2014
• AppliedMicro X-Gene® Processor
– First production 64-bit ARMv8-based server SoC
– Server-class performance with mobile efficiency
• Web Tier/Caching Solution
– Service providers and commercial internet providers
– Early adopters of ARM servers
– ARM server/mobile software development community
Copyright ©2014 Moor Insights & Strategy All Rights Reserved
Moor Insights and Strategy Whitepaper outlining product details and
TCO analysis: http://www.moorinsightsstrategy.com/?p=4753
35% Lower TCO for Scale-Out Web-Tier / Caching Environments
- 17.
X-Gene® Technology in the Public Cloud
“This is probably the first example of Moonshot AArch64 running in Europe outside of HP’s development labs, and certainly the first example of generally available Moonshot backed AArch64 instances in an OpenStack public cloud anywhere in the world.”
Dr. Mike Kelly, CEO and founder of DataCentred
- 18.
[Diagram: X-Gene® 1U server with eight X-Gene processors, each with four DIMMs, connected through an Ethernet switch; Xeon® 1U server with two Xeon E5-2660 v3 processors, 16 DIMMs, an IO chipset, and a 10G NIC.]
 | Xeon® 1U Server | X-Gene® 1U Server
No. of vCPUs | 32 (Xeon®) | 64 (X-Gene® 2 ARM)
General Purpose Instances (m3.medium, m3.large, m3.xlarge) | 1x | 2x
Memory Optimized Instances (r3.large, r3.xlarge, r3.2xlarge) | 1x | 2x
X-Gene® 2 vCPU: 2x density per 1U server
- 19.
X-Gene® 2 vs. Intel Xeon®: Rack Level TCO
2x density; 45-50% Instance Cost Reduction
General Purpose Instances | Xeon® Rack | X-Gene® 2 Rack
Rack Power | 13KW | 13KW
Rack Cost | $145K | $145K
No. of vCPUs | 1280 | 2560
Instance Cost | 1x | 0.5x
Memory Optimized Instances | Xeon® Rack | X-Gene® 2 Rack
Rack Power | 14KW | 16KW
Rack Cost | $170K | $190K
No. of vCPUs | 1280 | 2560
Instance Cost | 1x | 0.55x
X-Gene® 2 ARM Instances at 45-50% lower prices than Xeon® Instances
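The relative instance costs on this slide follow from dividing rack cost by vCPU count. The sketch below reproduces that arithmetic under a simplifying assumption: instance cost is taken to be proportional to rack purchase cost per vCPU only, ignoring power and other operating costs, which the slide's own model may or may not include.

```python
# Rack cost (USD) and vCPU count per rack, from the rack-level TCO slide.
racks = {
    "General Purpose": {"xeon": (145_000, 1280), "xgene2": (145_000, 2560)},
    "Memory Optimized": {"xeon": (170_000, 1280), "xgene2": (190_000, 2560)},
}

# ASSUMPTION: instance cost scales with rack cost per vCPU; power and other
# operating costs are left out of this simplified model.
relative_cost = {}
for family, r in racks.items():
    xeon_cost, xeon_vcpus = r["xeon"]
    xgene_cost, xgene_vcpus = r["xgene2"]
    relative_cost[family] = (xgene_cost / xgene_vcpus) / (xeon_cost / xeon_vcpus)

for family, ratio in relative_cost.items():
    print(f"{family}: X-Gene 2 instance cost is {ratio:.2f}x the Xeon cost")
```

Under this model the general purpose case comes out at exactly 0.5x and the memory optimized case at roughly 0.56x, close to the 0.55x shown on the slide.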
- 21.
Storage
Languages
Application Workloads
Web Tier & Storage
Web Proxy
Web Apps
Web Caching
Database
Web Server
Swift
Cinder
- 22.
NFV Update
[Diagram: Evolved Packet Core (EPC) with SAE, PGW, and SGW functions, GTP, IP, and OVS on open source Linux, running on an X-Gene 1 evaluation platform with the Tieto TIP stack and APM software; a traffic generator with a GUI drives the system.]
• AppliedMicro Working with Tieto to Port TIP Stack to X-Gene
• Demo of EPC and TIP using OVS with Traffic Generator
• Multiple Networking Functions Virtualized
– SAE Core
– Packet Data Network Gateway
– Serving Gateway
• Schedule
– Completion December, 2015
- 25.
AppliedMicro Genome™ Platform
New Platform for X-Gene® 2 Technology
• Memory coherent framework for scale-out computing
– Enables the positives of scale-up in a scale-out platform
– Uses a PCIe and 10 GbE interconnect fabric
– Bare metal hypervisor software approach
• Avoids hardware SMP complexity and cost
• Benefits
– More performance per node to address more workloads
• More cores & more memory
• 4 or more X-Gene processors per node
– Single IP address across all processors in a node
– Single Operating System image across all processors in a node
– No software modification required
- 26.
Genome™ Coherent Solution
Genome™ RDMA / PCIe® Low Latency Coherent Fabric
[Diagram: four 8-core X-Gene 2 nodes, each with 32 GB DDR3, presented to application software as 32-core SMP Linux. The nodes are joined by a PCIe Gen3 fabric (PCIe switch, with a "1TB" label) and a 10 Gigabit Ethernet fabric (10 GbE switch) with 2 x 10GbE to the TOR switch. A BMC is also shown.]
Scalable performance across multiple sockets
Customizable for each application – configurable nodes based on need
Lowest power and price for the target application
½ width 1U, 4 node reference implementation
Options: more nodes, more memory
- 27.
X-Gene® 2 Genome™ Development Platform
Gryphon
• Platform
– 4 X-Gene 2 processor nodes
– Up to 2.8 GHz
– 2 DDR3 channels / node
• Up to 64 Gbytes / node (2 DIMMs)
• DDR3-1866
• Form Factor
– 1U ½ width
– Dimensions:
• Depth: 27.5” (698mm)
• Width: 17.64” (448mm)
• Height: 1.75” (44.5mm)
• Features
– 1 SATA HDD/SSD / node
– 16-port 10GbE switch / sled
• 2x10G XFI to top-of-rack switch
– 1GbE management port / sled
– 8-port PCI-Express® Gen 3 switch
– BMC
• IPMI v2.0
– On board thermal sensors
• Power
– 220 Watts
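The per-node figures above can be rolled up into sled-level totals. The sketch below does that arithmetic, assuming 8 cores per X-Gene 2 processor (from the processor slide) and that the 220 Watts figure covers the whole four-node sled; the slide does not state either point explicitly for this platform.

```python
# Per-node figures for the Gryphon development platform.
NODES = 4
MAX_GB_PER_NODE = 64
# ASSUMPTION: each X-Gene 2 node has 8 cores, as on the processor slide.
CORES_PER_NODE = 8
# ASSUMPTION: the 220 W figure is for the whole sled, not a single node.
SLED_WATTS = 220

total_cores = NODES * CORES_PER_NODE
max_memory_gb = NODES * MAX_GB_PER_NODE
watts_per_core = SLED_WATTS / total_cores

print(f"Cores per sled:      {total_cores}")
print(f"Max memory per sled: {max_memory_gb} GB")
print(f"Power per core:      {watts_per_core:.2f} W")
```

The 32-core total matches the "32-core SMP Linux" label on the Genome™ Coherent Solution slide.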