© 2017 Arm Limited
SFO17-314 Optimizing Golang for High Performance with ARM64 Assembly
Wei Xiao
Staff Software Engineer
Wei.Xiao@arm.com
September 27, 2017
Linaro Connect SFO17
Agenda
• Introduction
• Differences from GNU Assembly
• Integrate assembly into Golang
• Optimize CRC32 for arm64
• Optimize SHA256 for arm64
• Optimize IndexByte for arm64
• Work Summary and Next steps
Introduction
• Assembly optimization benefits
• Take advantage of ARMv8 capabilities
– Hardware-specific instructions (such as SVC, AES, SHA, etc.)
– Vector (Single Instruction Multiple Data) Instructions
• Others
– No need for CGo dependency
– Avoid runtime context switching overhead
– Optimized code (vs Go compiler)
– Faster compilation
Assembly Optimization Current Status
• Go Standard packages with assembly optimization
crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5
crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512
hash/crc32 math math/big reflect
runtime runtime/cgo runtime/internal/atomic runtime/internal/sys
strings sync/atomic syscall ……
red – arm64 optimization ongoing
black – no arm64 optimization
Assembly Terminology
• Mnemonic
• CALL, MOVW, MOVD, …
• Register
• R1, F0, V3, …
• Immediate
• $1, $0x100, …
• Memory
• (R1), 8(R3), …
Registers in AArch64
Instruction Differences from GNU Assembly
• Semi-abstract instruction set (Plan 9 from Bell Labs)
• Architecture independent mnemonics like MOVD
• Some architecture aspects shine through
• Assembler may insert prologues, remove ‘unreachable’
instructions
• Instructions may be expanded by the assembler
• Not all instructions available
• BYTE/WORD/LONG directives to lay down opcodes into
instruction stream directly
    // func Add(a, b int) int
    TEXT ·Add(SB),$0-24
        MOVD a+0(FP), R0      // first argument
        MOVD b+8(FP), R1      // second argument
        ADD R1, R0, R0        // R0 = R0 + R1
        MOVD R0, ret+16(FP)   // store the result
        RET
Operand Differences from GNU Assembly
• Data flow from left to right
• ADD R1, R2 → R2 += R1
• SUBW R12<<29, R7, R8 → R8 = R7 - (R12<<29)
• Memory operands: base + offset
• MOVH (R1), R2 → R2 = *R1
• MOVBU 8(R3), R4 → R4 = *(8 + R3)
• MOVD mypackage·myvar(SB), R8 → R8 = myvar
• Addresses
• MOVD $8(R1), R3 → R3 = R1 + 8
• MOVD $·myvar(SB), R4 → R4 = &myvar
• Go declaration referenced above:
    package mypackage
    var myvar int64
• The middle dot (·) in symbol names is Unicode U+00B7
Go Assembly Extension for arm64
• Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd
• Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T>
• Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd
• Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>]
• Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>]
• Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go
• Full details
• https://go-review.googlesource.com/c/go/+/41654
Assembly Build Rule
• Toolchain will select appropriate assembly files according to GOOS+GOARCH
• Using file name suffixes, e.g.
• sys_linux_arm64.s
• sys_darwin_arm64.s
• Example: assembly files for: hash/crc32
• crc32_amd64p32.s
• crc32_amd64.s
• crc32_arm64.s
• crc32_ppc64le.s crc32_table_ppc64le.s
• crc32_s390x.s
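The selection rule above can be sketched in Go. This is a simplified illustration, not the toolchain's actual implementation: the function name selected and the small knownOS/knownArch tables are assumptions made for this example, and the real rules in go/build cover more cases (build tags, many more operating systems and architectures).

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// Hypothetical, deliberately short tables; the real toolchain knows many more values.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true}

// selected reports whether an assembly file would be picked for goos/goarch,
// looking only at a trailing _GOOS, _GOARCH or _GOOS_GOARCH in the file name.
func selected(name, goos, goarch string) bool {
	parts := strings.Split(strings.TrimSuffix(name, ".s"), "_")
	n := len(parts)
	if n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]] {
		return parts[n-2] == goos && parts[n-1] == goarch
	}
	if n >= 2 && knownOS[parts[n-1]] {
		return parts[n-1] == goos
	}
	if n >= 2 && knownArch[parts[n-1]] {
		return parts[n-1] == goarch
	}
	return true // no recognized suffix: the file is not constrained by its name
}

func main() {
	files := []string{"crc32_amd64.s", "crc32_arm64.s", "crc32_ppc64le.s", "crc32_table_ppc64le.s", "crc32_s390x.s"}
	for _, f := range files {
		fmt.Printf("%-24s selected for %s/%s: %v\n", f, runtime.GOOS, runtime.GOARCH,
			selected(f, runtime.GOOS, runtime.GOARCH))
	}
}
```

Note that crc32_table_ppc64le.s is matched on its last element only, since "table" is not an operating system name.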
Prototype
• Function call is the bridge between Go and assembly
• Function declaration
• src/runtime/timestub.go
• func walltime() (sec int64, nsec int32)
• Function assembly implementation
• src/runtime/sys_linux_arm64.s
• TEXT directive fields, as annotated in the slide figure: package (optional), middle dot, function name, flag (optional), stack frame size, arguments size (optional)
Pseudo-registers
• FP: Frame Pointer
• Points to the bottom of the argument list
• Offsets are positive
• Offsets must include a name, e.g. arg+0(FP)
• SP: Stack Pointer
• Points to the top of the space allocated for local variables
• Offsets are negative
• Offsets must include a name, e.g. ptr-8(SP)
• SB: Static Base
• Named offsets from a global base
Calling Convention
• All arguments are passed on the stack
• Offsets from FP
• Return arguments follow input arguments
• Start of return arguments aligned to pointer size
• All registers are caller saved, except:
• Stack pointer register (RSP)
• G context pointer register (R28)
• Frame pointer (R29)
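As a worked example of the stack-based convention above, the following Go sketch computes FP-relative offsets and the total argument size that appears in a TEXT directive. The names field, alignUp and layout are hypothetical helpers written for this illustration, and a slice is modeled as three 8-byte words (base, len, cap).

```go
package main

import "fmt"

// field describes one argument or result: name, size and alignment in bytes.
type field struct {
	name        string
	size, align int
}

// alignUp rounds off up to the next multiple of align.
func alignUp(off, align int) int {
	return (off + align - 1) / align * align
}

// layout assigns offsets from FP: arguments first, then results,
// with the start of the results aligned to the pointer size.
func layout(args, results []field) (map[string]int, int) {
	const ptrSize = 8 // arm64
	offs := make(map[string]int)
	off := 0
	for _, f := range args {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	if len(results) > 0 {
		off = alignUp(off, ptrSize)
	}
	for _, f := range results {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	return offs, off
}

func main() {
	// func castagnoliUpdate(crc uint32, p []byte) uint32
	offs, size := layout(
		[]field{{"crc", 4, 4}, {"p_base", 8, 8}, {"p_len", 8, 8}, {"p_cap", 8, 8}},
		[]field{{"ret", 4, 4}},
	)
	fmt.Printf("ret+%d(FP), argument size %d\n", offs["ret"], size)
}
```

For castagnoliUpdate this agrees with the `$0-36` and `ret+32(FP)` used in the CRC32 assembly later in the deck, and for `func Add(a, b int) int` it gives the 24 bytes in `$0-24`.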
arm64 Stack Frame
(Figure: arm64 stack frame layout without and with a frame pointer, shown between low and high addresses.)
Optimize CRC32 for arm64 – Before
• Pure Go table-driven implementation
src/hash/crc32/crc32_generic.go
    func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
        crc = ^crc
        for _, v := range p {
            crc = tab[byte(crc)^v] ^ (crc >> 8)
        }
        return ^crc
    }
Optimize CRC32 for arm64 – After
• Assembly for arm64
src/hash/crc32/crc32_arm64.s
    // func castagnoliUpdate(crc uint32, p []byte) uint32
    TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
        MOVWU crc+0(FP), R9       // CRC value
        MOVD p+8(FP), R13         // data pointer
        MOVD p_len+16(FP), R11    // len(p)

        CMP $8, R11
        BLT less_than_8

    update:
        MOVD.P 8(R13), R10        // load 8 bytes, then advance the pointer by 8
        CRC32CX R10, R9           // fold the 8 bytes into the CRC
        SUB $8, R11

        CMP $8, R11
        BLT less_than_8

        JMP update
    …
    done:
        MOVWU R9, ret+32(FP)
        RET
Argument layout (offsets from FP):
    crc      0(FP)
    p.base   8(FP)
    p.len    16(FP)
    p.cap    24(FP)
    ret      32(FP)
Optimize CRC32 for arm64 – Result
• Optimization with assembly
• 2X-7X speedup
Optimize SHA256 for arm64
• SHA256 introduction
block rounds K Hash
SHA-256 512bits 64 32bits 32bits 256bits
Optimize SHA256 for arm64 – Message schedule
src/crypto/sha256/sha256block.go
    for i := 0; i < 16; i++ {
        j := i * 4
        w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
    }
    for i := 16; i < 64; i++ {
        v1 := w[i-2]
        t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
        v2 := w[i-15]
        t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
        w[i] = t1 + w[i-7] + t2 + w[i-16]
    }
With the arm64 SHA-256 instructions (pseudocode, four words per iteration):
    for i := 16; i < 64; i+=4 {
        SHA256SU0 Vn.S4, Vd.S4
        SHA256SU1 Vm.S4, Vn.S4, Vd.S4
    }
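The message schedule can be run on its own with the sketch below. It restates the rotations from the pure Go code with math/bits and groups the loop four words at a time to mirror the shape of the SHA256SU0/SHA256SU1 loop; it does not use those instructions. The names sigma0, sigma1 and schedule are chosen for this example only.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// sigma0 and sigma1 are the two small mixing functions of the schedule
// (t2 and t1 on the slide); RotateLeft32 with a negative count rotates right.
func sigma0(x uint32) uint32 {
	return bits.RotateLeft32(x, -7) ^ bits.RotateLeft32(x, -18) ^ (x >> 3)
}

func sigma1(x uint32) uint32 {
	return bits.RotateLeft32(x, -17) ^ bits.RotateLeft32(x, -19) ^ (x >> 10)
}

// schedule expands one 64-byte block into the 64 schedule words.
func schedule(block []byte) [64]uint32 {
	var w [64]uint32
	for i := 0; i < 16; i++ {
		w[i] = binary.BigEndian.Uint32(block[i*4:])
	}
	// Four words per outer iteration, as in the vectorized loop.
	for i := 16; i < 64; i += 4 {
		for j := i; j < i+4; j++ {
			w[j] = sigma1(w[j-2]) + w[j-7] + sigma0(w[j-15]) + w[j-16]
		}
	}
	return w
}

func main() {
	block := make([]byte, 64)
	block[3] = 1 // w[0] = 1, every other input word is 0
	w := schedule(block)
	fmt.Printf("w[16]=%#x w[18]=%#x\n", w[16], w[18])
}
```

Within a group of four, w[j] depends on w[j-2], so the last two words of a group need the first two. Producing four words per step therefore has to account for that dependency rather than treating the four lanes as independent.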
Optimize SHA256 for arm64 – Hash Computation
src/crypto/sha256/sha256block.go
    for i := 0; i < 64; i++ {
        t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]

        t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))

        h = g
        g = f
        f = e
        e = d + t1
        d = c
        c = b
        b = a
        a = t1 + t2
    }
With the arm64 SHA-256 instructions (pseudocode, four rounds per iteration):
    for i := 0; i < 64; i+=4 {
        SHA256H Vm, Vn, Vd.4S
        SHA256H2 Vm, Vn, Vd.4S
    }
Optimize SHA256 for arm64 – Implementation
src/crypto/sha256/sha256block_arm64.s
Optimize SHA256 for arm64 – Result
• Optimization with assembly
• 2X-16X speedup
Optimize IndexByte for arm64 – Before
src/runtime/asm_arm64.s
(Figure: scanning the bytes H E L L O W O R L D … for the byte D; registers R0, R1 and R2 are labeled.)
Optimize IndexByte for arm64 – After
• Assembly implementation with SIMD
• SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16
Compare 16 bytes in parallel
More details:
• Input slice shorter than 16
• Input slice address not 16-byte aligned
• Input slice size not 16-byte aligned
• Count trailing zeros (not leading zeros)
• Implementation:
• https://go-review.googlesource.com/c/go/+/41654
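Go has no portable way to issue CMEQ directly, so the sketch below swaps in a different technique to illustrate the same idea: a scalar "SIMD within a register" comparison of 8 bytes at a time, followed by a trailing-zero count to locate the first match. It is not the arm64 implementation, and indexByteSWAR is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	lo = 0x0101010101010101 // 0x01 in every byte
	hi = 0x8080808080808080 // 0x80 in every byte
)

// indexByteSWAR returns the index of the first c in s, or -1.
func indexByteSWAR(s []byte, c byte) int {
	pattern := uint64(c) * lo // c replicated into all 8 bytes
	i := 0
	for ; i+8 <= len(s); i += 8 {
		// After the XOR, bytes equal to c become zero.
		v := binary.LittleEndian.Uint64(s[i:]) ^ pattern
		// The lowest set bit of m marks the first zero byte of v.
		if m := (v - lo) & ^v & hi; m != 0 {
			// Little-endian load: low bits are low addresses, so count trailing zeros.
			return i + bits.TrailingZeros64(m)/8
		}
	}
	// Tail shorter than one word: plain byte loop.
	for ; i < len(s); i++ {
		if s[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexByteSWAR([]byte("HELLO WORLD"), 'D'))
}
```

The tail loop plays the role of the "size not 16-byte aligned" case in the list above. The little-endian load is why trailing zeros, not leading zeros, give the lowest matching address.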
Optimize IndexByte for arm64 – Result
• Optimization with SIMD
• 1.5X-8X speedup
Work Summary
Disassembler (arm64):
https://go-review.googlesource.com/c/arch/+/43651
https://go-review.googlesource.com/c/arch/+/56810
https://go-review.googlesource.com/c/go/+/58930
https://go-review.googlesource.com/c/go/+/56331
https://go-review.googlesource.com/c/go/+/49530
Assembler (arm64):
https://go-review.googlesource.com/c/go/+/33594
https://go-review.googlesource.com/c/go/+/33595
https://go-review.googlesource.com/c/go/+/41511
https://go-review.googlesource.com/c/go/+/41654
https://go-review.googlesource.com/c/go/+/45850
https://go-review.googlesource.com/c/go/+/54951
https://go-review.googlesource.com/c/go/+/54990
https://go-review.googlesource.com/c/go/+/57852
https://go-review.googlesource.com/c/go/+/58350
https://go-review.googlesource.com/c/go/+/56030
https://go-review.googlesource.com/c/go/+/46438
https://go-review.googlesource.com/c/go/+/41653
Optimizations:
https://go-review.googlesource.com/c/go/+/40074
https://go-review.googlesource.com/c/go/+/61550
https://go-review.googlesource.com/c/go/+/61570
https://go-review.googlesource.com/c/go/+/33597
https://go-review.googlesource.com/c/go/+/64490
https://go-review.googlesource.com/c/go/+/55610
Others:
https://go-review.googlesource.com/c/go/+/61511
https://go-review.googlesource.com/c/go/+/62850
https://go-review.googlesource.com/c/go/+/45112
https://go-review.googlesource.com/c/go/+/44390
https://go-review.googlesource.com/c/go/+/42971
https://go-review.googlesource.com/c/go/+/40511
https://go-review.googlesource.com/c/arch/+/37172
Next Steps
• Crypto optimizations:
• aes, elliptic, …
• SIMD optimizations:
• strings, bytes, runtime, reflect, …
• Compiler SSA arm64 back-end optimizations
• Others
• Internal arm64 linker
• Tools for arm64: race detector, memory sanitizer, …
• New architecture features
• ...
Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!
CGo
(Figure: CGo sits between the Go ABI and the C ABI.)
    package print

    // #include <stdio.h>
    // #include <stdlib.h>
    import "C"
    import "unsafe"

    func Print(s string) {
        cs := C.CString(s)               // copy the Go string into C memory
        C.fputs(cs, (*C.FILE)(C.stdout))
        C.free(unsafe.Pointer(cs))       // the C copy must be freed explicitly
    }
Branch Difference from GNU Assembly
• On arm64: B is alias for JMP, BL is alias for CALL
• Jump to labels
        JMP L1
        NOP
    L1:
        NOP
    L2: NOP
        NOP
        B L2
• Call and indirect jump
        BL $p.foo
        MOV $p·foo, R3
        CALL (R3)
        B (R3)
        MOV 0(R26), R4
        JMP (R4)
• Jump relative to PC (useful in macros)
        JMP 2(PC)
        NOP
        NOP
        NOP
        NOP
        JMP -2(PC)

Optimizing GoLang for High Performance with ARM64 Assembly - SFO17-314

  • 1. © 2017 Arm Limited SFO17-314 Optimizing Golang for High Performance with ARM64 AssemblyWei Xiao Staff Software Engineer Wei.Xiao@arm.com September 27, 2017 Linaro Connect SFO17
  • 2. © 2017 Arm Limited2 Agenda • Introduction • Differences from GNU Assembly • Integrate assembly into Golang • Optimize CRC32 for arm64 • Optimize SHA256 for arm64 • Optimize IndexByte for arm64 • Work Summary and Next steps
  • 3. © 2017 Arm Limited3 Introduction • Assembly optimization benefits • Take advantages of ARMv8 capabilities – Hardware specific instructions (such as SVC, AES, SHA and etc.) – Vector (Single Instruction Multiple Data) Instructions • Others – No need for CGo dependency – Avoid runtime context switching overhead – Optimized code (vs Go compiler) – Faster compilation
  • 4. © 2017 Arm Limited4 Assembly Optimization Current Status • Go Standard packages with assembly optimization crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5 crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512 hash/crc32 math math/big reflect runtime runtime/cgo runtime/internal/atomicruntime/internal/sys strings sync/atomic syscall …… red – arm64 optimization ongoing black – no arm64 optimization
  • 5. © 2017 Arm Limited5 Assembly Terminology • Mnemonic • CALL, MOVW, MOVD, … • Register • R1, F0, V3, … • Immediate • $1, $0x100, … • Memory • (R1), 8(R3), … Registers in AArch64
  • 6. © 2017 Arm Limited6 Instruction Differences from GNU Assembly • Semi-abstract instruction set (Plan 9 from Bell Labs) • Architecture independent mnemonics like MOVD • Some architecture aspects shine through • Assembler may insert prologues, remove ‘unreachable’ instructions • Instructions may be expanded by the assembler • Not all instructions available • BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly 1 // func Add(a, b int) int 2 TEXT ·Add(SB),$0-24 3 MOVD arg1+0(FP), R0 4 MOVD arg2+8(FP), R1 5 ADD R1, R0, R0 6 MOVD R0, ret+16(FP) 7 RET
  • 7. © 2017 Arm Limited7 Operand Differences from GNU Assembly • Data flow from left to right • ADD R1, R2 → R2 += R1 • SUBW R12<<29, R7, R8 → R8 = R7 – (R12<<29) • Memory operands: base + offset • MOVH (R1), R2 → R2 = *R1 • MOVBU 8(R3), R4 → R4 = *(8 + R3) • MOVD mypackage·myvar(SB), R8 → R8 = *myvar • Addresses • MOVD $8(R1), R3 → R3 = R1 + 8 • MOVD $·myvar(SB), R4 → R4 = &myvar package mypackage var myvar int64 Unicode U+00B7
  • 8. Go Assembly Extension for arm64 • Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd • Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T> • Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd • Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>] • Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>] • Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go • Full details • https://go-review.googlesource.com/c/go/+/41654
  • 9. Assembly Build Rule • The toolchain selects the appropriate assembly files according to GOOS and GOARCH • Selection uses file name suffixes, e.g. • sys_linux_arm64.s • sys_darwin_arm64.s • Example: assembly files for hash/crc32 • crc32_amd64p32.s • crc32_amd64.s • crc32_arm64.s • crc32_ppc64le.s • crc32_table_ppc64le.s • crc32_s390x.s
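The suffix rule can be sketched in a few lines of Go. This is a simplified model, not the toolchain's code: `matchFileName` is a hypothetical helper, the OS and architecture lists are deliberately short, and `_test` suffixes and build-constraint comments are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// Deliberately short lists; the real toolchain knows many more values.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{
	"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true,
}

// matchFileName reports whether a file would be selected for goos/goarch,
// looking only at a trailing _GOOS, _GOARCH or _GOOS_GOARCH in its name.
func matchFileName(name, goos, goarch string) bool {
	if dot := strings.Index(name, "."); dot >= 0 {
		name = name[:dot] // drop the extension
	}
	i := strings.Index(name, "_")
	if i < 0 {
		return true // no suffix, so no constraint
	}
	parts := strings.Split(name[i+1:], "_")
	n := len(parts)
	if n >= 2 && knownOS[parts[n-2]] && knownArch[parts[n-1]] {
		return parts[n-2] == goos && parts[n-1] == goarch
	}
	if knownOS[parts[n-1]] {
		return parts[n-1] == goos
	}
	if knownArch[parts[n-1]] {
		return parts[n-1] == goarch
	}
	return true // last element is neither an OS nor an architecture
}

func main() {
	files := []string{
		"sys_linux_arm64.s", "sys_darwin_arm64.s",
		"crc32_amd64.s", "crc32_arm64.s", "crc32_table_ppc64le.s",
	}
	for _, f := range files {
		fmt.Printf("%-24s linux/arm64: %v\n", f, matchFileName(f, "linux", "arm64"))
	}
}
```

With GOOS=linux and GOARCH=arm64, only sys_linux_arm64.s and crc32_arm64.s from this list are selected.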
  • 10. Prototype • A function call is the bridge between Go and assembly • Function declaration • src/runtime/timestub.go • func walltime() (sec int64, nsec int32) • Function assembly implementation • src/runtime/sys_linux_arm64.s • Fields of the TEXT line (figure callouts): package (optional), middle dot, function name, flag (optional), stack frame size, arguments size (optional)
  • 11. Pseudo-registers • FP: Frame Pointer • Points to the bottom of the argument list • Offsets are positive • Offsets must include a name, e.g. arg+0(FP) • SP: Stack Pointer • Points to the top of the space allocated for local variables • Offsets are negative • Offsets must include a name, e.g. ptr-8(SP) • SB: Static Base • Named offsets from a global base (Figures: FP and SP offsets on the stack, drawn from low address to high address)
  • 12. Calling Convention • All arguments are passed on the stack • Offsets are taken from FP • Return values follow the input arguments • The start of the return values is aligned to pointer size • All registers are caller-saved, except: • Stack pointer register (RSP) • G context pointer register (R28) • Frame pointer (R29)
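The offset rules above can be worked through in Go. The sketch below is an illustration only: `field`, `alignUp` and `frameLayout` are hypothetical names, and the sizes assume a 64-bit target with 8-byte pointers. It lays out arguments at their natural alignment, aligns the start of the results to pointer size, and reports the total that appears after the dash in a TEXT line such as `$0-36`.

```go
package main

import "fmt"

// field describes one argument or result word (hypothetical helper type).
type field struct {
	name        string
	size, align uintptr
}

// alignUp rounds off up to a multiple of align (align is a power of two).
func alignUp(off, align uintptr) uintptr {
	return (off + align - 1) &^ (align - 1)
}

// frameLayout returns the FP offset of every field and the total size of
// arguments plus results.
func frameLayout(args, results []field, ptrSize uintptr) (map[string]uintptr, uintptr) {
	offsets := make(map[string]uintptr)
	off := uintptr(0)
	for _, f := range args {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	if len(results) > 0 {
		off = alignUp(off, ptrSize) // results start pointer-aligned
	}
	for _, f := range results {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	return offsets, off
}

func main() {
	// func castagnoliUpdate(crc uint32, p []byte) uint32
	// A slice argument occupies three words: base, len, cap.
	args := []field{{"crc", 4, 4}, {"p_base", 8, 8}, {"p_len", 8, 8}, {"p_cap", 8, 8}}
	results := []field{{"ret", 4, 4}}
	offsets, size := frameLayout(args, results, 8)
	for _, name := range []string{"crc", "p_base", "p_len", "p_cap", "ret"} {
		fmt.Printf("%s+%d(FP)\n", name, offsets[name])
	}
	fmt.Println("argument size:", size)
}
```

For `func Add(a, b int) int` the same rules give offsets 0, 8 and 16 and a total of 24, matching `$0-24` in the earlier Add example.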
  • 13. arm64 Stack Frame (Figure: stack frame layout without and with a frame pointer, drawn from low address to high address)
  • 14. Optimize CRC32 for arm64 – Before • Pure Go table-driven implementation • src/hash/crc32/crc32_generic.go
      func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
          crc = ^crc
          for _, v := range p {
              crc = tab[byte(crc)^v] ^ (crc >> 8)
          }
          return ^crc
      }
  • 15. Optimize CRC32 for arm64 – After • Assembly for arm64 • src/hash/crc32/crc32_arm64.s
      // func castagnoliUpdate(crc uint32, p []byte) uint32
      TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
          MOVWU crc+0(FP), R9     // CRC value
          MOVD  p+8(FP), R13      // data pointer
          MOVD  p_len+16(FP), R11 // len(p)

          CMP $8, R11
          BLT less_than_8

      update:
          MOVD.P 8(R13), R10
          CRC32CX R10, R9
          SUB $8, R11

          CMP $8, R11
          BLT less_than_8

          JMP update
          …
      done:
          MOVWU R9, ret+32(FP)
          RET
    • Argument layout (figure): crc at 0(FP), p.base at 8(FP), p.len at 16(FP), then p.cap, ret at 32(FP)
  • 16. Optimize CRC32 for arm64 – Result • Optimization with assembly • 2X-7X speedup
  • 17. Optimize SHA256 for arm64 • SHA256 introduction (table columns: block, rounds, K, Hash) • SHA-256: 512 bits, 64, 32 bits, 32 bits, 256 bits
  • 18. Optimize SHA256 for arm64 – Message schedule • src/crypto/sha256/sha256block.go
      for i := 0; i < 16; i++ {
          j := i * 4
          w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
      }
      for i := 16; i < 64; i++ {
          v1 := w[i-2]
          t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
          v2 := w[i-15]
          t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
          w[i] = t1 + w[i-7] + t2 + w[i-16]
      }
    • arm64 equivalent (pseudocode):
      for i := 16; i < 64; i+=4 {
          SHA256SU0 Vn.S4, Vd.S4
          SHA256SU1 Vm.S4, Vn.S4, Vd.S4
      }
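The message schedule can be run on its own to see what the SHA256SU0/SHA256SU1 pair has to produce. The sketch below is a scalar Go version, not the arm64 code: `schedule` is a hypothetical standalone function, rotations are written with `math/bits`, and the padded single block for the message "abc" is built by hand as test input.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// schedule expands one 64-byte block into the 64 message-schedule words.
// RotateLeft32 with a negative count is a rotate right, so each term is
// the same as the shift-and-or form in the Go source.
func schedule(p []byte) [64]uint32 {
	var w [64]uint32
	for i := 0; i < 16; i++ {
		w[i] = binary.BigEndian.Uint32(p[i*4:])
	}
	for i := 16; i < 64; i++ {
		v1 := w[i-2]
		t1 := bits.RotateLeft32(v1, -17) ^ bits.RotateLeft32(v1, -19) ^ (v1 >> 10)
		v2 := w[i-15]
		t2 := bits.RotateLeft32(v2, -7) ^ bits.RotateLeft32(v2, -18) ^ (v2 >> 3)
		w[i] = t1 + w[i-7] + t2 + w[i-16]
	}
	return w
}

// abcBlock returns the single padded block for the 3-byte message "abc":
// the message, a 0x80 byte, zeros, and the bit length (24) in the last byte.
func abcBlock() []byte {
	block := make([]byte, 64)
	copy(block, "abc")
	block[3] = 0x80
	block[63] = 24
	return block
}

func main() {
	w := schedule(abcBlock())
	for i := 14; i < 20; i++ {
		fmt.Printf("w[%d] = %#08x\n", i, w[i])
	}
}
```

Each word from index 16 on depends on four earlier words, which is why the hardware instructions can produce four new words per SHA256SU0/SHA256SU1 pair.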
  • 19. Optimize SHA256 for arm64 – Hash Computation • src/crypto/sha256/sha256block.go
      for i := 0; i < 64; i++ {
          t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]
          t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))
          h = g
          g = f
          f = e
          e = d + t1
          d = c
          c = b
          b = a
          a = t1 + t2
      }
    • arm64 equivalent (pseudocode):
      for i := 0; i < 64; i+=4 {
          SHA256H  Vm, Vn, Vd.4S
          SHA256H2 Vm, Vn, Vd.4S
      }
  • 20. Optimize SHA256 for arm64 – Implementation • src/crypto/sha256/sha256block_arm64.s
  • 21. Optimize SHA256 for arm64 – Result • Optimization with assembly • 2X-16X speedup
  • 22. Optimize IndexByte for arm64 – Before • src/runtime/asm_arm64.s (Figure: the bytes H E L L O W O R L D … searched for the byte D, annotated with registers R0, R1 and R2)
  • 23. Optimize IndexByte for arm64 – After • Assembly implementation with SIMD • SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16 • Compares 16 bytes in parallel • More details: • Input slice shorter than 16 bytes • Input slice address not 16-byte aligned • Input slice size not a multiple of 16 bytes • Count trailing zeros (not leading zeros) • Implementation: • https://go-review.googlesource.com/c/go/+/41654
  • 24. Optimize IndexByte for arm64 – Result • Optimization with SIMD • 1.5X-8X speedup
  • 25. Work Summary • Disassembler (arm64): https://go-review.googlesource.com/c/arch/+/43651 https://go-review.googlesource.com/c/arch/+/56810 https://go-review.googlesource.com/c/go/+/58930 https://go-review.googlesource.com/c/go/+/56331 https://go-review.googlesource.com/c/go/+/49530 • Assembler (arm64): https://go-review.googlesource.com/c/go/+/33594 https://go-review.googlesource.com/c/go/+/33595 https://go-review.googlesource.com/c/go/+/41511 https://go-review.googlesource.com/c/go/+/41654 https://go-review.googlesource.com/c/go/+/45850 https://go-review.googlesource.com/c/go/+/54951 https://go-review.googlesource.com/c/go/+/54990 https://go-review.googlesource.com/c/go/+/57852 https://go-review.googlesource.com/c/go/+/58350 https://go-review.googlesource.com/c/go/+/56030 https://go-review.googlesource.com/c/go/+/46438 https://go-review.googlesource.com/c/go/+/41653 • Optimizations: https://go-review.googlesource.com/c/go/+/40074 https://go-review.googlesource.com/c/go/+/61550 https://go-review.googlesource.com/c/go/+/61570 https://go-review.googlesource.com/c/go/+/33597 https://go-review.googlesource.com/c/go/+/64490 https://go-review.googlesource.com/c/go/+/55610 • Others: https://go-review.googlesource.com/c/go/+/61511 https://go-review.googlesource.com/c/go/+/62850 https://go-review.googlesource.com/c/go/+/45112 https://go-review.googlesource.com/c/go/+/44390 https://go-review.googlesource.com/c/go/+/42971 https://go-review.googlesource.com/c/go/+/40511 https://go-review.googlesource.com/c/arch/+/37172
  • 26. Next Steps • Crypto optimizations: • aes, elliptic, … • SIMD optimizations: • strings, bytes, runtime, reflect, … • Compiler SSA arm64 back-end optimizations • Others • Internal arm64 linker • Tools for arm64: race detector, memory sanitizer, … • New architecture features • ...
  • 28. CGo (Figure labels: Go ABI, C ABI, CGo)
      package print

      // #include <stdio.h>
      // #include <stdlib.h>
      import "C"
      import "unsafe"

      func Print(s string) {
          cs := C.CString(s)
          C.fputs(cs, (*C.FILE)(C.stdout))
          C.free(unsafe.Pointer(cs))
      }
  • 29. Branch Difference from GNU Assembly • On arm64: B is an alias for JMP, BL is an alias for CALL
    • Jump to labels
          JMP L1
          NOP
      L1:
          NOP
      L2:
          NOP
          NOP
          B L2
    • Call and indirect jump
          BL p·foo(SB)
          MOVD $p·foo(SB), R3
          CALL (R3)
          B (R3)
          MOVD 0(R26), R4
          JMP (R4)
    • Jump relative to PC (useful in macros!)
          JMP 2(PC)
          NOP
          NOP
          NOP
          NOP
          JMP -2(PC)