Presented at FPGA2013 http://fpganetworks.org/FPGA2013/
Abstract: Field programmable gate arrays (FPGAs) are extensively used for rapid prototyping in embedded system applications. While hardware acceleration can be done via specialized processors like a Graphics Processing Unit (GPU), it can also be accomplished with FPGAs for more specialized scenarios. GPUs essentially consist of massively parallel cores and have high memory bandwidth; FPGAs, on the other hand, provide flexibility in terms of customizable I/O and computational resources. In this paper, we explore the usage of GPUs and FPGAs as cryptographic co-processors in streaming dataflow systems with very high data ingestion rates. Two classic lightweight encryption algorithms, Tiny Encryption Algorithm (TEA) and Extended Tiny Encryption Algorithm (XTEA), are targeted for implementation on GPUs and FPGAs. The GPU implementations of TEA and XTEA in this study show a maximum speedup of 13x over a CPU-based implementation. The pipelined FPGA implementation achieves a throughput 6-9x higher than the GPU for small plaintext sizes.
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core processors
Vivek Venugopal and Devu Manikantan Shila {venugov, manikad}@utrc.utc.com
Introduction
[Figure: application scenarios. Panels: Smart meter application; Unmanned Autonomous Vehicle. Labels: Gateway to Internet; GPU + ARM (NVIDIA CARMA); FPGA + ARM (Xilinx Zynq); Planning Computer; Flight Control and Navigation Computer; Encrypted communication.]
• In smart grids, sensitive information such as power consumption, price updates, or outage awareness is exchanged between the meters and the power utility company in real time over the Internet.
• Unmanned Autonomous Vehicles (UAVs) continuously exchange dynamic information regarding the urban environment with a gateway. The gateway also provides feedback regarding the optimization parameters that need to be fed into the UAV's path planning algorithm for mapping different routes to reach its destination safely.
• Cyber attacks on such critical and dynamic information can lead to severe losses of resources and finance.

Motivation
• All the information from/to these smart meters needs to be decrypted/encrypted at the gateway, which in turn can lead to very large response times. A larger response time implies poorer performance in terms of both throughput and latency.
• The continuous stream of data from the UAV regarding the evidence grid needs to be encrypted fast.
• FPGAs and GPUs can be used in gateways to speed up the TEA/XTEA encryption and decryption of bulk information for improved throughput and latency.

Tiny Encryption Algorithm (TEA)
[Figure: TEA encrypt/decrypt data path on 32-bit words, half round 1 and half round 2. Half round 1 uses keys k0 and k1, half round 2 uses k2 and k3. Each half round XORs the terms (v << 4) + k, v + sum, and (v >> 5) + k, then adds (encrypt) or subtracts (decrypt) the result to produce v0_new and v1_new.]
• TEA uses addition, XOR and shift operations on 32-bit words and has a very small code footprint.
• TEA has security holes and weaknesses for smaller numbers of rounds, especially the avalanche effect seen for 6 rounds.

Extended Tiny Encryption Algorithm (XTEA)
[Figure: XTEA encrypt/decrypt data path on 32-bit words, half round 1 and half round 2. Each half round XORs v << 4 with v >> 5, adds v, and XORs the result with sum + key word (sum0 + kx in half round 1, sum1 + ky in half round 2), then adds (encrypt) or subtracts (decrypt) it to produce v0_new and v1_new.]
• The Extended Tiny Encryption Algorithm (XTEA) was introduced after weaknesses for smaller numbers of rounds were found in TEA.
• In XTEA, the key scheduling is modified to reflect different patterns for mixing the data and key continuously per round.

Implementation platforms and Results
• Nvidia's Tesla C2070 high-end GPU, 2 hexa-core Intel Xeon processors, Nvidia's GeForce GT 650M notebook GPU consisting of 384 cores, quad-core Intel Core i7 CPU.
• Xilinx's Zynq-7000 SoC ZC702 evaluation board. The Zynq-7000 platform consists of a dual ARM Cortex-A9 processor clocked at 800 MHz and an Artix-7 FPGA as the programmable logic.
[Figure: Throughput (Mbps) comparison of TEA; Throughput (Mbps) comparison of XTEA. x-axis: plaintext size (8 KB, 16 KB, 8 MB, 128 MB, 1 GB). y-axis: throughput in Mbps (0 to 8000). Series: Intel Xeon X5650, Nvidia C2070, Intel Quad core i7, Nvidia GT650M, Zynq.]
[Figure: Zynq internal block diagram. Labels include: Processing System, Programmable Logic, dual ARM Cortex-A9 MPCore (800 MHz), AXI interconnect, memory interfaces (Flash, DRAM, SRAM), fixed and custom peripherals, GigE, USB, CAN, I2C, SD, UART, GPIO, JTAG, PCIe, SelectIO resources, 2x 12-bit MSPS ADC, analog monitors.]
[Figure: Hardware in Loop setup. Panels: running on Zynq board; running in ISim.]

GPU Implementation
[Figure: Streaming Multiprocessor (SMX) architecture. GT650M: 2 SMX with 192 cores each. Inside SMX: 192 single precision CUDA cores, 64 double precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).]
GPU implementation flow, in order:
• Copy input data and keys to GPU memory.
• Pre-compute sum values for each round and store them in shared memory.
• Calculate ciphers for blocks in parallel.
• Copy ciphers back to CPU.

Conclusion
• GPUs and FPGAs provide better throughput for both TEA and XTEA as compared to CPUs.
• FPGAs perform better for smaller plaintext sizes whereas GPUs are better for larger plaintext sizes.
• In terms of development time and cost, GPUs are better suited as embedded cryptography co-processors as compared to FPGAs.
• Future research efforts may address the use of the Zynq platform as a complete, low-cost cryptographic co-processor for more complex cryptographic algorithms.
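
For readers who want something concrete, the TEA and XTEA half rounds and the GPU implementation flow described on this poster can be sketched in plain C. This is a minimal illustration, not the authors' implementation. The poster does not state the round count or the key-schedule constant, so ROUNDS = 32 and DELTA = 0x9E3779B9 are assumptions taken from the publicly available reference code for these ciphers. All function names are hypothetical. The last two functions are CPU stand-ins for the GPU steps: a local array plays the role of the shared-memory table of pre-computed sum values, and the outer loop plays the role of one GPU thread per block.

```c
#include <stddef.h>
#include <stdint.h>

#define ROUNDS 32u          /* assumption: not stated on the poster */
#define DELTA  0x9E3779B9u  /* assumption: not stated on the poster */

/* TEA: encrypt one 64-bit block v[0..1] in place with the 128-bit key k[0..3]. */
void tea_encrypt(uint32_t v[2], const uint32_t k[4]) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    for (unsigned i = 0; i < ROUNDS; i++) {
        sum += DELTA;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]); /* half round 1 */
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]); /* half round 2 */
    }
    v[0] = v0;
    v[1] = v1;
}

/* TEA: decrypt by undoing the half rounds in reverse order with subtraction. */
void tea_decrypt(uint32_t v[2], const uint32_t k[4]) {
    uint32_t v0 = v[0], v1 = v[1], sum = DELTA * ROUNDS; /* wraps modulo 2^32 */
    for (unsigned i = 0; i < ROUNDS; i++) {
        v1 -= ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
        v0 -= ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        sum -= DELTA;
    }
    v[0] = v0;
    v[1] = v1;
}

/* XTEA: the key word is picked from bits of sum, so it varies per half round. */
void xtea_encrypt(uint32_t v[2], const uint32_t k[4]) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    for (unsigned i = 0; i < ROUNDS; i++) {
        v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + k[sum & 3]);
        sum += DELTA;
        v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + k[(sum >> 11) & 3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* XTEA: decrypt by undoing the half rounds in reverse order. */
void xtea_decrypt(uint32_t v[2], const uint32_t k[4]) {
    uint32_t v0 = v[0], v1 = v[1], sum = DELTA * ROUNDS;
    for (unsigned i = 0; i < ROUNDS; i++) {
        v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + k[(sum >> 11) & 3]);
        sum -= DELTA;
        v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + k[sum & 3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* Stand-in for "pre-compute sum values": sums[i] is the sum used in TEA round i. */
void tea_precompute_sums(uint32_t sums[ROUNDS]) {
    uint32_t sum = 0;
    for (unsigned i = 0; i < ROUNDS; i++) {
        sum += DELTA;
        sums[i] = sum;
    }
}

/* Stand-in for "calculate ciphers for blocks in parallel": data holds n_blocks
   64-bit blocks as pairs of 32-bit words. Each outer iteration reads and writes
   only its own block, so on a GPU it could be assigned to a separate thread. */
void tea_encrypt_blocks(uint32_t *data, size_t n_blocks, const uint32_t k[4]) {
    uint32_t sums[ROUNDS];
    tea_precompute_sums(sums);
    for (size_t b = 0; b < n_blocks; b++) {
        uint32_t v0 = data[2 * b], v1 = data[2 * b + 1];
        for (unsigned i = 0; i < ROUNDS; i++) {
            v0 += ((v1 << 4) + k[0]) ^ (v1 + sums[i]) ^ ((v1 >> 5) + k[1]);
            v1 += ((v0 << 4) + k[2]) ^ (v0 + sums[i]) ^ ((v0 >> 5) + k[3]);
        }
        data[2 * b] = v0;
        data[2 * b + 1] = v1;
    }
}
```

In this sketch the sum values depend only on the round index, not on the data or the key, which is why they can be computed once and shared by every block. The two copy steps of the GPU flow have no counterpart here because everything stays in host memory.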
References
[1] D. J. Wheeler and R. M. Needham. TEA, a tiny encryption algorithm, 1995.
[2] D. J. Wheeler and R. M. Needham. TEA extensions. Technical report, Cambridge University, England, October 1997.
[3] Xilinx Inc. Xilinx Zynq-7000 SoC ZC702 Evaluation kit.
[4] Nvidia Inc. Nvidia Tesla C2070 GPU Computing Processor; Nvidia GeForce GT650M Notebook GPU. Available online (last accessed February 2012).