4. Silicon Solutions
Decision table for designers of real-time
“Choosing the Right Architecture for Real-Time Signal Processing Designs”, Leon Adams, Texas Instruments
8. 8
Why do we need DSP processors?
The sum of products (SOP), or multiply-accumulate (MAC), is the key element in most DSP algorithms:

Algorithm                           Equation
Finite Impulse Response Filter      $y(n) = \sum_{k=0}^{M} a_k \, x(n-k)$
Infinite Impulse Response Filter    $y(n) = \sum_{k=0}^{M} a_k \, x(n-k) + \sum_{k=1}^{N} b_k \, y(n-k)$
Convolution                         $y(n) = \sum_{k=0}^{N} x(k) \, h(n-k)$
Discrete Fourier Transform          $X(k) = \sum_{n=0}^{N-1} x(n) \exp\left[-j (2\pi/N) nk\right]$
Discrete Cosine Transform           $F(u) = \sum_{x=0}^{N-1} c(u) \, f(x) \cos\left[\frac{\pi}{2N} u (2x+1)\right]$
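Each equation in the table is a loop of multiply-accumulate steps. As an illustration, here is a minimal C sketch of the FIR row. The function name fir, the use of double, and the convention that samples before x[0] are zero are assumptions made for this example, not part of the original slides.

```c
#include <stddef.h>

/* Direct-form FIR filter: y(n) = sum_{k=0}^{M} a[k] * x(n-k).
 * Assumption: samples before x[0] are treated as zero. */
void fir(const double *a, size_t M, const double *x, double *y, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        double acc = 0.0;                    /* accumulator for one output sample */
        for (size_t k = 0; k <= M && k <= n; k++) {
            acc += a[k] * x[n - k];          /* one multiply-accumulate (MAC) */
        }
        y[n] = acc;
    }
}
```

Once the filter is "full" (n >= M), every output sample costs M + 1 MAC operations, which is why single-cycle multiply and add hardware matters.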
9. 9
Hardware vs. Software multiplication
DSP processors are optimized to perform multiplication and addition: both operations are done in hardware, in one cycle.
Example: 4-bit unsigned multiply, 1011 x 1110 (11 x 14 = 154).

Hardware:
    1011
  x 1110
--------
10011010      Cycle 1

Software (shift and add):
    1011
  x 1110
--------
    0000      Cycle 1
   1011.      Cycle 2
  1011..      Cycle 3
 1011...      Cycle 4
--------
10011010      Cycle 5
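The software column is the classic shift-and-add method: one partial product per multiplier bit, then a final sum. A minimal C sketch of that method for 4-bit unsigned operands follows; the function name mul4_shift_add is hypothetical and chosen for this example.

```c
#include <stdint.h>

/* Software shift-and-add multiply of two unsigned 4-bit values.
 * Each iteration examines one bit of the multiplier and, if it is set,
 * adds the multiplicand shifted left by that bit position. */
uint8_t mul4_shift_add(uint8_t multiplicand, uint8_t multiplier)
{
    uint8_t product = 0;                 /* 15 * 15 = 225 still fits in 8 bits */
    for (int bit = 0; bit < 4; bit++) {
        if (multiplier & (1u << bit)) {
            product += (uint8_t)(multiplicand << bit);  /* shifted partial product */
        }
    }
    return product;
}
```

For the slide's operands, mul4_shift_add(0xB, 0xE) adds 1011 shifted by 1, 2 and 3 positions (bit 0 of 1110 is clear), giving 10011010. A hardware multiplier produces the same result in a single cycle.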
11. 11
C6000 System Block Diagram
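(Figure: the CPU contains two sets of functional units, .D1, .M1, .L1, .S1 and .D2, .M2, .L2, .S2, register files A (A0-A15) and B (B0-B15), and control registers. Internal buses connect the CPU to internal memory, external memory and the peripherals.)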
12. 12
C6000 Central Processing Unit
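(Figure: C6000 block diagram centered on the CPU: functional units .D1, .M1, .L1, .S1 and .D2, .M2, .L2, .S2, register files A (A0-A15) and B (B0-B15), and control registers, with internal memory, external memory and peripherals attached through the internal buses.)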
13. 13
Implementation of Sum of Products (SOP)
SOP is the key element of most DSP algorithms. Let's write the code for this algorithm and, at the same time, discover the C6000 architecture. The implementation in this module will be done in assembly.
Two basic operations are required for this algorithm:
(1) Multiplication
(2) Addition
Therefore, two basic instructions are required.
$Y = \sum_{n=1}^{N} a_n x_n = a_1 x_1 + a_2 x_2 + \dots + a_N x_N$
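Before moving to assembly, the same SOP can be written as a C sketch that makes the two operations explicit. The function name sop, the 16-bit inputs and the 32-bit accumulator are assumptions chosen to match the halfword loads and 32-bit registers used in this module; the sketch does not saturate on overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products Y = a1*x1 + a2*x2 + ... + aN*xN.
 * Assumption: 16-bit coefficients and samples, 32-bit accumulator (no saturation). */
int32_t sop(const int16_t *a, const int16_t *x, size_t N)
{
    int32_t y = 0;
    for (size_t n = 0; n < N; n++) {
        int32_t prod = (int32_t)a[n] * x[n];  /* (1) multiplication */
        y += prod;                            /* (2) addition */
    }
    return y;
}
```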
14. 14
Multiply (MPY)
The multiplication of a1 by x1 is done in assembly by the following instruction:
        MPY     a1, x1, Y
This instruction is performed by a multiplier unit called “.M”.
$Y = \sum_{n=1}^{N} a_n x_n = a_1 x_1 + a_2 x_2 + \dots + a_N x_N$
15. 15
Multiply (.M unit)
$Y = \sum_{n=1}^{40} a_n x_n$
The .M unit performs multiplications in hardware:
        MPY   .M   a1, x1, Y
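The slides use a1, x1 and Y as symbolic operands. On the C6000, MPY multiplies the signed 16 least significant bits of its two source operands and writes a 32-bit product. A minimal C model of that behavior follows; the function names are hypothetical, and the model ignores timing (on the real CPU the MPY result arrives after one delay slot).

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of a 32-bit register value. */
static int32_t low16_signed(uint32_t reg)
{
    int32_t v = (int32_t)(reg & 0xFFFFu);
    if (v & 0x8000) {
        v -= 0x10000;            /* bit 15 set: the 16-bit value is negative */
    }
    return v;
}

/* C model of "MPY src1, src2, dst": signed 16 LSB x signed 16 LSB -> 32-bit.
 * The upper 16 bits of each source are ignored. */
int32_t mpy_model(uint32_t src1, uint32_t src2)
{
    return low16_signed(src1) * low16_signed(src2);
}
```

Because the operands are limited to 16 bits, the largest product magnitude is 2^30, so the 32-bit destination never overflows.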
17. 17
Add (.L unit)
$Y = \sum_{n=1}^{40} a_n x_n$
        MPY   .M   a1, x1, prod
        ADD   .L   Y, prod, Y
(Figure: .M and .L units.)
The C6000 uses registers to hold the operands, so let's change this code.
18. 18
Register File - A
$Y = \sum_{n=1}^{40} a_n x_n$
        MPY   .M   a1, x1, prod
        ADD   .L   Y, prod, Y
(Figure: .M and .L units connected to Register File A, registers A0-A15, each 32 bits wide; a1, x1, prod and Y are held in A0, A1, A3 and A4.)
Let us correct this by replacing a1, x1, prod and Y with the registers shown above.
19. 19
Specifying Register Names
$Y = \sum_{n=1}^{40} a_n x_n$
        MPY   .M   A0, A1, A3
        ADD   .L   A4, A3, A4
Register File A contains 16 registers (A0-A15), each 32 bits wide.
(Figure: .M and .L units connected to Register File A; a1, x1, prod and Y are held in A0, A1, A3 and A4.)
20. 20
Data loading
Q: How do we load the operands into the registers?
(Figure: .M and .L units connected to Register File A (A0-A15, 32 bits wide), holding a1, x1, prod and Y.)
21. 21
Load Unit “.D”
(Figure: the .D unit connects the data memory to Register File A (A0-A15, 32 bits wide), which feeds the .M and .L units.)
Q: How do we load the operands into the registers?
A: The operands are loaded into the registers from memory using the .D unit.
Q: Which instruction(s) can be used for loading operands from memory into the registers?
A: The load instructions (LDB, LDH, LDW, LDDW).
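The four loads differ in how many bits they fetch: LDB a byte, LDH a halfword, LDW a 32-bit word and LDDW a 64-bit doubleword (into a register pair); LDB and LDH sign-extend to 32 bits. The C sketch below models those widths on a byte array. The function names are hypothetical, and little-endian byte order is assumed (the C6000 can be configured for either endianness).

```c
#include <stdint.h>

/* Assumption: little-endian memory, modelled as a byte array. */

int32_t ldb_model(const uint8_t *mem)        /* LDB: 8 bits, sign-extended */
{
    int32_t v = mem[0];
    if (v & 0x80) v -= 0x100;
    return v;
}

int32_t ldh_model(const uint8_t *mem)        /* LDH: 16 bits, sign-extended */
{
    int32_t v = (int32_t)(mem[0] | (mem[1] << 8));
    if (v & 0x8000) v -= 0x10000;
    return v;
}

uint32_t ldw_model(const uint8_t *mem)       /* LDW: 32 bits */
{
    return (uint32_t)mem[0] | ((uint32_t)mem[1] << 8) |
           ((uint32_t)mem[2] << 16) | ((uint32_t)mem[3] << 24);
}

uint64_t lddw_model(const uint8_t *mem)      /* LDDW: 64 bits (a register pair) */
{
    return (uint64_t)ldw_model(mem) | ((uint64_t)ldw_model(mem + 4) << 32);
}
```

The SOP code in this module uses LDH because the coefficients and samples are 16-bit values.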
22. 22
Using the Load Instructions
$Y = \sum_{n=1}^{40} a_n x_n$
        LDH   .D   *A5, A0
        LDH   .D   *A6, A1
        MPY   .M   A0, A1, A3
        ADD   .L   A4, A3, A4
(Figure: the .D unit loads a1 and x1 from the data memory into Register File A (A0-A15, 32 bits wide); the .M and .L units produce prod and Y.)
23. 23
Creating a loop
So far we have implemented the SOP for one tap only, i.e. Y = a1 * x1. So let's create a loop so that we can implement the SOP for N taps.
$Y = \sum_{n=1}^{40} a_n x_n$
        LDH   .D   *A5, A0
        LDH   .D   *A6, A1
        MPY   .M   A0, A1, A3
        ADD   .L   A4, A3, A4
24. 24
Create a label to branch
loop    LDH   .D   *A5, A0
        LDH   .D   *A6, A1
        MPY   .M   A0, A1, A3
        ADD   .L   A4, A3, A4
$Y = \sum_{n=1}^{40} a_n x_n$
25. 25
Add a branch instruction, B.
loop    LDH   .D   *A5, A0
        LDH   .D   *A6, A1
        MPY   .M   A0, A1, A3
        ADD   .L   A4, A3, A4
        B     .?   loop
$Y = \sum_{n=1}^{40} a_n x_n$
26. 26
Which unit is used by the B instruction?
Answer: the .S unit.
$Y = \sum_{n=1}^{40} a_n x_n$
(Figure: the .S unit is added next to the .M, .L and .D units; all share Register File A (A0-A15, 32 bits wide), and the .D unit accesses the data memory.)
loop    LDH   .D   *A5, A0
        LDH   .D   *A6, A1
        MPY   .M   A0, A1, A3
        ADD   .L   A4, A3, A4
        B     .S   loop
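As written, the loop has no exit condition and does not advance A5 and A6, so it would multiply the same two values forever. The C sketch below shows what the loop is meant to compute once the pointers are advanced after each load and a tap counter stops it. The counter parameter count is a hypothetical addition, the register names are reused as variable names, and the sketch ignores pipeline timing (on the C6000, loaded values and multiply results arrive after delay slots).

```c
#include <stdint.h>

/* C equivalent of the SOP loop.
 * Assumptions: A5 points to the coefficients a[], A6 to the samples x[],
 * both pointers advance after each load, and count taps are processed. */
int32_t sop_loop(const int16_t *A5, const int16_t *A6, int count)
{
    int32_t A4 = 0;                 /* Y accumulator */
    while (count > 0) {             /* loop: */
        int32_t A0 = *A5++;         /* LDH .D *A5, A0, then advance A5 */
        int32_t A1 = *A6++;         /* LDH .D *A6, A1, then advance A6 */
        int32_t A3 = A0 * A1;       /* MPY .M A0, A1, A3 */
        A4 = A4 + A3;               /* ADD .L A4, A3, A4 */
        count--;                    /* one tap done */
    }                               /* B .S loop, taken while taps remain */
    return A4;
}
```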
27. 27
How can we add more processing power to this processor?
(Figure: one data path with the .S, .M, .L and .D units, Register File A (A0-A15, 32 bits wide) and the data memory.)
(1) Increase the clock frequency.
(2) Increase the number of processing units.
28. 28
Increase the number of processing units
(Figure: a second data path is added, with units .S2, .M2, .L2 and .D2 and Register File B (B0-B15, 32 bits wide), next to the first data path (.S, .M, .L and .D units with Register File A, A0-A15), alongside the data memory.)
29. 29
C6211 Instruction Set (by unit)
.S unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKL, MVKH, MVKLH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.M unit: MPY, MPYH, SMPY, SMPYH
.L unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.D unit: ADD, ADDA, LDB/H/W, MV, NEG, STB/H/W, SUB, SUBA, ZERO
Other: IDLE, NOP
30. 30
C language vs. Assembly
Source       Tool                 Efficiency   Effort
C            Compiler Optimizer   70-100%      Low
Linear ASM   Assembly Optimizer   95-100%      Med
ASM          Hand Optimize        100%         High
35. 35
C6416 Memory Map
Start address   Region
0000_0000       Internal memory (L2): 1024KB, unified (program or data)
0180_0000       On-chip peripherals
6000_0000       EMIFB external memory: 64MB x 4
8000_0000       EMIFA external memory: 256MB x 4
FFFF_FFFF       Top of the address space
External memory can be asynchronous (SRAM, ROM, etc.) or synchronous (SBSRAM, SDRAM).
Level 1 cache: 16KB program (L1P) and 16KB data (L1D); not in the memory map.
(Figure: the CPU connects to the 16K L1P and 16K L1D caches, which connect to the 1024K L2.)
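To make the map concrete, here is a hypothetical C helper that classifies a 32-bit address. The region ends are derived from the sizes above (1024KB internal, EMIFB 4 x 64MB, EMIFA 4 x 256MB) and assume the four spaces of each EMIF are contiguous; the extent of the on-chip peripheral space is not given, so peripheral and reserved addresses all fall into REGION_OTHER.

```c
#include <stdint.h>

/* Regions recoverable from the C6416 memory map (hypothetical helper). */
enum mem_region {
    REGION_INTERNAL,   /* 0000_0000: 1024KB internal memory       */
    REGION_EMIFB,      /* 6000_0000: EMIFB, 4 x 64MB = 256MB       */
    REGION_EMIFA,      /* 8000_0000: EMIFA, 4 x 256MB = 1GB        */
    REGION_OTHER       /* on-chip peripherals, reserved space, etc. */
};

enum mem_region classify_c6416_address(uint32_t addr)
{
    if (addr < 0x00100000u)                          /* 1024KB = 0x0010_0000 bytes */
        return REGION_INTERNAL;
    if (addr >= 0x60000000u && addr < 0x70000000u)   /* 0x6000_0000 + 256MB */
        return REGION_EMIFB;
    if (addr >= 0x80000000u && addr < 0xC0000000u)   /* 0x8000_0000 + 1GB */
        return REGION_EMIFA;
    return REGION_OTHER;
}
```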
36. 36
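Memory Allocation
(Figure: C source code passes through the Compiler and the Assembler to produce a COFF object file with Text, Data and Bss sections. The linker's MEMORY and SECTIONS directives describe the memory layout and place these sections, together with the Stack and Heap, into the target memory (ROM, external RAM and internal RAM, addresses 0x00000 to 0xfffff), producing the final COFF object file.)
Example linker command file:
MEMORY
{
    ISRAM : origin = 0x00000000, len = 0x00100000
}
SECTIONS
{
    .text > ISRAM
}
This places the .text (code) section in ISRAM, a 1MB (0x00100000-byte) region starting at address 0.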
37. 37
What is stored in memory?
Code
Constants
Global and static variables
Local variables
Dynamic memory
(Figure: memory spanning addresses 0x00000 to 0xfffff.)
38. 38
How is memory organized?
text: code and constant data
data: initialized global and static variables
bss: uninitialized global and static variables
stack: local variables, function return addresses and function arguments
heap: dynamic memory
(Figure: memory from 0x00000 to 0xfffff divided into stack, heap, bss, data and text regions.)
39. 39
How is memory allocated?
#include <stdlib.h>

long array[100];
long bufsize = 100;

char *f1(int n);            /* prototype so main() can call f1() */

int main(void) {
    int i;
    char *buf;
    i = 10;
    buf = f1(i);
    return 0;
}

char *f1(int n) {
    int k;
    return malloc(bufsize);
}
Resulting memory contents (0x00000 to 0xfffff):
text: program code (main, f1)
data: bufsize = 100
bss: array[100]
heap: 100-byte block returned by malloc()
stack: main return address, i, buf, f1 argument n, f1 return address, k
40. 40
Memory Allocation & Deallocation
How, and when, is memory allocated?
Global and static variables: at program startup
Local variables: at function call
Dynamic memory: by malloc()
How, and when, is memory deallocated?
Global and static variables: at program finish
Local variables: at function return
Dynamic memory: by free()
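The three lifetimes can be seen side by side in a short C sketch. The names global_count, static_counter, local_counter and make_buffer are hypothetical, chosen only for this illustration.

```c
#include <stdlib.h>

int global_count = 0;    /* global: allocated at program startup, freed at program finish */

int static_counter(void)
{
    static int n = 0;    /* static: allocated at startup, keeps its value between calls */
    return ++n;          /* returns 1, 2, 3, ... on successive calls */
}

int local_counter(void)
{
    int n = 0;           /* local: allocated on each call, released on return */
    global_count++;      /* the global persists across all calls */
    return ++n;          /* always returns 1 */
}

int *make_buffer(size_t count)
{
    return malloc(count * sizeof(int));   /* dynamic: lives until free() */
}
```

The static counter keeps growing because its storage exists for the whole program, while the local counter starts from zero on every call.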
41. 41
When is memory allocated?
#include <stdlib.h>

long array[100];            /* bss: 0 at startup */
long bufsize = 100;         /* data: 100 at startup */

char *f1(int n);

int main(void) {
    int i;                  /* stack: at function call */
    char *buf;
    i = 10;
    buf = f1(i);
    return 0;
}

char *f1(int n) {
    int k;                  /* stack: at function call */
    return malloc(bufsize); /* heap: 100 bytes at malloc() */
}
42. 42
When is memory deallocated?
#include <stdlib.h>

long array[100];            /* available until program termination */
long bufsize = 100;         /* available until program termination */

char *f1(int n);

int main(void) {
    int i;                  /* deallocated on return from main() */
    char *buf;
    i = 10;
    buf = f1(i);
    return 0;
}

char *f1(int n) {
    int k;                  /* deallocated on return from f1() */
    return malloc(bufsize); /* deallocated on free() */
}
43. 43
Sections defined by the C6000 compiler
Initialized sections
.cinit : Initial values for global/static variables
.const : String literals and global/static variables declared const
.switch : Tables for switch statements
.text : Executable code
Uninitialized sections
.bss : Global and static variables
.stack : Stack (local variables, return addresses, arguments)
.far : Global and static variables declared far
.sysmem : Memory for malloc functions (heap)
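The short C file below annotates where each object would typically land under the section list above. The placement comments are assumptions based on the classic TI C6000 COFF conventions; exact placement depends on the compiler version and options, and whether a given switch becomes a jump table is up to the compiler. The names greeting, samples, gain and scale are hypothetical.

```c
/* Typical placement under the TI C6000 COFF section list (assumption). */

const char greeting[] = "hello";  /* const data: .const */
int samples[4];                   /* uninitialized global: .bss */
int gain = 3;                     /* global in .bss; its initial value is kept in .cinit */

int scale(int code)               /* function code: .text */
{
    switch (code) {               /* a dense switch may become a jump table in .switch */
    case 0:  return gain * 1;
    case 1:  return gain * 2;
    case 2:  return gain * 3;
    case 3:  return gain * 4;
    default: return -1;
    }
}
```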