Chapter 1
Syllabus
Catalog Description: Computer structure, machine representation of data,
addressing and indexing, computation and control instructions, assembly
language and assemblers; procedures (subroutines) and data segments,
linkages and subroutine calling conventions, loaders; practical use of an
assembly language for computer implementation of illustrative examples.
Course Goals
0 Knowledge of the basic structure of microcomputers - registers, mem-
ory, addressing I/O devices, etc.
1 Knowledge of most non-privileged hardware instructions for the Ar-
chitecture being studied.
2 Ability to write small programs in assembly language
3 Knowledge of computer representations of data, and how to do simple
arithmetic in binary & hexadecimal, including conversions
4 Being able to implementing a moderately complicated algorithm in
assembler, with emphasis on efficiency.
5 Knowledge of procedure calling conventions and interfacing with high-
level languages.
Optional Text: Kip Irvine, Assembly Language for the IBM PC, Prentice
Hall, 4th or 5th edition
1
Additional References: Intel and DOS API documentation as presented
in Intel publications and online at www.x86.org; lecture notes (to be sup-
plied as we go).
Prerequisites by Topic. Working knowledge of some programming lan-
guage (102/103: C/C++); Minimal programming experience
Major Topics Covered in the Course:
1 Low-level and high-level languages; why learn assembler?
2 How does one study a new computer: the CPU, memory, addressing
modes, operation modes.
3 History of the Intel family of microprocessors.
4-5 Registers; simple arithmetic instructions; byte order; Arithmetic and
logical operations.
6 Implementing longer integer type support; carry and overflow.
7 Shifts, multiplication and division.
8 Memory layout.
9 Direct video memory access; discussion of the first project.
10 Assembler syntax; how to use the tools.
11-13 Conditional & unconditional jumps; loops; emulating high-level lan-
guage constructions; Stack; call and return; procedures
14-15 String instructions: effcient memory-to-memory operations.
16 Interrupts overview: interrupt table; how do interrupts work; classif-
cation.
17 Summary of the most important interrupts.
18-20 DOS interrupt; File I/O functions; file-copy program; discussion of
the second project
21 Interrupt handlers; keyboard drivers; timer-driven processes; viruses
and virus-protection software.
2
22 Debug interrupts; how do debuggers and profilers work.
23-24 (Optional).interfacing with high level languages; Protected mode fun-
damentals
Grading The grading is based on two projects, midterm project is 49%
and the final is 51%. Please note that the projects are individual, submitting
projects that are similar to submissions of others and/or are essentially
downloads from the Web would result in a fail.
Office Hours My hours this term for CSc 210 will be 3:45 ¶Ł 4:45 on
Mondays.
Zoom links:
11am https://ccny.zoom.us/j/8 ...
1. Chapter 1
Syllabus
Catalog Description: Computer structure, machine
representation of data,
addressing and indexing, computation and control instructions,
assembly
language and assemblers; procedures (subroutines) and data
segments,
linkages and subroutine calling conventions, loaders; practical
use of an
assembly language for computer implementation of illustrative
examples.
Course Goals
0 Knowledge of the basic structure of microcomputers -
registers, mem-
ory, addressing I/O devices, etc.
1 Knowledge of most non-privileged hardware instructions for
the Ar-
chitecture being studied.
2 Ability to write small programs in assembly language
3 Knowledge of computer representations of data, and how to
do simple
arithmetic in binary & hexadecimal, including conversions
4 Being able to implementing a moderately complicated
2. algorithm in
assembler, with emphasis on efficiency.
5 Knowledge of procedure calling conventions and interfacing
with high-
level languages.
Optional Text: Kip Irvine, Assembly Language for the IBM PC,
Prentice
Hall, 4th or 5th edition
1
Additional References: Intel and DOS API documentation as
presented
in Intel publications and online at www.x86.org; lecture notes
(to be sup-
plied as we go).
Prerequisites by Topic. Working knowledge of some
programming lan-
guage (102/103: C/C++); Minimal programming experience
Major Topics Covered in the Course:
1 Low-level and high-level languages; why learn assembler?
2 How does one study a new computer: the CPU, memory,
addressing
modes, operation modes.
3 History of the Intel family of microprocessors.
4-5 Registers; simple arithmetic instructions; byte order;
3. Arithmetic and
logical operations.
6 Implementing longer integer type support; carry and overflow.
7 Shifts, multiplication and division.
8 Memory layout.
9 Direct video memory access; discussion of the first project.
10 Assembler syntax; how to use the tools.
11-13 Conditional & unconditional jumps; loops; emulating
high-level lan-
guage constructions; Stack; call and return; procedures
14-15 String instructions: effcient memory-to-memory
operations.
16 Interrupts overview: interrupt table; how do interrupts work;
classif-
cation.
17 Summary of the most important interrupts.
18-20 DOS interrupt; File I/O functions; file-copy program;
discussion of
the second project
21 Interrupt handlers; keyboard drivers; timer-driven processes;
viruses
and virus-protection software.
2
4. 22 Debug interrupts; how do debuggers and profilers work.
23-24 (Optional).interfacing with high level languages;
Protected mode fun-
damentals
Grading The grading is based on two projects, midterm project
is 49%
and the final is 51%. Please note that the projects are
individual, submitting
projects that are similar to submissions of others and/or are
essentially
downloads from the Web would result in a fail.
Office Hours My hours this term for CSc 210 will be 3:45 ¶Ł
4:45 on
Mondays.
Zoom links:
11am https://ccny.zoom.us/j/85378437821
2pm https://ccny.zoom.us/j/87625527827
3
https://ccny.zoom.us/j/87625527827
https://ccny.zoom.us/j/85378437821
Chapter 2
Preliminary material
5. 4
: Why assembler?
• Why take this class?
• Why program assembler?
• Why know assembler?
5
: NOTE: think Binary!
Why binary?
Binary numbers (WIKI)
(brief answer: because this is easy to implement)
Why hex?
Hexadecimal numbers (WIKI)
(brief answer: because it is much easier to work with shorter
strings)
What about DNA?
6
https://en.wikipedia.org/wiki/Hexadecimal
https://en.wikipedia.org/wiki/Binary_number
2.1 Introduction #1: looking at new hard-
ware
• CPU, general purpose (arithmetic) registers
6. – How large?
– How many?
– Are they all the same?
– Modes?
• Memory Model
– Is all memory the same?
– Flat?
– Segmented?
– Paged?
• Other hardware (peripherals)
• OS
• Special features
7
2.2 Introduction #2: History
Intel Processors Over the Years
The History Of Intel CPUs
-1971 before Intel
1971 4004
7. • Intention
• Name
• Usage
• What can you do with 4 bits?
1972 8008
- Doubling – what can you do with 8 bits
1974 8080
1975 8085
1975 Z80
1974 CP/M – Digital Research, Gary Kildall
1978
- 8086 – X86 architecture. 8 8bit registers, 8(+6) 16 bit
registers. 1mb
limit. 1mb mystery?
1979 8088 – cost cutting
1981 iAPX 432 – an attempted 32 bit processor
1982 80186 – minor improvements/corrections
1981 IBM PC
8
9. Problem: Let AH = 2,AL = 3. What is AX?
Solution
:
00000010 00000011
AH=00000010b=02h
AL=00000011b=03h
AX=0000001000000011b = 0203h = 515d
Note: suffices b, h and d are part of the Assembly language
syntax;
d(ecimal) is the default. Assignment syntax, however, is
different, it is only
used for an illustration here.
Fast solution:
AX = 2*256+3
10. AX = (2<<8)+3
Problem: Let AX = 2020. What are AL and AH ?
Registers BX,CX, DX are divided similarly.
General purpose aka Arithmetic registers:
Sequence A,B,C,D is an illusion, these letters stand for
Accumulator,
Base, Count, Data.
8 8bit registers: AH,AL,BH,BL,CH,CL,DH,DL 8 16bit registers:
AX,BX,CX,DX
and SI,DI,BP(?),SP(??)
SP generally cannot be used for calculations, BP usually cannot
be used
either.
(32 bit to be described later)
11
11. IP – Instruction pointer
points to the first byte of the current instruction.
Code:
B B B B B B B B B B B B B B B B B B
Code is essentially a one dimensional array of bytes (in C/C++
– un-
signed char type).
IP initially is 0, after one instruction is executed it should be 2,
then 5,
then 6, ....
Simplified logic (one instruction)
byte code[MAXCODE];
byte opcode;
opcode=code[IP++];
switch (opcode) {
case 0x00: ...
12. case 0x01: ...
...
case 0xFF: ...
}
each subcase will read additional bytes if needed to complete
reading of
the instruction.
12
Simplified logic (full execution)
byte code[MAXCODE];
byte opcode;
while(true) {
opcode=code[IP++];
13. switch(opcode) {
case 0x00: ...
....
case 0xFF: ...
}
}
Why is this simplified?
• CS is also used.
• how do we terminate?
• how do we change the executed sequence?
What do we do with this?
Is switch efficient?
Question: what would IP=k do (if such instruction exists).
13
14. FLAGS register
Should be seen not as a single 16-bit register but as a collection
of 16
1-bit registers.
More important ones: ZF, SF, CF, DF
Neither FLAGS nor the names above are keywords.
14
Segment registers : CS, DS, SS, ES – specify where segments
(“parts”) of
the program are located.
• CS Code Segment
• DS Data Segment
• SS Stack Segment
15. • ES Extra Segment
15
2.5 8086 registers – full list
• AX Accumulator eXtended
• AL Accumulator Low
• AH Accumulator High
• BX Base eXtended
• BL Base Low
• BH Base High
• CX Count eXtended
• CL Count Low
• CH Count High
16. • DX Data eXtended
• DL Data Low
• DH Data High
• SI Source Index
• DI Destination Index
• BP Base Pointer
• SP Stack Pointer
• CS Code Segment
• DS Data Segment
• SS Stack Segment
• ES Extra Segment
• IP Instruction Pointer (not a keyword)
• Flags Flags (not a keyword)
17. 16
2.6 General addressing scheme
Three distinct ways to address memory:
• Absolute address : mem[offset] (flat model–generally cannot
be done)
• Segmented address : mem[f(seg,offset)] (done by hardware).
Usual
notation: ssss:oooo (hex digits)
• Expressing segmented address in assembly syntax – to be
covered
later
The f(seg,offset) function is mode-dependent.
In real mode, f(seg,offset)=seg*16+offset.
This allows to build 20 bit numbers out of 16 bit quantities.
Examples
0000:0000 =⇒ 00000
18. 1234:5678 =⇒ 179B8
+ 12340
05678
--------
179B8
The mapping is not one-to-one! Different (seg,offset) pairs may
point to
the same address.
0000:0100 =⇒ 00100
0010:0000 =⇒ 00100
Puzzle
FFFF:FFFF =⇒ ?????
(ref: A10 address line)
==
17
19. Code segment is effectively mem[f(CS,i)], Data segment is
effectively
mem[f(DS,i)]
Protected memory addressing function uses Segment Descriptor
Table
lookup. Fields include Base, Limit, Access Rights.
Implication: instructions Segment<-value are very costly in
protected
mode.
18
2.7 Back to History: Original IBM PC (1981)
Distorted:
Timeline
IBM’s brand recognition, along with a massive marketing
20. campaign, ignites the fast growth of the personal computer mar -
ket with the announcement of its own personal computer (PC).
The first IBM PC, formally known as the IBM Model 5150, was
based on a 4.77 MHz Intel 8088 microprocessor and used Mi -
crosofts MS-DOS operating system. The IBM PC revolutionized
business computing by becoming the first PC to gain
widespread
adoption by industry. The IBM PC was widely copied
(“cloned”)
and led to the creation of a vast “ecosystem” of software, pe -
ripherals, and other commodities for use with the platform.
Better:
WIKIPEDIA article
Additional link (on reaction):
Orson Scott Card’s novel
19
https://en.wikipedia.org/wiki/Lost_Boys_(novel)
https://en.wikipedia.org/wiki/IBM_Personal_Computer
https://www.computerhistory.org/timeline/1981/
No OS !
21. Three options:
• CP/M-86 (Control program for Microcomputers), see also DR
page
• UCSD p-System
• PC DOS/MS DOS, see also 86-DOS
See also: PL/M
Introduction #2: History (cont)
1982 80186, 80188
1982-1991 80286
1985-2007 80386
80186 : almost not used in PC’s, many improvements in
instructions
(kept).
80286 : 16mb protected mode–promise not fullfilled.
Real mode −→−→ Prot mode
23. Doubling of registers again
EAX = xxxxxxxxxxxxxxxx ahahahah alalalal
22
Flags register becomes EFLAGS :
Additionally:
• Control Registers CR0..CR7 (CR0=MSW(Machine Status
Word) on
80286)
• Test Registers TR0..TR7
• Debug Registers DR0..DR7
64 bit mode adds RAX,...
23
24. 24
On paging
Virtual memory allows to execute programs larger than physical
mem-
ory.
Generally cannot be controlled by the programmer, paging
algorithms
are implemented by the OS
Page replacement algorithms
Application algorithms can be tailored for paging environment.
Example:
#define N 1024
int x[N][N],y[N][N],z[N][N];
int i,j;
25. for (int i=0; i<N; i++)
for (int j=0; j<N; j++)
z[i][j]=x[i][j]+y[i][j];
vs
#define N 1024
int x[N][N],y[N][N],z[N][N];
int i,j;
for (int i=0; i<N; i++)
for (int j=0; j<N; j++)
z[j][i]=x[j][i]+y[j][i];
Will the two programs run equally fast ?
25
https://www.geeksforgeeks.org/page-replacement-algorithms-in-
26. operating-systems/
Assume 3 pages are available. (1 page is exactly a row of a
matrix
above.)
Two dimensional arrays are stored row by row.
First program : 1024 swaps.
Second program : 10242 swaps.
Technical info
26
https://wiki.osdev.org/Paging
2.8 Back to History: 32 bit OS?)
OS/2 1987-2001
27
https://en.wikipedia.org/wiki/OS/2
30. where
<label> – optional label (any identifier that is not a keyword or
defined oth-
erwise).
<opcode> – name of the instruction (keyword)
<operands> – comma-separated operands, if any; their number
(0-3) depends on
the opcode
;comment – any text, ignored up to the EOL.
Trivial example:
30
lab: ; this line does not do anything
Symbolic representation of instructions corresponds to
particular se-
31. quence of bytes which are actually executed.
3.2 The NOP instruction
NOP (do nothing)
Binary representation: one byte, hex value 90h.
Execution:
Before:
bb bb bb bb bb bb bb bb bb
↑IP
90 bb bb bb bb bb bb
After:
bb bb bb bb bb bb bb bb bb 90
↑IP
bb bb bb bb bb bb
IP is incremented by 1; no other register is changed
31
32. WHY have it?
• delay?
• padding for sloppy compilers
• patching (code deletion)
• reserving space for patching(code addition)
32
3.3 The MOV instruction
MOV dst,src (copy src to dst)
Example:
MOV AL,BL
;
34. MOV IP,AX ; illegal
MOV AX,CS ; ok, special case (see below)
MOV DS,AX ; ok, special case (see below)
MOV CS,DX ; special case, illegal
MOV DS,EDI ; illegal
MOV CR0,EAX ; priveleged
MOV DR0,EAX ; ok, special case (see below)
RULE #1: size of src and dst must match
Most instructions support only gp regis-
ters
33
Argument types:
36. r m i
r . . .
m . . .
i . . .
Move-specific template:
r m i s
r . . . .
m . . . .
i . . . .
s . . . .
35
Right now:
r m i
r X . .
m . . .
i . . .
37. Examples:
MOV AL,[100] ; brm
MOV BX,[200] ; wrm
MOV EDI,[400] ; drm
MOV [100],AL ; bmr
MOV [200],BX ; wmr
MOV [400],EDI ; dmr
Thus
r m i
r X X .
m X . .
i . . .
What does [#] really mean?
Answer: bytes beginning with byte #.
in
38. MOV AX,[100]
which byte goes where?
36
Examples:
MOV AL,1 ; bri
MOV DX,2 ; wri
MOV EDI,4 ; dri
r m i
r X X X
m X . .
i . . .
Examples:
MOV AL,97 ;
39. MOV AL,61h ; all four lines are equivalent
MOV AL,01100001b
MOV AL,’a’ ;
...
MOV AL,1000 ; ???
37
No storing into immediates, this would be like
1=x;
in C.
Thus:
r m i
r X X X
m X . .
i × × ×
40. Important: MOV with immediate is a fundamentally different
operation
from the rr,rm, mr forms.
38
RULE #2: no memory-to-memory
(2 exceptions later)
Thus:
r m i
r X X X
m X × ?
i × × ×
MOV [100],1 ; should not compile
RULE #3: size must be known
Correct syntax:
41. MOV byte ptr [100],1
MOV word ptr [100],1
MOV dword ptr [100],1
MOV qword ptr [100],1 ; 64 bit only
MOV tbyte ptr [100],1 ; ???
What about
MOV [100],AL
MOV byte ptr [100],AL ; unneeded
MOV word ptr [100],AL ; will not compile
Final result:
r m i
r X X X
m X × X
i × × ×
42. 39
Full table (MOV only):
r m i s
r X X X X
m X × X X
i × × × ×
s X X × ×
3.3.1 Examples
Here is how C/C++ assignments may be compiled:
char c1,c2; c1=c2;
-------------------
MOV AL,c2
MOV c1,AL
short s1,s2; s1=s2;
44. MOV x,EAX;
MOV y,EAX;
int x; x=0;
-------------------
MOV x,0;
int x,y,z; x=y=z=0;
-------------------
MOV x,0
MOV y,0
MOV z,0
perhaps, a better implementation?
MOV EAX,0 ; could be even better
MOV x,EAX
45. MOV y,EAX
MOV z,EAX
41
Exercise: Exchange bytes in [100] and [101]
MOV AL,[100]
MOV AH,[101]
MOV [100],AH
MOV [101],AL
can this be done in fewer lines of code?
MOV AX,[100]
MOV [100],AH
46. MOV [101],AL
Note: Byte order matters.
42
3.3.2 Byte order
Consider:
MOV [100],AX
Does
LE,reversed AL go into [100] and AH into [101] or, instead:
BE,normal AH go into [100] and AL into [101]
More than you want to know on Endianness
LE,reversed : Intel, Dec
BE,normal : IBM mainframe, Motorola, Sun
47. Practical implications:
• it is important to know the endiness of the hardware and the
data.
• it is important to be able to swap.
• it is important to be able determine the endiness. How?
Specific example of byte order importance:
short s=1;
FILE *f=fopen("try.dat","wb");
if (!f) { ... error handling ... }
fwrite(&s,1,sizeof(s),f);
fclose(f);
Should create a 2-byte file try.dat.
Now,
48. 43
https://en.wikipedia.org/wiki/Endianness
short s;
FILE *f=fopen("try.dat","rb");
if (!f) { ... error handling ... }
fread(&s,1,sizeof(s),f);
fclose(f);
cout << s;
should print the value of s – indeed 1.
But: what will happen if we run the Writing program on an Intel
comp,
move the data file to a Sun, and run the reading program there?
Exercise: Can a high-level program be written that determines
the order of bytes?
49. 44
3.4 The XCHG instruction
XCHG dst,src (exchange src with dst)
XCHG r m i
r X X ×
m X × ×
i × × ×
Segment and other non-gp registers are not supported.
The syntax and examples from MOV apply, except for non-use
of non-gp
registers and immediates.
Examples (which of the following are valid?)
XCHG AL,AH
XCHG AX,SP
50. XCHG EAX,EDI
XCHG AL,[400]
XCHG [400],AL ;same as above
XCHG AL,DI
XCHG DI,DS
XCHG EAX,7
XCHG [100],[101]
XCHG AX,AX ; nop?
XCHG DI,DI ; nop?
XCHG CL,CL ; nop?
45
Can a better version of byte swap program be now written?
51. Better:
MOV AX,[100]
XCHG AL,AH
MOV [100],AX
Yet better:
XCHG AX,[100]
XCHG AL,AH
XCHG [100],AX
Q: can a shorter program be written (perhaps with another
instruc-
tion)?
46
52. 3.4.1 Binary encoding of XCHG
We only consider accumulator exchanges now.
Instructions
XCHG AX,reg
are extra optimized in the intel architecture.
90h XCHG AX,AX
91h XCHG AX,CX
92h XCHG AX,DX
93h XCHG AX,BX
94h XCHG AX,SP
95h XCHG AX,BP
96h XCHG AX,SI
97h XCHG AX,DI
53. Q: Why the # of registers is a power of 2 ?
A: Because this allows to represent registers as in a fixed
number of bits.
47
16-bit register representation:
000b AX
001b CX
010b DX
011b BX
100b SP
101b BP
110b SI
111b DI
54. An emulator may use code like
unsigned short regs[8];
#define AX regs[0]
#define CX regs[1]
#define DX regs[2]
#define BX regs[3]
#define SP regs[4]
#define BP regs[5]
#define SI regs[6]
#define DI regs[7]
Notes:
• this is just an example!
55. • 8 bit registers have their own 3-bit keys
• 32 bit registers parallel 16 bit registers
• 64 bit registers use 4-bit keys
• The above code should define 8-bit regs properly (f.e. setting
AX
should set AL,AH too!
48
• The above code should be modified to support 32 bit registers
texttt{XCHG AX,AX}
is NOP. General encoding scheme of XCHG (with accumulator):
1 0 0 1 0 r e g
This idea is used in other instructions. XCHG without
accumulator uses
a l8engthier encoding, with first byte 86h/87h.
57. 3.5 The ADD instruction
ADD dst,src (dst += src)
(proper name should be increment by.)
General 2-operand instruction layout applies:
ADD r m i
r X X X
m X × X
i × × ×
Given that syntax of ADD is largely similar to MOV, the
examples are sim-
ilar:
ADD AX,BX
ADD EAX,ESP
ADD DL,CL
ADD AX,[100]
ADD [150],EAX
58. ADD AX,DS ; illegal
ADD AX,DL ; illegal
ADD [10],5 ; syntax error
ADD word ptr [10],5 ; fine
C example:
int x,y,z; x=y+z;
-----
MOV EAX,y
51
ADD EAX,z
MOV x,EAX
59. int x,y,z; x=x+y;
-----
MOV EAX,y
ADD x,EAX
int x,y,z; x=x+25;
-----
ADD x,25
(NOTE: size specification is not required if x is declared to be a
double
word)
52
Consider
ADD AL,AL
60. Generally, multiplication by 2 should not be done as
multiplication
(generally about 3x slower than addition). Writing
int x; x=2*x;
is wrong! One should use either addition or a shift (if
available). (What is
better depends on the situation and hardware).
Q: Should we replace multiplication by addition in :
int f(int);
int x; x=2*f(x);
More simple examples:
Consider
ADD AL,0 ; nop ?
ADD AL,1 ; increment ?
ADD AL,-1 ; decrement ?
63. c=1;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
printf("%d",c); c=c+c;
....
Note: if you like C++ and cout<<, make sure to cast!
54
Q: what would be the output if we use char rather than unsigned
char?
Is this a size problem ? Try
64. MOV AX,1
ADD AX,AX
...
OR
MOV EAX,1
ADD EAX,EAX
...
OR
C/C++ versions.
55
Unlike MOV and XCHG, ADD is an arithmetic instruction: it
sets flags.
Warning: the discussion of the flags is slightly simplified, I’m
65. not con-
sidering the OF. Thus there are slight differences between the
behavior
described and the actual behavior of the processor. This makes
no differ-
ence for most programs, but there are rare instances where this
matters. In
particular, I will consider JS and JL as equivalent, in reality
they are not
exactly the same.
ZF Zero Flag
SF Sign Flag
CF Carry Flag
OF Overflow Flag
; ZF SF CF
MOV AL,1 ; AL=1 binary |00000001 ? ? ?
ADD AL,AL ; AL=2 binary |00000010 0 0 0
67. 56
3.5.1 Overflow detection
unsigned int x,y,z;
....
x=y+z; // concern about overflow
unsigned int x,y,z;
....
y=0x90000000;
z=0x90000000;
x=y+z; // overflow will occur here, result will be incorrect.
can we check for it like this?
68. unsigned int x,y,z;
....
if (y+z>0xFFFFFFFF)
error("overflow");
x=y+z;
Correct way:
unsigned int x,y,z;
....
if (y>0xFFFFFFFF-z)
error("overflow");
x=y+z;
57
69. Exercise: what about signed types?
A: you will need to check both for “positive” overflow (adding
two large
positive number) and for the “negative;; overflow (adding two
large nega-
tive numbers).
In assembler, flags report overflow condition – no need for
extra check-
ing!
3.6 The SUB instruction
SUB dst,src (dst -= src)
(proper name should be decrement by.)
General 2-operand instruction layout applies:
SUB r m i
r X X X
m X × X
i × × ×
70. Given that syntax of SUB is identical to ADD, syntax examples
are similar
and omitted.
ADD AX,100
SUB AX,-100 ; same as above
;
ADD AX,-100
SUB AX,100 ; same as above
What do these instructions do?
ADD AX,0
SUB AX,0
58
What does this instruction do?
71. SUB EAX,EAX
Answer: most efficient way to zero up a register.
What is the difference between the two instructions below?
SUB EAX,EAX
MOV EAX,0
Answer: the former is more efficient; the latter is rarely used,
only in
the situations when flags must be preserved. (an example,
involving an if,
will be given later.)
Revising example we saw above, more efficient code:
int x,y,z; x=y=z=0;
-------------------
SUB EAX,EAX
72. MOV x,EAX
MOV y,EAX
MOV z,EAX
with SUB, Carry flag indicates borrowing.
59
3.7 The INC instruction
INC dst (dst++)
Do we write
ADD AX,1
ADD byte ptr [10],1
A: yes, we can, but usually we would use the optimized form
General 1-operand instruction layout applies:
73. INC r m i
X X ×
(Same format applies to three more instructions, explained
later).
Register form is optimized to one-byte encoding:
40h INC AX
41h INC CX
42h INC DX
43h INC BX
44h INC SP
45h INC BP
46h INC SI
47h INC DI
Other forms of INC are encoded in lengthier way beginning
with 0FFh
74. and 0FEh.
Warning: this encoding applies to BOTH 16 and 32 registers!
What is better?
60
INC AX
INC AX
;or
ADD AX,2
A: former. But do not do this with memory arguments.
61
3.8 The DEC instruction
75. DEC dst (dst- -)
DEC r m i
X X ×
Comments on INC above are applicable.
Optimized form:
48h DEC AX
49h DEC CX
4Ah DEC DX
4Bh DEC BX
4Ch DEC SP
4Dh DEC BP
4Eh DEC SI
4Fh DEC DI
76. Other forms of DEC are encoded in lengthier way beginning
with 0FFh
and 0FEh.
62
3.9 The NEG instruction
NEG dst (dst=-dst)
NEG r m i
X X ×
How to negate without using NEG?
NEG EAX
;is the same as
SUB EBX,EBX
SUB EBX,EAX
77. MOV EAX,EBX
Solve equation x = −x ?
MOV AL,x
NEG AL
if AL did not change, does this mean x is 0?
No, it is either 0 or 128!
For short, the solutions are 0 and 215=8000h, for 32-bit they are
....
63
#include <stdlib.h>
#include <stdio.h>
int main() {
short s,s1;
78. s=0; s1=-s; if (s!=s1) printf("for s=%d, changes;n",s);
else printf("for s=%d, does not change;n",s);
s=1000;s1=-s; if (s!=s1) printf("for s=%d, changes;n",s);
else printf("for s=%d, does not change;n",s);
s=0x8000;s1=-s; if (s!=s1) printf("for s=%d, changes;n",s);
else printf("for s=%d, does not change;n",s);
return 0;
}
}
results in
for s=0, does not change;
for s=1000, changes;
79. for s=-32768, does not change;
3.10 The CMP instruction
CMP dst,src (dst-src)
General 2-operand instruction layout applies:
CMP r m i
r X X X
m X × X
i × × ×
Conditionals (if,while) are implemented in two stages:
• compute flags
• do (or not) the operation depending on a flag(or flags) – this is
later.
64
Examples:
80. ; if (x==0) ...
;
; compute ZF from value of x
; if (x!=0) ...
;
; compute ZF from value of x
; if (x<0) ...
;
; compute SF from value of x
; if (x<=0) ...
;
; compute SF and ZF from value of x
What about
81. ; if (x==y) ...
;
;
to set the flags, we compute the difference
; if (x==y) ...
;
MOV EAX,x
SUB EAX,y
; this sets ZF for our use
65
; if (x<y) ...
82. ;
MOV EAX,x
SUB EAX,y
; this sets SF for our use
; if (x<y) ...
;
MOV EAX,x
SUB EAX,y
; this sets SF and ZF for our use
Problem: this computation destroys value of x which is often
needed.
So, instead:
; if (x<y) ...
;
83. MOV EAX,x
CMP EAX,y
; this sets SF and ZF for our use
Flags are set exactly as for SUB, but EAX retains the value of
x.
; if (x==5) ...
;
MOV EAX,x
SUB EAX,5 ; inefficient
;
SUB x,5 ; worse yet: destroys variable
;
CMP x,5 ; fine
84. 66
3.11 Logical or Bitwise
What kind of AND operation should be implemented in
hardware? (same
question of course can be asked of OR etc but we will
concentrate on AND)
• logical AND (in C, &&) ?
• bitwise AND (in C, &) ?
• both ?
Logical AND works with True and False concepts:
1 && 1 = 1
1 && 0 = 0
0 && 1 = 0
85. 0 && 0 = 0
Q: what about 5 && 6 ?
(5 is first converted to 1 (true), 6 is converted to 1 (true), we
compute
1&&1 and get result 1).
This is a multi-step operation!
Bitwise AND: apply the AND operation to every bit (every
column):
0101 == 5
& 0110 == 6
----
0100 == 4
Thus 5 & 6 = 4.
Notice that on single bits (or multibit 0 and 1), the results of &
and &&
86. are identical.
1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0
67
We summarize:
• On appropriate logical values, logical and bitwise AND are the
same.
• Logical AND is a multistep operation – hard to implement in
hard-
ware.
• Bitwise AND can be done easily and efficiently (array of
AND-gates)
87. • on non-standard input values, logical AND offers minimal
advantages
(how often do we care about 5&&6?)
• on non-standard input values, bitwise AND offers huge
advantages:
– masks
– sets
– images
3.11.1 Bits and Masks
Most x86 2-operand instructions are encoded as follows:
o p c o d e d w m d r e g r - m .....
for example, for ADD, opcode is 000000, MOV, opcode is
100010, etc
(Encoding for forms with accumulator and immediate are
similar).
The w field states if the operation is word or byte. Register
88. reg=000
means AX if w==1 and AL if w==0. The d field states if the
direction is from
register (0) or to register(1). So decoding should retrieve these
bytes.
currentbyte=code[ip++];
....
wordop=currentbyte & 1;
from_reg=(currentbyte & 2)!=0;
....
68
alternately one can use division and mod (yuck).
3.11.2 Checking for odd/even
89. Q: how do we check for a number being odd or even?
if (x is odd) { ... }
69
if (x % 2 == 1) { ... }
better:
if (x % 2) { ... }
much better:
better:
if (x & 1) { ... }
Q: how do we check for a number is divisible by 4?
if ((x & 3)==0) { ... }
if (!(x & 3) { ... }
90. Q: how do we check for a number is divisible by 2k?
if ((x & 2k−1)==0) { ... }
70
3.11.3 Sets
How do we represent a set?
Consider {2,3,5,7,11,13}
Idea #1: Representation as a link list:
2 ⇒ 3 ⇒ 5 ⇒ 7 ⇒ 11 ⇒ 13 ⇒ #
Idea #2: Representation as an array:
2 3 5 7 11 13
Idea #3: Representation as a bit array (set)
0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0
91. Total space : 2 bytes.
Search time:
• list Linear
• array Linear or log (binary search)
• set Constant!
Exercise: write the formula
If A,B are similar sets, one can use & to compute the
intersection:
unsigned char A[N];
unsigned char B[N];
unsigned char C[N];
for (int i=0;i<N;i++) C[i]=A[i]&B[i];
Set representation is not always the right one, consider
{1,10000000}!.
92. 71
3.11.4 Images
Consider images.
Actual representation (b/w image): one dimensional bit array.
What would this produce?
&
Conclusion: Bitwise AND is a much better choice for assembler
implen-
tation.
72
3.12 The AND instruction
AND dst,src (dst &= src)
AND r m i
93. r X X X
m X × X
i × × ×
Syntax is the same as ADD,SUB,...
Examples:
MOV AX,5
AND AX,6 ; AX is 4
AND AX,1 ; check evenness
AND AX,7 ; what does this do?
AND AX,0 ; what does this do?
AND AX,AX ; what does this do?
AND AX,CH ; what does this do?
3.13 The TEST instruction
TEST dst,src (dst & src)
94. TEST is non-descructive AND, cf. CMP/SUB.
TEST r m i
r X X X
m X × X
i × × ×
73
Syntax is the same as AND, ADD, SUB,...
Encoding is less efficient than for other operations – TEST is
less com-
mon.
TEST is helpful when we want to extract different bits from the
same
number.
3.14 The OR instruction
OR dst,src (dst |= src)
OR r m i
95. r X X X
m X × X
i × × ×
Syntax is the same as ADD,SUB, AND...
In C, bitwise or is coded as |. Hardware implements the bitwise
form.
Examples:
MOV AX,5
OR AX,6 ; AX is 7
0101 == 5
| 0110 == 6
----
0111 == 7
OR AX,0
(set flags – not as efficient as RR form)
96. OR AX,AX
74
(set flags – just as efficient as RR form of AND)
OR AX,1
(set last bit to 1; numerically round up to an odd number.)
OR AX,2
(set next to the last last bit to 1.)
OR AX,3
(set last two bits to 1)
For sets (and graphics) OR implements union.
3.15 The XOR instruction
XOR dst,src (dst ˆ= src)
97. OR r m i
r X X X
m X × X
i × × ×
Syntax is the same as ADD,SUB, OR...
Truth table for XOR:
1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
0 XOR 0 = 0
75
XOR AX,AX
(clear register – just as efficient as RR form of SUB)
98. XOR AX,0
(only sets flags)
XOR AX,1
(toggle last bit)
XOR AX,7
(toggle last three bits)
XOR AX,0xFFFF
(toggle all bits)
Involution property of XOR:
((d XOR k) XOR k) == (d XOR (k XOR k)) == (d XOR 0) == d
Application #1: temporary lines:
(same idea can be used for the mouse pointer or inverting
selected text).
76
99. Application #2: cypher:
((data XOR key) XOR key) == data
char data[N];
for (int i=0; i<N; i++)
data[i]=data[i] ^ key;
In pure form, variation of a substitution cipher:
char data[N];
char subst[256]; // contains permutation of ASCII chars
for (int i=0; i<N; i++)
data[i]=subst[data[i]];
Not secure, but good for puzzles! Cryptogram puzzles.
With key chaining:
char data[N];
100. for (int i=0; i<N; i++) {
char c;
c=data[i];
data[i]=c ^ key;
key=f(key,c);
}
77
https://api.razzlepuzzles.com/cryptogram
nearly unbreakable; cf. the Type 1 font encryption episode.
The unbreakable Adobe encryption was actually implemented
with this
code:
unsigned short int r; // this is the key
101. unsigned short int c1 = 52845;
unsigned short int c2 = 22719;
unsigned char Encrypt(plain) unsigned char plain;
{unsigned char cipher;
cipher = (plain ^ (r>>8));
r = (cipher + r) * c1 + c2;
return cipher;
}
(Naturally, there are myriads of small changes that can be made
to the
code, beginning with change of the constants... and they would
lead to
equally strong security!)
Question: why is C not providing ^^ ?
3.16 The NOT instruction
102. NOT dst (dst= d̃st)
NOT r m i
X X ×
Follows the format of INC,DEC, NEG.
Example:
XOR AX,0xFFFFh
NOT AX ; same result as above ?
Warning: NOT may produce drastically different results
depending on
the size of the number!
78
#include <stdlib.h>
#include <stdio.h>
103. int main() {
unsigned short s; unsigned char c; unsigned int i;
c=1; c=~c; printf("n negated byte %u ",c);
s=1; s=~s; printf("n negated short %u ",s);
i=1; i=~i; printf("n negated int %u ",i);
}
results in
negated byte 254
negated short 65534
negated int 4294967294
3.17 Shift and Rotate operations
This group of instructions allows to shift bits in a register in
different ways.
The format of all instructions in the group is the same; we will
104. start by
considering the
79
3.17.1 SHL
SHL r/m,amount (SHift Left)
instruction. The first argument of the instruction must be a
register or a
memory location; the second specified the number of positions
the bits of
the first argument will be shifted. There are three ways the shift
amount
can be specified
1. as number 1
2. as register CL (no other register is allowed)
3. as integer number greater than one. We separate this case
105. from the
first one since the encoding is different and this format did not
exist
in the original 8086 (added with 80186).
Here is a simple example of the SHL execution:
MOV AL,11
SHL AL,1
to see what is the resulting value in target, let’s write AL in
binary:
0 0 0 0 1 0 1 1
Shifting the bits left by one position would result in
0 0 0 1 0 1 1 0
where the leftmost bit (0) is moved out and a zero is written to
the last
position. Numerically the value is 22.
It is not an accident that 22 = 11 × 2; the effect of appending a
106. zero
to a binary number is multiplication by 2 (just like appending a
zero to a
decimal number is multiplication by 10). Unlike the “generic”
multiplica-
tion described in a later section, this is an efficient and cheap
operation –
moving bits is naturally cheaper than invoking the multiplier.
80
If the first bit of the number were 1, pushing it out of the
register will
result in “an overflow” and an incorrect result. For example,
shifting 128
in an 8-bit register will produce 0. (Notice that the correct value
of 256 is
impossible on an 8-bit register.)
The first bit, incidentally is not lost – it is moved to the Carry
flag;
checking the carry flag therefore offers a way to detect an
overflow and
107. possibly correct the calculation.
CF= 0 0 0 0 1 0 1 1 0
The other two forms of SHL arguments allow to shift the
register by
more than one position. Consider
MOV AL,11
SHL AL,3
or
MOV AL,11
MOV CL,3
SHL AL,CL
Both snippets will do the same : shift AL by 3 positions to the
left. Three
leftmost bits are pushed out, three zeros are appended at the
end, changing
108. CF= ? 0 0 0 0 1 0 1 1
to
CF= 0 0 1 0 1 1 0 0 0
It is an efficient way to multiply by 8 = 23,and in general left
shift by n
positions is equivalent to multiplication by 2n.
The bits pushed out cannot be all placed into the carry; two of
them are
irreversibly lost and only the last bit to be pushed out ends up
in the carry
flag. In the previous example the bit ending up in the carry is
shown in
red.
81
Unless an overflow occurs, SHL multiplication often works
correctly even
with negative numbers. For example, -1 (binary 11111111
109. becomes 11111110,
which is indeed -2).
This instruction does not change either zero or sign flags.
NOTE: shift instructions like SHL existed on many processors
even be-
fore Intel and got incorporated into the C language ( << ). Thus,
one can
write in C either x=y+y; or x=y<<1; instead of slower x=y*2.
Which of the
two options is better? In most cases, about the same. The right
choice also
depends on the particular formula. For instance
y=f(x)*2; // function call
should not be even changed to y=f(x)+f(x); since this would
result in a
double call of a function! – definitely slower and possibly a
bug, if the
function has side effects. Using the shift here is certainly fine.
On the other
hand, some languages do not even offer shift operator making
the addition
110. is the only alternative to (slower) multiplication
82
3.17.2 SHR
SHR r/m,amount (SHift Right)
instruction has the same syntax as SHL but shifts the bits to the
right. We
again look at an example:
MOV AL,11
SHR AL,1
In binary, 11 is
CF= ? 0 0 0 0 1 0 1 1
shifting this pattern right results in the last bit (1) expelled from
the register
into the carry flag, the remaining bits moving right by one
111. position and a
zero written into the leftmost position, obtaining
CF= 1 0 0 0 0 0 1 0 1
(decimal 5). It is not an accident that 5 = 11/2, for unsigned (or
nonnega-
tive) numbers shifting right by 1 is equivalent to division by 2,
and shifting
right by n bits is division by 2n. (Analogy: with decimal
numbers deleting
the last digit is the same as division by 10).
when shifting by more than one it is the last bit that shifted-out
from the
register that ends up in the carry.
83
One more example to consider :
MOV AL,255
SHR AL,1
112. the result of shifting
CF= ? 1 1 1 1 1 1 1 1
is
CF= 1 0 1 1 1 1 1 1 1
in terms of division we see 127 = 255/2, correct. However, in
MOV AL,-1
SHR AL,1
we see the same exact values (255 is -1), but obviously
incorrect result:
127 = −1/2. In fact the results will be always incorrect when
shifting a
negative number: after the shift the sign bit (1) is replaced by 0!
NOTE: in C/C++ bitshift operators << and >> use SHL and SHR
for un-
signed (or signed but positive) numbers. Applying them to
negative num-
113. bers is supported by syntax but unpredictable, the result may or
may not be
correct, depending on the compiler (it is correct under BCC).
C/C++ syn-
tax allows negative values of shift, but the results are again
unpredictable
(under BCC, the result would be always 0).
Division of negative numbers via a shift is, however, possible in
assem-
bler with the
84
3.17.3 SAR
SAR r/m,amount (Shift Arithmetic Right)
instruction. SAR is nearly identical to SHR; the only difference
is that SHR
inserts zeros at the left, SAR duplicates the first bit. Thus
number 255
(11111111)
114. CF= ? 1 1 1 1 1 1 1 1
would become
CF= 1 0 1 1 1 1 1 1 1
after a SAR (as seen above), but after SHR it is
CF= 1 1 1 1 1 1 1 1 1
where the first 1 is copied to the right but also left where it was.
SAR produces correct results for division of signed/negative
numbers: the
calculation in the example above is division of -1 by 2,
resulting in -1 –
correct result assuming rounding to −∞.
NOTE: Java implements both SAR and SHR as >> and >>>.
85
3.17.4 SAL
115. For symmetry, Intel also has
SAL r/m,amount (Shift Arithmetic Left)
while SAL is named and even encoded differently, it has the
same function-
ality as SHL.
86
3.17.5 ROL
ROL r/m,amount (ROtate Left)
rotates bits leftward, the high bit(s) pushed out appear on low
end.
MOV AL,79h ; AL is 01111001
ROL AL,1 ; AL is 11110010
ROL AL,1 ; AL is 11100101
116. MOV CL,2
ROL AL,CL ; AL is 10010111
ROL never loses any bits and does not use the Carry flag.
Notice that
ROL AL,8
does not do anything, yet another equivalent of NOP. Likewise
ROL AX,16
does not change any registers.
Exercise: What does ROL AX,8 do?
87
3.17.6 ROR
ROR r/m,amount (ROtate Right)
117. is similar, except that the bits are rotated right. Notice that the
two instruc-
tions below are equivalent.
ROL AL,3
ROR AL,5
ROL and ROR are rarely used, albeit this example should be of
interest:
ROL word ptr [100],8
88
3.17.7 RCL
RCL r/m,amount (Rotate Carry Left)
rotates the bits in register and the carry flag left. High bits
being rotated
out are moved to the carry while the carry flag is entered on the
118. low end
of the register.
MOV AL,79h ; AL is 01111001 CY=c (undefined)
RCL AL,1 ; AL is 1111001c CY=0
RCL AL,1 ; AL is 111001c0 CY=1
RCL AL,1 ; AL is 11001c01 CY=1
rotation by more than one can be seen as repeated rotations by
1, except
done as a single instruction.
89
3.17.8 RCR
RCR r/m,amount (Rotate Carry Right)
is similar to RCL except for the direction of the rotation.
119. !!! The code below uses some assembler syntax that has not
been cov-
ered yet !!!
While RCL and RCR may appear strange but they have a
number of us-
ages. One of them is the ability to shift (that is multiply or
divide by a
power of two) numbers that are larger than the register size.
Assume that x
is a 100-byte (800-bit) number stored in memory locations [500]
through
[599]. We would like to multiply it by 2. Remember that the
bytes are
stored in reverse order.
SHL byte ptr DS:[500],1 ;high bit is moved to the carry
RCL byte ptr DS:[501],1 ;carry to 2nd byte, high bit to carry.
RCL byte ptr DS:[502],1 ;carry to 3rd byte, high bit to carry.
...
RCL byte ptr DS:[599],1
120. or, as a loop:
CLC
MOV BX,500
MOV CX,100
L: RCL byte ptr DS:[BX],1
INC BX
LOOP L
90
Exercise: How would you divide this 800-bit number by 2?
Exercise: Would signed and unsigned numbers use the same
code?
to multiply by 2k, you need to shift by k positions. This must be
121. done as
1-position shifts repeated k times since the carry can hold only
one bit,
therefore a double loop is required.
another application for these instructions is shifting a
monochrome image
by one or several pixels. The code is similar to the one used for
shifting
long numbers above.
first two forms of all eight shift/rotate instructions are encoded
to-
gether, using bytecodes D0 through D3. The “register” field in
the 2nd byte
of the encoding actually denotes the operation. The third form
uses C0 and
C1, in a similar fashion.
80386 and all newer processors have double-register-shift
operations
SHLD and SHRD that allows to write more efficient code; we
do not describe
them here.
122. 91
3.18 Data conversion
The arithmetic instructions studied in the previous sections are
sufficient to
implement many formulas, but not all. One particular problem is
mixing of
variables of different sizes, as seen in the following C-language
example
int x;
short y;
char z;
x = y+z;
If all three variables were declared the same way (for instanc e,
all three
int’s), we could have implemented the computation with already
known
123. instructions:
MOV EAX,y
ADD EAX,z
MOV x,EAX
but the size must match rule prevents us from mixing operands
of
different size. Thus, we are in need for data conversion
instructions – ones
that change the size of a variable without changing the value.
Before showing the available instructions, let’s understand the
exact
problem.
In the example above we would like to add byte (char) quantity
z with
word (short) quanity y. To do this, we need to convert z to also
a word
quantity that has the value equal to the original value of z, this
word quan-
tity can be added with y. (We would later need to convert the
124. sum – a word
quantity – to a doubleword, to be able to store it into x).
We actually have a way to do this conversion in some cases. For
exam-
ple, if we assume z to be equal to 1, the following assembler
code would
work:
MOV AL,z
SUB AH,AH
Bingo: by zeroing-up the high bits in the AX register, we
extended the
value into a word, and can add it with the value of y:
92
MOV AL,z
SUB AH,AH
125. ADD AX,y
Assuming y is equal to 3, the result would be 4, stored in AX.
To save it
into x we can append two more zero bytes:
MOV AL,z
SUB AH,AH
ADD AX,y
MOV word ptr x,AX
MOV word ptr x[2],0
ZERO EXTENSION, used in the example, is in fact a correct
solution, just
not for the our formula; with diffently chosen numbers it will
fail. Consider
int x;
short y=1;
126. char z=-1;
x = y+z;
ZERO extending z as we did above will result in the value 255
in the
AX; x therefore will be computed as 256.
Why is this happening? ZERO EXTENSION always produces
non-negative
numbers, even if the original value had the sign (high) bit set,
the extended
value will always have zero in the high position. Thus, it is
suitable for un-
signed data types, the example
unsigned int x;
unsigned short y;
unsigned char z;
x = y+z;
will work fine with any values of x and y. For signed data types
127. we need
a different type of extension. For simplicity, let’s look only at
conversion
AL ⇒ AX. The value in AL is considered as signed. If it
happens to be
nonnegative, we can ZERO-EXTEND, as we did above
SUB AH,AH
93
if it happens to negative, then we should instead fill AH with
1s, not 0’s:
MOV AH,0FFh
In general, the extension byte(or bytes) will be filled with the
sign of the
number, the procedure itself is therefore called SIGN
EXTENSION.
One practical difference between the two types of extensions:
while
128. the unsigned (ZERO) extension can be implemented efficiently
with the
instructions we already have, an implementation of the signed
extension
would require coding the logic like
if (AL>=0) AH=0; else AH=255;
can this pseudocode be written in assembly? Yes, of course. Can
it be
written efficiently? No.
Exercise: Implement the pseudocode above in assembler.
Therefore, the instruction set provides sign conversion
instructions which
we will now introduce
CBW (Convert Byte to Word)
CBW, just like other instructions in this group, does not have
arguments,
they all work on the accumulor. CBW specifically converts AL
to AX, using
sign extension.
129. CWD (Convert Word to Doubleword)
CWD converts 2-byte value stored in AX into 4-byte value
stored on DX and
AX, DX has the low bits.
Why not convert to a single doubleword register? Two reasons:
firstly,
CWD was added before the processor had 32-bit registers, but
see CWDE below.
94
Secondly, in some cases this is actually more convenient; see
the examples
related to division below.
Our example program, written in assembler, would use both of
these
instructions:
MOV AL,z
130. CBW
ADD AX,y
CWD
MOV word ptr x,AX
MOV word ptr x[2],DX
The following two instructions did not exist in the original 8086
and
were added only in 80386, when 32-bit registers were first
introduced:
CWDE (Convert Word to Doubleword Extended)
sign-extends AX into EAX.
CDQ (Convert Doubleword to Quadword)
sign-extends EAX into EDX:EAX (to store a quadword on a
single register
one would need a 64-register!)
131. The four forms given above are the most efficient to use and the
code
should be written in such a way as to have the data to be sign-
converted
on the accumulator; unsigned data conversion, on the other
hand, can be
done on other registers:
MOV BL,z
SUB BH,BH ; now BX has zero-extended value of z.
In addition to these instructions, Intel – beginning with 80386 –
offers
more general forms
95
MOVSX target,source (MOV with Sign eXtension)
MOVZX target,source (MOV with Zero eXtension)
132. where target and source are either memory or registers, at least
one of
the operands must be a register ( memory-to-memory operations
are not
allowed ), and – unlike the usual “size must match” rule, here
the target
should be larger than the source.
For example:
MOVSX AX,AL
MOVZX EBX,DL
MOVSX word ptr DS:[10],BL
MOVZX EAX,byte ptr DS:[BX]
Notice that the first instruction in the example is equivalent to
CBW – in
terms of what it does! – but its encoding is 4 bytes vs only 1
and execution
is slower. Thus, the accumulator forms are still the most
efficient.
133. Encoding: CBW is encoded as a single byte 0x98, CWD as a
single byte
0x99.
Data conversion instructions are not considered to be
arithmetic, there-
fore no flags are altered.
WARNING: X86 emulator does not support 32-bit instructions,
so CWDE,
CDQ, MOVSX, MOVZX would be rejected by it. Maybe one
day :P
Now, what about conversion from a larger data size to a smaller
one?
Such situations surely happen, for example:
short a, int b;
a=b;
96
134. Well, in general this cannot be done: the data range of int is
larger
than that of short, so not every int value can be correctly
represented as
a short. Most compilers would issue a warning and proceed by
saving only
the low word:
MOV EAX,b
MOV a,AX
half of the EAX register is not saved at all! This may or may
not work
correctly, depending on the value of b, and assembler cannot
solve what is
impossible to solve. Assembler, however, offers a simple test
that checks if
the operation will work correctly:
MOV AX,word ptr b % lo half of b
MOV BX,word ptr b+2 % hi half of b
CWD
135. CMP DX,BX
JNZ error
MOV a,AX
In other words, if the true value of b can be recovered from the
lo half,
the number fits in the short range. For unsigned types, the check
is even
simpler
CMP word ptr b+2,0 % hi half of b
JNZ error
MOV AX,word ptr b % lo half of b
MOV a,AX
3.19 Multiplication and Division
These two operations were left to the end, because of the
differences from
136. other operations.
One of the differences is the higher cost: multiplication is
slower (processer-
dependent, but 3 times slower is generally a correct estimate)
from other
operations; division is yet slower. (compare this with
performing the op-
eration on paper: addition and subtraction are easy,
multiplication is more
97
difficult, and if forgot how unpleasant division is, review your
elementary
school notes!).
Because of this one should always try to use these operations
sparingly,
or – if possible – not use them at all. For example,
multiplication by a
power of 2 should never be done as multiplication, same goes
for division
137. by a power of 2. This is far from the only possible savings.
Another problem – specific to multiplication – is the growth of
the
length of the result. With binary operations studied so far the
lenght of
the result is the same as the length of the arguments (AND, OR,
. . . ) or
longer by only one bit which can be saved in carry (ADD, SUB,
. . . ). Multi-
plication may double the lentgh – this requires a different type
of syntax,
and different understanding of the “size must match” rule.
(With division,
we will see yet another form of syntax and a direct violation of
the “size
must match” rule.)
Further, not every register is capable of doing these operations.
The
most common/usable forms require the accumulator (or work
best on the
accumulator).
Finally, multiplication and division provide different operation
138. for signed
and unsigned arithmetic.
With this in mind, let’s look at a sample instruction:
IMUL r/m (Integer (signed) MULtiplication)
We notice that while multiplication operation requires two
arguments,
the format specifies only one. The register or memory operand
is multiplied
by the accumulator of the same size as the operand. For
example:
IMUL CL ; multiply CL by AL (8 bit)
IMUL word ptr DS:[BX]; multiply memory word by AX (16 bit)
IMUL ESI ; multiply ESI by EAX (32 bit)
the result is saved in the extended accumulator, it is AX for 8
bit operands,
DX:AX for 16 bit arguments and EDX:EAX for 32 bit
arguments. In all
cases the length of the extended accumulator is twice the length
139. of the
arguments. This assures that the result is computed correctly,
but does not
98
suggest what to do with the result if it is too large to work with
(64 bit in a
32-bit program, or 128 bit after a 64-bit multiplication that also
exists.)
Example:
MOV AL,7
MOV CL,13
IMUL CL ; results in AX being 7*13=91
MOV AL,255
MOV CL,255
140. IMUL CL ; results in AX being 1.
in the second multiplication the arguments are interpreted as
signed num-
bers, so we are actually squaring -1!.
(Intel opcode names are not always consistent. IMUL uses the
word Integer
to mean Signed! Unsigned multiplication that we look at next is
also an
integer multiplication, all the x86 registers are integer!
The unsigned counterpart is
MUL r/m ((unsigned) MULtiplication)
and follows the same format as IMUL. The results, however,
may be differ-
ent:
MOV AL,7
MOV CL,13
MUL CL ; results in AX being 7*13=91
141. MOV AL,255
MOV CL,255
MUL CL ; results in AX being 0xFE01 = 65025
We now turn our attention to division. In addition to the already
men-
tioned, it has one extra feature: division is not always possible.
But let us
begin with the format:
99
IDIV r/m (Integer (signed) DIVision)
Once again, some of the arguments are implicit, the explicit
argument
is the Divisor in the operation. The quotent will appear in the
accumulator,
of the same size as the divisor; the divident is taken from the
extended
accumulator. For example:
142. MOV AX,100
MOV BL,7
IDIV BL ; result (14) in AL
...
IDIV word ptr ds:[10]
; DX:AX is divided by memory word, result in AX
...
IDIV ESI ; EDX:EAX is divided by ESI, result in EAX
Division operations compute the remainder, which is stored in
the 2nd
half of the extended accumular (AH, DX, EDX ), this makes
separate MOD
operation unnecessary.
DIV r/m ( (unsigned) DIVision)
143. it the unsigned counterpart of IDIV
Exercise: find inputs for which IDIV and DIV produce different
results.
In some cases division cannot be performed and results in an
exception
(often meaning a program crash or even an OS crash:
SUB CX,CX
IDIV CX
100
division by zero is the primary example of it, the above example
will fail
regardless of the value of the divident.
In case of a division overflow (division by zero is one example
of it,
but not the only one), the hardware will call INT 0 (see
explanation of
144. interrupts and handlers which is TBW). What happens next fully
depends
on the installed interrupt handler; the outcome may be an error
message
and program termination, crash of the operating system, or a
normal exit
from the program with a error message like your program has
performed an
illegal operation.
Division overflow may occur with a non-zero divisor too:
MOV AX,1000
MOV CL,2
IDIV CL
While we are dividing by 2, this division cannot be done:
1000/2 is 500,
too large for a byte register (AL). The outcome is the same as
with division
by zero, and in fact some exception handlers may report this as
a division
by zero – it is not.
145. In the next example, division result (same 500) is to be placed
into 2
byte accumulator.
MOV AX,1000
MOV CX,2
IDIV CX
this appears possible, but an overflow still may occur.
Exercise: Understand why
In general to avoid overflows one should use longer registers, a
byte
division is very overflow prone.
IMUL, unlike the other instructions in this group, has additional
formats,
including multiplication by an immediate.
TODO : discuss MULDIV
TODO : discuss DIVMOD
146. 101
3.20 LEA : Load effective address
LEA s,t (Load Effective Address)
s=&t;
LEA is not an arithmeric operations sensu stricta, but it is
closely related
to address computation used in arithmetic instruction.
We begin with a closer look at what arithmetic instructions
actually do,
for example
ADD AX,[BX+SI+10]
Two parts of the computation are:
1. compute the memory address : BX+SI+10
147. 2. retrieve the data from the computed address and add it to the
AX
register.
We notice that our sample instruction actually does three
additions, not
one!
The first part of the computation is exactly what LEA does, and
using
LEA we can break the addition into two steps:
LEA DI,[BX+SI+10]
ADD AX,[DI]
This is an equivalent, and obviously slower computation – we
bring it
for illustration purposes only, not a suggestion for coding.
LEA is more restrictive than other 2-operand instruction in
terms of the
allowed operands: the second operand must be a memory
(otherwise the
concept of an address does not apply; the first argument is
148. therefore always
a register.
LEA, not being an arithmetic instructions, does not affect flags.
The size must match rule does not apply to LEA – it actually
does not
move data. LEA allows the arguments to be of different sizes,
albeit such
forms of the instruction are rarely useful. The following are all
valid exam-
ples with different sizes of the arguments (16 and 32):
102
LEA DI,[BX+SI+3] ; 16 and 16
LEA EDI,[BX+SI+3] ; 32 and 16, 0-extension used
LEA DI,[EAX+EBX+3] ; 16 and 32, truncation
LEA EDI,[EAX+EBX+3] ; 32 and 32
149. With static addresses, LEA does not accomplish anything more
than MOV
can do :
msg DB ’Hello, World!’
LEA SI,msg
MOV SI,offset msg
(but notice the difference in the syntax). In this case, MOV is
slightly
more efficient (3-byte encoding rather than 4) and some
assemblers (TASM!)
will actually compile LEA as MOV!
With addresses that include registers, LEA can be emulated as
two or
three instructions:
nums DD 1000 dup(?)
......
LEA AX,nums[BX]
150. ......
MOV AX,offset nums
ADD AX,BX
One important but rarely known detail is that the ability of LEA
to per-
form “free” additions translates to its ability to do free
multiplications, lead-
ing to a new fast way to multiply by small numbers.
Recall1 that in 32-bit mode the addresses are in the form
[base*scale+index+offset]
with base being any 32-bit register but ESP, index being any
32-bit register,
scale is 1,2,4 or 8.
Now, consider the following example:
1actually this section has not been written!—maybe it will be
yet
151. 103
LEA EAX,[EAX+EAX*2]
This is a valid 32-bit addressing instruction, we can use the
same regis-
ter as the “index” and the “base”. The effect of the instruction
is multipli-
cation by 3, not surprisingly faster than IMUL. Using
LEA EAX,[EAX+EAX*2]
LEA EAX,[EAX+EAX*4]
to multiply by 15 is also faster than IMUL, despite being two
instructions.
LEA EAX,[EAX+EAX*2]
SHL EAX,4
is a faster way to multiply by 12, and so on. For many small
152. numbers there
are similar tricks.
3.21 Long integers: ADC and SBB
Integer arithmetic can be performed on numbers that consists of
more bits
than the register size. This was done already on 8-bit processors
in the Dark
Ages of CP/M; exactly the same approach works on modern 32-
and 64-
processors to work with numbers that are yet longer. While the
technique
is the same, the usability of it is less now since for most
computations even
32-bit integers are sufficient.
We will first work out the technique mathematically using 8-bit
registers
only, while trying to add 16-bit numbers. (The techique is fully
scalable, the
choice of small register sie is to make the examples easier to
understand.)
Consider this calculation:
153. short x,y,z;
z=x+y;
Using 16-bit registers we can write it in assembler as
; program 1
MOV AX,x
ADD AX,y
MOV z,AX
104
We now assume that the 16-bit registers are not available and
try to perform
it using 8-bit registers only. How about this code?
; program 2
154. MOV AL,byte ptr x
ADD AL,byte ptr y
MOV byte ptr z,AL
MOV AH,byte ptr x[1]
ADD AH,byte ptr y[1]
MOV byte ptr z[1],AH
for some values it will indeed work. For example, for x=1234h
and y=5678h,
the result would be 68ACh, same in both programs:
1234h 12h 34h
+ 5678h + 56h 78h
----- -------
68ACh 68h ACh
For others, the result would different. If x=80h (128 decimal)
155. and y=80h,
the 16-bit program will produce 100h (256 decimal) correctly,
whereas the
8-bit program above will produce 0: adding 80h with itself on
an 8-bit
register gives 0!.
0080h 00h 80h
+ 0080h + 00h 80h
----- -------
0100h 00h 00h
The source of the problem is the carry produced by the first 8-
bit addi-
tion: we are not using it. 80h+80h indeed results in 8 zero bits,
but also
a carry, indicating an extra one to be added to the 9th bit – but
we are
ignoring it.
It is actually possible to correct the above program without
introducing
156. new instructions:
; program 3
MOV AL,byte ptr x
ADD AL,byte ptr y
105
MOV byte ptr z,AL
MOV AH,byte ptr x[1]
JNC skip
INC AH
skip:
ADD AH,byte ptr y[1]
MOV byte ptr z[1],AH
157. (the INC instruction adds the previously ignored carry), but the
Intel in-
struction set has a shortcut that would result in a faster code:
; program 4
MOV AL,byte ptr x
ADD AL,byte ptr y
MOV byte ptr z,AL
MOV AH,byte ptr x[1]
ADC AH,byte ptr y[1]
MOV byte ptr z[1],AH
The new ADC instruction adds the 2nd argument and the carry
to the
1st argument – this is exactly the computation needed.
ADC t,s; t=t+s+carry (ADd with Carry)
158. Except for the addition of the carry, ADC follows exactly the
same rules as
ADD, including the syntax and setting of the flags.
We now scale the example: how about adding 128-bit numbers?
This
cannot be done with a single addition on any of the current
processors –
you would need 128-bit registers for that!
In this example we will assume that x occupies 16 bytes (128 =
16 ×
8) of memory beginning with address 100, y and z (same size
both) are
located at 200 and 300. We further assume that the bytes in out
128-bit
integers are written in the Intel reversed order. The objective is,
as before,
to compute z=x+y.
How about this?
106
160. MOV dword ptr z[12],EAX
notice that this computation could have been done using only
two 64-bit
additions, or using 8 16-bit additions, or using 16 8-bit
additions. The code
would look similar in all cases; naturally the code using longer
registers
will have fewer instructions and would run faster.
The code shown in the previous program can be extended to
handle
interegs of any size. It may be preferrable, however, to write
this code as
a loop. We notice that all the editions are done using the ADC
instruction,
except for the very first one, done by ADD – this is because
there is no carry
on the very first addition. We can use ADC for all additions if
we are certain
that the initial value of the carry flag is 0. The loop form of the
previous
program thus becomes:
; program final??
161. CLC ; clear carry
MOV CX,4
SUB BX,BX
L: MOV EAX,dword ptr x[BX]
ADC EAX,dword ptr y[BX]
MOV dword ptr z[BX],EAX
ADD BX,4
LOOP L
NOTE: the program above should be seen just an idea; to make
it work
one needs to ensure that the carry flag is not corrupted by other
instruc-
tions! – this is left to the reader.
The new instruction
162. 107
CLC (CLear Carry)
sets carry to 0, thus allowing us to use ADC for the first loop
iteration.
We may as well introduce two other instructions that also alter
the carry
flag.
STC (SeT Carry)
set carry to 1
CMC (CompleMent Carr)
toggle the carry bit
Yet longer numbers can be handled with the code above by
simply in-
creasing the number of times the loop is executed. The
execution time will
163. be proportional to O(n/r) where n is the bitlength of the
numbers, and r
is the bitlength of the register.
A similar problem for subtraction will not be examined in
details. We shall
only say that the analog of ADC for subtraction is the SBB
instruction:
SBB t,s; t=t-s-carry (SuBtract with Borrow)
and all of the above examples can be adapted to subtraction by
changing
all ADD’s to SUB’s and all ADC’s to SBB’s.
108
3.21.1 24-bit case
As a special example, consider 24-bit (3 byte) numbers.
Exercise: Why would one even want to look at such?
164. Addition of such numbers can be done as three byte size
additions or as
one byte size and one word size.
; program 6
x db 3 dup(?)
y db 3 dup(?)
....
MOV AL,byte ptr x[0]
ADD AL,byte ptr y[0]
MOV byte ptr z[0],AL
MOV AL,byte ptr x[1]
ADC AL,byte ptr y[1]
MOV byte ptr z[1],AL
MOV AL,byte ptr x[2]
165. ADC AL,byte ptr y[2]
MOV byte ptr z[2],AL
Notice that even in column addition of decimal numbers, chunks
do not
need to be of the same size:
923
+ 278
---
we can add 23+78, obtaining 01 and carry of 1!
In programs below we break the numbers into byte+word and
word+byte
chunks.
109
166. ; program 6a
x db 3 dup(?)
y db 3 dup(?)
....
MOV AL,byte ptr x[0]
ADD AL,byte ptr y[0]
MOV byte ptr z[0],AL
MOV AX,word ptr x[1]
ADC AX,word ptr y[1]
MOV word ptr z[1],AL
; program 6b
x db 3 dup(?)
y db 3 dup(?)
167. ....
MOV AX,word ptr x[0]
ADD AX,word ptr y[0]
MOV word ptr z[0],AX
MOV AL,byte ptr x[2]
ADC AL,byte ptr y[2]
MOV byte ptr z[2],AL
Exercise: Are 6a and 6b equivalent?
3.21.2 other operations
• Bitwise AND, OR, XOR, NOT simply repeat the same
operation on
each chunk (loops for long data).
• Shifts propagate carry.
168. • NEG is left as a (nice!) exercise.
• (I)MUL is more complicated
• (I)DIV is even more complicated
110
The idea of long MUL:
Assume x,y,z is twice the size of the register that is capable of
multipli-
cation (N-bit long).
Then x=xh ×R + xl, y=yh ×R + yl, where xl,yl are the low
halves of
the values, xh and yh are the high halves, R is 2N.
x∗ y = (xh∗ R+xl)∗ (yh∗ R+yl) = xhyh∗ R2 + (xhyl
+xlyh)∗ R+xlyl)
The formula contains 4 multiplications (multiplications by R are
simply
169. data movements)
Assume now that x and y are 64 bit, z is 128 bit.
x dd ?,?
y dd ?,?
z dd ?,?,?,?
SUB EAX,EAX
MOV z[8],EAX
MOV z[12],EAX
MOV EAX,x[0]
MUL y[0]
MOV z[0],EAX
MOV z[4],EDX
MOV EAX,x[0]
171. MUL EDX
ADD z[8],EAX
ADC z[12],EDX
This seems to required 4 multiplications (and generally m*N
numbers
would require m2 multiplications); in reality only 3 are needed.
(Toom’s
algorith – google for more)
Idea:
x∗ y = (xh∗ R+xl)∗ (yh∗ R+yl) = xhyh∗ R2 + (xhyl
+xlyh)∗ R+xlyl)
requires 4 multiplications, but
x∗ y = (xh∗ R+xl)∗ (yh∗ R+yl) =
xhyh∗ R2+[(xh+xl)∗ (yh+yl)−xhyh−xlyl]∗ R+xlyl)
requires only 3:
• xh ∗ yh
172. • xl ∗ yl
• (xh + xl) ∗ (yh + yl)
Division algorithms
112
https://en.wikipedia.org/wiki/Division_algorithm
Chapter 4
Jumps
4.1 Unconditional Jumps
JMP label (unconditionally JuMP (goto) to label)
where label is any identifier, not a keyword, unique. Some
scope rules
apply.
JMP lab
173. lab:
JMP lab
(Jumps can go both forward and backward, multiple JMP
instructions
may target the same label).
l: JMP l
(fine, but infinite loop)
113
NOTE: label is a keyword, so this id cannot be used. In fact, the
colon
: is an abbreviation for label near.
Other forms of JMP syntax exist, including
JMP r/m (jump(goto) to address stored in r/m)
174. Only a trivial example for now:
MOV AX,offset lab
JMP AX
....
lab:
(offset is a keyword, cf. the “take address” & operator in C).
114
4.2 Conditional Jumps
J<cc> label (Jump based on cond code)
General idea: conditional jumps implement an equivalent of
if (<condition>) goto lab;
where <condition> is usually expressed in flags.
175. NOTE: of course, this is exactly the programming style not
recommended
for high-level languages!
115
4.2.1 Using Zero Flag
if (ZeroFlag) goto lab;
corresponds to an assembler instructio n, JZ.
JZ label (Jump if Zero flag)
Example:
; int x;
;
; if (x!=0)
176. ; { <BODY> }
;
CMP x,0
JZ skip
; <BODY>
skip:
if x is already on a register, CMP may not be needed.
; int x,y,z;
;
; if (x+y+z!=0)
; { <BODY> }
;
MOV EAX,x
177. ADD EAX,y
ADD EAX,z
JZ skip
; <BODY>
skip:
116
JE label (Jump if equal)
is a synomym of JZ
; int x,y;
;
; if (x!=y)
178. ; { <BODY> }
;
MOV EAX,x
SUB EAX,y
JE skip
; <BODY>
skip:
JNZ label (Jump if not Zero flag)
JNE label (Jump if not Equal)
; int x;
;
; if (x==0)
; { <BODY> }
180. ;
MOV EAX,x
SUB EAX,y
JNE skip
; <BODY>
skip:
NOTE: The four (actually, two) instructions JZ, JE, JNZ, JNE
are based
on the Zero Flag only.
NOTE: The N “prefix” in the instruction name is used with
other instruc-
tions.
NOTE: Conditions in conditional jumps are always reversed: we
specify
skipping, not doing!
118
181. 4.2.2 Using Sign Flag
JS label (Jump if Sign)
JNS label (Jump if Not Sign)
; int x,y;
;
; if (x<0)
; { <BODY> }
;
CMP x,0
JNS skip
; <BODY>
182. skip:
119
JL label (Jump if Less)
JLE label (Jump if Less or Equal)
JG label (Jump if Greater)
JGE label (Jump if Greater or Equal)
JNL label (Jump if Not Less)
JNLE label (Jump if Not Less or Equal)
JNG label (Jump if Not Greater)
JNGE label (Jump if Not Greater or Equal)
(There are only 4 different instructions above!)
120
183. ; int x,y;
;
; if (x<=y)
; { <BODY> }
;
MOV EAX,x
SUB EAX,y
JNLE skip ; reverse Less-or-Equal cond
; <BODY>
skip:
121
184. WARNING: the explanation below is simplified, actually JS and
JL are
different, and the Overflow flag plays a role. For nearly all
practical code
the results are the same. For more details (and true picture)
check this.
Why this extra mess? Because CMP may overflow!
How does this work?
Primary instructions: JL (==JNGE), JLE (==JNG), JNL
(==JGE), JNLE
(==JG).
We assume JL ≈ JS. Then
JL jumps if sign flag is set
JLE jumps if sign or zero flag is set
JNL jumps if sign flag is not set
JNLE jumps if neither sign or zero flag is set
186. done:
Fix the error above! The correct code is
; EAX=abs(EAX)
OR EAX,EAX
JNS done
NEG EAX
done:
Example: implement SIGN function
(Sign definition:
sign(x) =
{−1 if x < 0
0 if x = 0
1 if x > 0
Actually
187. ∀ x,x = abs(x) × sign(x)
; EAX=sign(EAX)
OR EAX,EAX
JZ done
123
MOV EAX,1
JG done
NEG EAX
done:
Example:
What does this do?
MOV CX,10
188. SUB AX,AX
l: ADD AX,CX
DEC CX
JNZ l
124
These instructions are not suitable for unsigned arithmetic!
Example
MOV AL,-1
CMP AL,1
JL lab
...
lab:
189. This works correctly, JMP is taken
MOV AL,255
CMP AL,1
JL lab
...
lab:
Counter-intuitively, the jump is taken, this is because 255 = −1
and is
interpreted as such!
Correct way to program this is
MOV AL,255
CMP AL,1
JB lab
...
190. lab:
4.2.4 Using Carry Flag
JC label (Jump if Carry)
JNC label (Jump if Not Carry)
125
MOV EAX,x
ADD EAX,y
JC error ; overflow !
...
error:
126
191. JB label (Jump if Below)
JBE label (Jump if Below or Equal)
JA label (Jump if Above)
JAE label (Jump if Above or Equal)
JNB label (Jump if Not Below)
JNBE label (Jump if Not Below or Equal)
JNA label (Jump if Not Above)
JNAE label (Jump if Not Above or Equal)
Only 4 different instructions here, further, JB==JC, JNB==JNC.
127
Exercise: Work out how the flags are used
192. 4.2.5 Mixing signed and unsigned
Consider two examples:
//example S
int x,y;
if (x<y) ....
//example U
unsigned int x,y;
if (x<y) ....
In Example S, compiler will generate a CMP instruction,
followed by an
inverted signed jump, specifically here JNL.
In Example U, compiler will generate a CMP instruction,
followed by an
inverted unsigned jump, specifically here JNB.
193. This makes it easy for the user — he does not need to think
which
instruction to use. But there is a problem:
//example M
int x; unsigned y;
if (x<y) ....
Using either signed or unsigned instruction will work
incorrectly in
some cases! Compilers would typically issue a warning (f.e.
“signed un-
signed mismatch”, often ignored by the programmer.
How to do this correctly?
Let’s consider a smaller case first.
128
//example MB
194. unsigned char x, /*signed*/ char y;
if (x<y) ....
8-bit comparison will not work. But conversion into a range that
in-
cludes both char and unsigned char will:
//example MB1
unsigned char x, /*signed*/ char y;
short sx,sy;
sx=x; sy=y;
if (sx<sy) ....
or, simply:
//example MB1
unsigned char x, /*signed*/ char y;
if ((short)x<(short)y) ....
195. Since short is a signed datatype, generated instruction will be
signed.
When comparing signed and unsigned shorts, we can convert
both to
int (in fact, we could have used int above too).
But what do we do when comparing signed and unsigned int’s?
In assembler, the above is of course also doable, but so is the
32-bit
comparison!
(The code cannot be shown at this time : we bypassed
conversion in-
structions).
In C a proper comparison function can be written too.
129
4.3 Jumps encoding
196. opc del
where opc is the opcode (0x74 for JE, 0x78 for JS,...) and del is
a signed
byte.
IP is first incremented to point to the next instruction, then, if
the jump
is taken, IP=IP+del;
The short form of JMP is encoded similarly, using the opcode
byte 0xEB
0xEB del
The near form of JMP uses opcode 0xE9 and 2-byte (16-bit
segs) or 4-
byte (32-bit segs) delta.
16 bit seg : 0xEB d1 d2
32 bit seg : 0xE9 d1 d2 d3 d4
If conditional jump cannot reach its target, one can overjump:
197. JZ lab
; more than 128 bytes
lab:
may not compile, but
JNZ temp
JMP lab
temp:
; more than 128 bytes
lab:
130
will.
80386+ has alternative longer forms of conditional jumps
198. instructions,
with 16-bit delta, they begin with bytes 0Fh 8?h ..
4.4 LOOPS
Revisiting the previously seen example:
MOV CX,10
SUB AX,AX
l: ADD AX,CX
DEC CX
JNZ l
We notice that adding numbers 10 down to 1 is more efficient
than
doing 1 to 10.
131
199. MOV CX,1
SUB AX,AX
l: ADD AX,CX
INC CX
CMP CX,11 ; EXTRA
JNZ l
Generally, running loops toward 0 is more efficient!.
We next notice that we actually implemented a do while loop.
; cx=1; ax=0; do { ax=ax+cx; cx++; } while (cx<=10);
What about a while ?
; cx=1; ax=0; while (cx<=10) {ax=ax+cx; cx++; }
MOV CX,1
SUB AX,AX
200. l: CMP CX,10 ; EXTRA
JA done
ADD AX,CX
INC CX
JMP l ; EXTRA
done:
Each execution of while in n iteration loop requires 2n jumps vs
n for
do while.
Generally, do-while is more efficient than while!.
We will not consider for – generally equivalent to while.
Returning to
MOV CX,10
SUB AX,AX
201. l: ADD AX,CX
DEC CX
JNZ l
This is a very common form of a loop, and Intel allows to
optimize this
further.
132
LOOP label (CX–; if (CX>0) JMP label)
Thus:
MOV CX,10
SUB AX,AX
l: ADD AX,CX
LOOP l
202. for one instruction fewer!
We can generalize this into
MOV CX,n
SUB AX,AX
l: ADD AX,CX
LOOP l
But what will happen if supplied n is 0 ?
One solution:
MOV CX,n
SUB AX,AX
OR CX,CX
JZ done
203. l: ADD AX,CX
LOOP l
done:
Again, Intel provides a shortcut,
133
JCXZ label (if (CX==0) JMP label)
Thus:
MOV CX,n
SUB AX,AX
JCXZ done
l: ADD AX,CX
LOOP l
204. done:
LOOP and JCXZ are encoded similarly to conditional jumps.
LOOP’s op-
code is 0E2h, JCXZ’s is 0xE3.
Mentioning only: there are also less commonly needed LOOPZ
and LOOPNZ;
All four instructions can work off ECX.
Double LOOPs should save CX!
MOV CX,n
L1:
....
**SAVE** CX
MOV CX,m
L2:
205. ....
LOOP L2
**RESTORE** CX
LOOP L1
SAVE/RESTORE can be done using another reg (for example,
XCHG CX,DX),
memory, or stack (usually the best).
134
One more “real” example:
MOV CX,100
MOV AX,1
MOV BX,1
L: ; print AX
206. ADD BX,AX
XCHG BX,AX
LOOP L
4.5 Control statements templates
; goto lab;
;
JMP <lab>
lab:
; if <COND> <BODY>
;
; convert condition into a flag
JN<cond> skip
207. <BODY>
skip:
135
; if <COND> <BODY1> ;else <BODY2>
;
; convert condition into a flag
JN<cond> lelse
<BODY1>
JMP done
lelse:
<BODY2>
done:
208. ; while (true) <BODY>
; for (;;) <BODY>
lagain:
<BODY>
JMP lagain
ldone: ; may be used for break in <BODY>
; while <COND> <BODY>
;
lagain:
; convert condition into a flag
JN<cond> ldone:
<BODY>
209. 136
JMP lagain
ldone:
; do <BODY> while <COND>;
;
lagain:
<BODY>
; convert condition into a flag
J<cond> lagain
; while <COND> { <BODY1> break; <BODY2>}
;
lagain:
210. ; convert condition into a flag
JN<cond> ldone:
<BODY1>
JMP ldone
<BODY2>
JMP lagain
ldone:
137
; while <COND> { <BODY1> continue; <BODY2>}
;
lagain:
211. ; convert condition into a flag
JN<cond> ldone:
<BODY1>
JMP lcont
<BODY2>
lcont:
JMP lagain
ldone:
NOT covered: for, switch, and how to deal with more complex
expres-
sions (see “complete and shortcut” section).
138
Chapter 5
212. Variables
5.1 Declaring variables
Assembler language includes operators for declaring variables.
The primitive data declarations operators correspond to the
primitive
types in the assembler language:
139
[name] DB value[s] (Declare Byte)
[name] DW value[s] (Declare Word)
[name] DD value[s] (Declare Doubleword)
[name] DQ value[s] (Declare Qword)
[name] DT value[s] (Declare Tenbyte)
213. The [name] field of the declaration should be a unique
identifier; this
field is usually present but not required. The [name] field is
used to refer to
the variable and if absent, there is no way to refer to it.
The value field refers to the initial value of the variable, this
field is
always required.
Examples:
year dw 2020
a1 db ’a’ ; as char
a2 db 97 ; as decimal
a3 db 61h ; as hex
a4 db 01100001b ; as binary
dwrd dd 12345678
notice that a1,a2,a3 and a4 all provide the same initial value, in
214. different
140
formats. ’a’ is 97 since the character ’a’ is in the 97th position
in the
ASCII table.
Multiple initialization values are allowed, in such cases the
declared
variable is actually an array:
primes DW 2,3,5,7,11,13
str1 DB ’H’,’e’,’l’,’l’,’o’
str2 DB ’Hello’
str3 DB "H",’e’,"L","L",’o’
str4 DB "HELLO"
str5 DB ’HE’,"LLO"
215. All five strings above contain identical data: both single of
double quotes
are allowed – but single quotes must match single quotes and
double quotes
must match double. Further, strings can be enteres either as
strings or as
sequences of characters.
s DB ’Hello, World’,0
shows a zero-terminated string (C language string format).
One would use single quotes to enter a string that includes
double
quotes and vice versa:
m DB ’"Hello", he said’
and will have to break the string into parts if it includes both
single and
double quotes
m DB ’"Don’,"’",’t", he said’
216. Long sequence of initialization values can be broken into
multiple dec-
larations:
msg DB "ERROR: you must not use"
DB "this operator the way you do"
DB "please delete the program and"
DB ’read the textbook’,0
141
in the example above we do not name the lines after the first –
the program
will not refer to them.
Often, one needs to calculate the length of a string like msg
above. This
can be done during the run time, by counting the characters up
to the
terminator, or during compilation:
217. msg DB "ERROR: you must not use"
DB "this operator the way you do"
DB "please delete the program and"
DB ’read the textbook’
msgend DB 0
msglen DW offset msgend-offset msg
....
MOV CX,offset msgend-offset msg
The offset operator is similar to the address operator & in
C/C++.
offset msgend-offset msg is a constant, computed during the
compile
time, it is equal to the number of the characters in the message,
excluding
the terminating null character. With this value known we
actually do not
218. need the null terminator at all and can instead use
msg DB "ERROR: you must not use"
DB "this operator the way you do"
DB "please delete the program and"
DB ’read the textbook’
msgend LABEL BYTE
msglen DW offset msgend-offset msg
....
MOV CX,offset msgend-offset msg
where LABEL declares a byte type variable with NO initial
values. LABEL
is used instead of DB since the latter will always allocate at
least one byte.
Other types that may follow LABEL include WORD, DWORD,
QWORD, TENBYTE.
219. To specify an uninitialized variable, use “?”:
x DD ?
y DD ?
142
Declared variable can be referred to in code:
x DW ?
y DW ?
z DW ?
.....
MOV AX,x
ADD AX,y
MOV z,AX
220. Notice that declared variables have size and the usual size rules
apply:
MOV x,10 ; OK, size known from x
MOV AL,x ; error, size mismatch
MOV AL,byte ptr x ; OK, size casted to byte
The dup operator can be used to specify large arrays without
having to
list all the initial values:
ar1 DW 1,1,1,1,1
ar2 DW 5 dup(1)
two arrays above contain exactly the same data. More
practically useful
example would be
ar3 DW 1000 dup(?)
– a 1000-element unitialized array.
221. Initialized and unitialized entries can be used within the same
declara-
tion:
aa DW 1,2,?,?,5,6
fib DW 1,1,998 dup(?)
the fib array will be used to compute the Fibonacci numbers
below. First
two elements are initialized statically, the subsequent entries
will be com-
puted.
143
Data declarations are not executable instructions and in general
one
should separate them from code so they do not get executed.
This is usually
accomplished by placing them into a separate segment. If there
is only one
222. segment (.com files), then the typical approach is to structure
the program
as follows:
start: ; entry point of the program
JMP real_start
.....
data and procedure declarations
.....
real_start:
; code begins here
because of overjumping the data, it will never get executed.
An exception to the above is the case when we know the exactly
which
instructions the data corresponds to. For example,
MOV AX,BX
223. DB 90h
ADD SI,DI
is totally safe, 90h is the NOP operation! While there is no good
reason to
code NOP as data, similar approach can be used to enter an
instruction that
is not supported by the assembler compiler, or a form of the
instruction
assembler will not produce.
There are two different ways to encode ADD BX,CX as machine
code.
One of the is normally produced by the assembler compiler, the
other can
only be entered using db’s.
Exercise: Do it.
5.2 Using arrays
This section contains several examples of common snippets of
code.
224. Example 1: initialize a 256-element byte array to contain all
possible
256 characters in ascending order.
144
ba DB 256 dup(?)
.....
SUB BX,BX
MOV CX,256
l: MOV ba[BX],BL
INC BX
LOOP l
notice that we use the same register as both index and the
current value to
225. be stored; this is possible only because of the byte size of the
data.
Example 2: initialize a 1000-element word array to contain
numbers 0
through 999 in ascending order.
nums DW 1000 dup(?)
.....
SUB AX,AX
SUB BX,BX
MOV CX,1000
l: MOV nums[BX],AX
INC AX
ADD BX,2
LOOP l
226. Exercise: Explain what would happen if BX is incremented only
by 1.
Generally, the index in loops like above needs to be
incremented by the
byte size of the data – the index counts bytes, and not elements.
One very
important implication is that in high language loops of this type
there is
a hidden multiplication by the element size, depending on the
compiler it
may or may not be implemented as a multiplication.
Consider two ways to implement the above code in C :
short s[1000]; int i;
for (i=0; i<1000; i++)
s[i]=i;
145
227. this code has a hidden multiplication by 2, index i needs to be
translated
into an offset to compute the address of s[i]. The other way to
code the
example
short s[1000]; int i; short *p=s;
for (i=0; i<1000; i++)
*p++=i;
does not have a hidden multiplication and often would result in
better com-
piled code.
Example 3: for collection, let us also initialize a 1000-element
double-
word (int) array to contain numbers 0 through 999 in ascending
order.
nums DD 1000 dup(?)
.....
228. SUB EAX,EAX
SUB BX,BX
MOV CX,1000
l: MOV nums[BX],EAX
INC EAX
ADD BX,4
LOOP l
Example 4: Compute sum and average of the elements in a
1000-
element word array (in this and subsequent examples we assume
nums to
be initialized with some values first.)
nums DW 1000 dup(?)
sum DW ?
ave DW ?
230. IDIV BX
MOV ave,AX
(note that the answer will be rounded down to an integer)
Example 5: Find the largest element in an array of signed
numbers
nums DW 1000 dup(?)
max DW ?
.....
MOV AX,nums[0]
MOV BX,2
MOV CX,999
l: CMP AX,nums[BX]
JGE skip ; use JAE for unsigned
MOV AX,nums[BX]
231. skip:ADD BX,2
LOOP l
MOV max,AX
Use JLE and JBE if searching for the smallest entry.
Example 6: Compute Fibonacci numbers
fibs DW 1,1,998 dup(?)
.....
MOV BX,4
MOV CX,998
l: MOV AX,fibs[BX-4]
ADD AX,fibs[BX-2]
MOV fibs[BX],AX
ADD BX,2
232. LOOP l
147
Exercise: Would all the values be computed correctly?
Example 7: Find value t in an array of numbers
nums DW 1000 dup(?)
max DW ?
......
MOV AX,t
SUB BX,BX
MOV CX,1000
l: CMP AX,nums[BX]
233. JE found
ADD BX,2
LOOP l
not_found:
.....
found:
.....
A more interesting program would be a binary search... it not all
that
difficult in assembler. We assume nums contains signed
numbers in ascend-
ing order.
nums DW 1000 dup(?)
......
MOV AX,t
234. MOV SI,0 ; left interval bound
MOV DI,999*2 ;right interval bound
l: CMP SI,DI
JA not_found
148
MOV BX,SI
ADD BX,DI
SHR BX,2
SHL BX,1 ; middle point
CMP t,nums[BX]
JE found
236. ....
Exercise: Reverse the elements in an word array.
Exercise: Implement bubble sort.
149
5.3 Memory addressing syntax
16 bit addressing offers only the following nine schemes of
entering the
offset
• const
• BX+const
• BP+const (***)
• SI+const
237. • DI+const
• BX+SI+const
• BX+DI+const
• BP+SI+const (***)
• BP+DI+const (***)
Notes:
How to remember:
Base, or Index, or Both, or Neither
Double register addressing allows to efficiently access two
dimensional
array, but more often is used to access dynamic one-dimensional
arrays
(base points to the beginning of an array, index “indexes” it.)
This limitation allows for a compact encoding of the addressing
scheme.
Despite nine choices offered, only three bits are needed.
Options marked with (***) imply SS: segment use (others
238. default to
DS:).
150
5.4 Video memory in text mode.
For the default text video mode, the video memory begins in the
middle
of the B band, corresponding to the segment value B800. (the
reasons are
historical).
Let us begin with an example showing the idea, we will specify
the exact
rules later.
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
239. INT 20h ; terminate the program
If compiled (you would need to add extra lines to the file per
assembler
language syntax requirements) and run this program will deposit
letter A
into the upper left corner of the screen!
The general layout of video memory is the following:
The screen has 25 rows and 80 columns of characters; each cell
cor-
responds to two bytes in the video memory. The even bytes (0,
2, 4, ...)
are the ASCII characters, the odd bytes define the attributes
(colors) of the
character in the preceding even address. Let us provide an
example of this:
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
240. MOV byte ptr ES:[1],71h
INT 20h ; terminate the program
151
71h defines the attributes of the letter A on screen. The first
nibble (7)
is the background color, the second (1) is the foreground color:
A would
appear as blue on white, following these definitions:
• 0 – black (0x000000)
• 1 – blue (0x0000AA)
• 2 – green (0x00AA00)
• 3 – cyan (0x00AAAA)
• 4 – red (0xAA0000)
• 5 – magenta (0xAA00AA)
241. • 6 – brown (0xAA5500)
• 7 – white/light gray (0xAAAAAA)
• 8 – (dark) gray (0x555555)
• 9 – bright blue (0x5555FF)
• 10 – bright green (0x55FF55)
• 11 – bright cyan (0x55FFFF)
• 12 – bright red (0xFF5555)
• 13 – bright magenta (0xFF55FF)
• 14 – yellow (0xFFFF55)
• 15 – bright white (0xFFFFFF)
(In some modes the high bit of the background color indicates
blinking
rather than color, other variations of interpretation exist).
242. While MOV is the most common instruction to change video
memory,
any other instruction can be used. Consider:
152
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
MOV CX,256
L: INC byte ptr ES:[0]
LOOP L
INT 20h ; terminate the program
The program above will show A, then change it to B,....
eventually
coming back to an A.
243. Will you be actually able to see the letters changing onscreen?
Well..
no! This will happen too fast to notice.
Exercise: Fix it
Writing to video memory is the best way of doing full-screen
output,
most suitable for tables and games; other methods are used to
produce
scrollable console output (see INT 21h).
Assuming that the rows are numbered up-to-down 0 through 24,
and
the columns are numbered left-to-right 0 through 79, to write at
position
(X,Y) we should store information at offset 160 ∗ Y + 2 ∗ X in
the video
segment.
153
244. Chapter 6
Project #1
Due Date: 04/19/2021
Goal: implement Game of 1024.
This a new simple board game(rather a puzzle), you can play the
online
version Here.
Project specifics:
1. Use direct write to the video memory (seg 0xB800).
2. Output will be explained(today), input can be done using int
16h
(look up).
3. Recommendation to use color and line draw characters (2nd
half of
the ASCII OEM set)
4. Projects are individual and should not be copies of source
code found
elsewhere; such submissions will not be accepted.
245. 5. You do not need to follow the layout exactly, shortcuts that
simplify
programming but essentially keep the game the same are fine.
154
https://1024game.org/
Basic grading guidelines:
• C something works (but perhaps not really playable)
• B playable with glitches/deficiencies.
• A enjoyable to play (well... to the degree the entire idea of
such
puzzle is)
6.0.1 Program outline
Hopefully, of some help – this is an overall structure of the
program one
can use.
246. 0 Initialize the program, display.
1 Initialize the configuration
2 Display the board
3 Wait for a key (int 16h)
4 If the key indicates a move of a piece, update the
configuration, go to
step 2.
5 .If the key indicates new game, go to step 1.
6 If the key indicates quit, quit.
7 Ignore the key, go to step 3.
6.0.2 Submission
Submit the source (.asm) and the executable (.com or .exe),
ideally by
email.
247. Please not that some email providers kill attachments of certain
types.
Google is likely to reject .com or .exe, CCNY email does not
like .asm.
To avoid problems : rename the files to give them “safe”
extension, then
zip(or rar) files together. Please do not name the archive
“project.zip”,
rather use your name to name the archive.
(Example: rename project.com to project.com.txt)
155
When submitting, CC: yourself – this way you would see that
the project
does come through.
Submission can be made either my CCNY email or to
[email protected] (the
latter address does not kill any attachments!).
248. Submission is your responsibility.
156
6.1 How to compile and run a program.
Assembler distribution
(Download, unpack, make sure it works. The distribution
contains DOS-
BOX, TASM, TLINK).
At this time we only need to know how to create a binary from
an
assembler program, a sample HELLO.ASM is provided.
.model tiny
.code
org 100h
start:
249. ; BEGIN BODY
mov dx, offset hello
mov ah, 9
int 21h
mov ah, 4ch
int 21h
hello db ’Hello, world.’,13,10,’$’
; END BODY
end start
The part between BEGIN BODY and END BODY is replaced by
your own
code.
To compile: (assuming you are in the directory where hello.asm
is
250. unpacked; under DOSBOX or 32-bit OS)
C> BINTASM hello
This will create file HELLO.OBJ. To link
C>BINTLINK hello /t
157
http://tinyurl.com/cjdyruy
This will create file HELLO.COM. To run
C>Hello
This will say (guess what?)
This output method (console write) will not work right for full-
screen
apps, write into video directly, instead:
.model tiny
251. .code
org 100h
start: jmp real_start
; place your variables and/or procedures here.
real_start:
; BEGIN BODY
MOV AX,0B800h
MOV ES,AX ; ES=>video!
MOV byte ptr ES:[0],’A’
MOV CX,256
L: INC byte ptr ES:[0]
LOOP L
INT 20h ; terminate the program
252. end start
158
Chapter 7
Stack and procedures
Hardware stack discussed in the chapter is not a necessary
feature of a
hardware, many earlier designs did not provide it. However,
having it
helps in many ways, most notably in the ability to implement
procedures
efficiently.
7.1 Using stack
We will begin working with the stack by using it. For this
section it is
not important just how it works psysically, we only state that
there is –
253. somewhere in memory – a data structure, implemented in
hardware, that
parallels the stack software data structure. Namely, we can push
things
on the stack, we can pop them back, and the operations work in
the LIFO
fashion: Last In, First Out.
Let us begin with the syntax:
PUSH w/d m/r (PUSH word or doubleword)
POP w/d m/r (POP word or doubleword)
Notice that byte operations are not supported. Further, notice
that push-
159
ing or poping memory actually results in a memory-to-memory
operation!
These stack operations are one of the exceptions to the general
no memory-