Complex instructions
Avoid complex instructions ( lods, stos, movs, cmps, scas,
loop, xadd, enter, leave). Complex instructions are instructions that do
multiple things. For instance stosb writes a byte to memory and also increments
EDI. They stopped making these fast with the original Pentium because they were
trying to make it more RISC like. Using REP and the string instructions is
still fast. That is the only exception to the case.
Don't use INC/DEC on P4
On the P4 use ADD/SUB in place of INC/DEC. Generally it is
faster. ADD/SUB runs in 0.5 cycles. INC/DEC takes 1 cycle.
Rotating
Avoid rotate by a register or rotate by an immediate value
of anything but a 1.
Eliminate unnecessary compare instructions
Eliminate unnecessary compare instructions by doing the
appropriate conditional jump instruction based on the flags that are already
set from a previous arithmetic instruction.
dec ecx
cmp ecx,0
jnz loop_again
;gets changed to
dec ecx
jnz loop_again
LEA is still really cool, except for on the P4 it tends to
be slow.
You can perform multiple math operations all in one
instruction and it does not affect the flags register so you can put in in
between one register being modified and a flags comparison jump on the next
line.
top_of_loop:
dec eax
lea edx,[edx*4+3] ; multiply by 4 and add 3. Does not affect
flags
jnz top_of_loop ; so the next instruction doesn't get
hosed.
ADC and SBB.
Most compilers don't really make good use of ADC and SBB.
You can get good speeds ups with that. Adding 2 64-bit numbers together, or
adding big numbers together. Keep in mind that on the P4 ADC and SBB are slow.
As a work around you can use "addq", and use MMX to do this. So the
second optimization suggestion for this is to use MMX to do the adding or
subtracting. You just have to have a processor that supports MMX.
add eax,[fred]
adc edx,[fred+4]
; the above 2
statements are the same as the below 3 statements
movd mm0,[fred] ; Get 32-bit value in MM0
movd mm1,[fred+4] ; Get 32-bit value in MM1
paddq mm0,mm1 ; This is an unoptimized way to do
it. You would
; really pre-read MM0 and MM1 a loop in advance.
; I did it this way for ease of understanding.
ROL, ROR, RCL, and RCR and BSWAP.
It is a cool trick to switch from Big Endian to Little
Endian using BSWAP. Also you can use it for temporary storage of a 16-bit or
8-bit value in the upper half of the register. Likewise you can use ROL and ROR
for storing 8-bit and 16-bit values. It's a way to get more
"registers". If all you are dealing with are 16-bit values, you can
turn your 8 32-bit registers into 16 16-bit registers. Which gives you a lot
more registers to use. RCL and RCR can also easily be used for counting the
number of bits that are set in a register. Keep in mind that ROL, ROR, RCL, RCR
and BSWAP are all slow on the P4. The rotate instructions are about twice as
fast as BSWAP. So if you have to use one or the other on the P4 use the rotate
ones.
xor edx,edx ; set both 16-bit registers to 0
mov dx,234 ; set the first 16-bit register to
234
bswap edx ; swap it so the second one is
ready
mov dx,345 ; set the second 16-bit register to
345
bswap edx ; swap to the first one
add dx,5 ; add 5 to the first one
bswap edx ; swap to the second one
add dx,7 ; add 7 to it
String instructions.
Most compilers don't make good use of the string
instructions ( scas, cmps, stos, movs, and lods). So checking to see if that is
faster than some library routine can be a win. For instance I was really
surprised when I looked at strlen() in VC++. In the radix40 code it ran in 416
cycles for a 100 byte string!!! I thought that was absurdly slow.
Multiply to divide.
If you have a full 32-bit number and you need to divide, you
can simply do a multiply and take the top 32-bit half as the result. This is
faster because multiplication is faster than division. ( thanks to pdixon for
the tip).
Dividing by a constant.
There's some nice information now how to divide by a
constant in Agner Fog's pentopt.pdf document. I wrote a program that you pass
in the number you want to divide by, and it will print out the assembler code
sequence. I will dig it up later and post it. Here is a link to Agner's
document. Agner's Pentopt PDF (http://www.agner.org/assem/)
Unrolling.
This is a guideline. Unrolling falls in the General
Optimiation category but I wanted to add a footnote. I always set up my
Unrolling with a macro that unrolls an EQUATE value amount. That way you can
try different values and see which is best easily. You want the unrolling to
fit in the L1 code cache ( or trace cache). Using an equate makes it easy to
try different unroll amounts to find the fastest one.
UNROLL_AMT equ 16 ;
# of times to unroll the loop
UNROLL_NUM_BYTES equ
4 ; # of bytes handled in 1 loop
iteration
mov ecx,1024
looper:
offset2 = 0
REPEAT UNROLL_AMT
add eax,[edi+offset2]
offset2 = offset2 + UNROLL_NUM_BYTES
add edi,UNROLL_AMT * UNROLL_NUM_BYTES ; we dealt with 16*4 bytes.
sub ecx,UNROLL_AMT ; subtract from loop counter the # of loops
we unrolled.
jnz looper
MOVZX.
Use MOVZX to avoid partial register stalls. I use MOVZX a
lot. A lot of people XOR the full 32-bit register first. But MOVZX does the
equivalent thing without having to have an extra XOR instruction. Also you had
to do the XOR enough in advance to give it time to complete. With MOVZX you
don't have to worry about that.
Using MOVZX to avoid a SHIFT and AND instruction
I ran across this bit of C code I was trying to speed up
using assembler. The_array is a dword array. The code is trying to get a
different byte from a dword in the array passed upon which pass this is over
the data. "Pass" is a variable that goes from 0 to 3 for each byte in
a particular dword.
unsigned char
c = ((the_array[i])>>(Pass<<3)) & 0xFF;
; I got rid of the "pass" variable by unrolling
the loop 4 times.
; So I had 4 of these each one seperated by lots of C code.
unsigned char
c = (the_array[i])>>0) & 0xFF;
unsigned char
c = (the_array[i])>>8) & 0xFF;
unsigned char
c = (the_array[i])>>16) & 0xFF;
unsigned char
c = (the_array[i])>>24) & 0xFF;
What if I can get rid of the SHIFT and the AND using
assembler? That would save me 2 instructions. Not to mention the fact that the
P4 is very slow when doing SHIFT instructions ( 4 cycles!!!). So try to avoid
shifts where possible. SO taking just the second to last line that shifts right
16 as our example
; esi points to the_array
mov eax,[esi]
shr eax,16
and eax,0FFh
; So how do we change that to get rid of the AND and SHR?
; We do a MOVZx with the 3rd byte in the dword.
movzx eax,byte ptr [esi+2] ;unsigned char c =
(the_array[i])>>16) & 0xFF;
Align, align, align.
It is really important to align both your code and data to
get a good speed up. I generally align code on 4 byte boundaries. For data I
align 2 byte data on 2 byte boundaries, 4 byte data on 4 byte boundaries, 8
byte data on 8 byte boundaries, 16 byte data on 16 byte boundaries. In general
if you don't align your SSE or SSE2 data on a 16-byte boundary you will get an
exception. You can align your data in VC++ if you have the processor pack. They
added support for both static data and dynamic memory. For static data you use
__declspec(align(4)) - alignes on a 4 byte boundary.
BSR for powers of 2.
You can use BSR to count the highest power of 2 that goes
into a variable.
XORing a register with itself to zero it.
This is an oldie, but I am including it anyway. It also has
a side benefit of clearing dependencies on the register. That is why sometimes
you will see people use XOR in that fashion, before doing a partial register
access. I prefer using MOVZX to doing it that way because it is trickier to do
using a XOR ( read my above comments about in #12 above talking about MOVZX) .
On the P4 they also added support for PXOR to break dependencies in that
fashion. I think the P3 does the same thing.
Use XOR and DIV.
If you know your data can be unsigned for a DIVISION, use
XOR EDX, EDX, then DIV. It's faster than CDQ and IDIV.
Try to avoid obvious dependencies.
If you modify a register and then compare it to some value
on the very next line, instead try and put some other register modification in
between. Dependencies are any time you modify a register and then read it or
write it shortly afterwards.
inc edi
inc eax
cmp eax,1 ; this line has a dependency with the
previous line, so it will stall.
jz fred
;shuffling the instructions around we can help break up
dependencies.
inc eax
inc edi
cmp eax,1
jz fred
Instructions to avoid on P4.
On P4's try to avoid the following instructions, adc, sbb,
rotate instructions, shift instructions, inc, dec, lea, and any instruction
taking more than 4 uops. How do you know the processor running the code is a
P4? CPUID.
Using lookup tables.
On the P4 sometimes you can get around the long latency
instructions that I listed previously by doing lookup tables. Thankfully on
P4's they come with really fast memory. So having to do a lookup table doesn't
hurt performance as much if it isn't in the cache.
Use pointers instead of calculating indexes.
A lot of times in loops in C there will be multiplications
by non-powers of 2 numbers. You can easily get around this by adding instead.
Here is an example that uses a structure.
typedef struct fred
{
int fred;
char bif;
} freddy_type;
freddy_type charmin[80];
The size of freddy_type is 5 bytes. If you try and access
them in a loop the compiler will generate code for multipling by 5 for each
array access!!!! (Ewwwwwwwwwwwww). So how do we do it properly?
for ( int t = 0; t < 80; t++)
{
charmin[t].fred =
rand(); // the compiler multiplies by 5 to get the offset, EWWWWWWWW!
charmin[t].bif =
(char)(rand() % 256);
}
; in assembler we start with an offset of 0, that points to
the first data item.
; And then we add 5 to it each loop iteration to avoid the
MUL.
mov esi,offset charmin
mov ecx,80
fred_loop:
;... perform
operations on the FRED and BIF elements in freddy_type
add esi,5 ;make it point to the next
structure entry.
dec ecx
jnz fred_loop
The MUL removal applies to loops as well. I have seen people
do multiplies in loops as part of incrementing the variable or for terminating
condition. So try doing addition instead.
Conform to default branch predictions.
Try to set up your code such that backward conditional jumps
are usually taken, and forward conditional loops are almost never taken. That
has to do with branch prediction. The static branch predictor uses that simple
rule to guess if a conditional jump is taken or not. So have a loop that has a
backwards conditional jump at the end. And then have special exit conditions
from that same loop that executes a forward jump that only exits on a certain
condition that doesn't often occur.
Eliminate branches
Eliminate branch where possible. This might seem obvious,
but I have seen some people use too many branches in their assembler code. Keep
it simple. Use as few branches as possible.
Using CMOVcc to remove branches
I have yet to see the CMOVcc instructions actually be faster
than a conditional jump. So I recommend using conditional jumps over CMOVcc. It
might be faster in the case where your jumps aren't easily guessable by the
branch prediction logic. So if that is the case with you, benchmark it and see.
Local vs. Global variables
Use local variables for a procedure over using a global
variable. If you use local variables you'll get less cache misses.