Thursday, June 2, 2011

Complex X86 instructions

Complex instructions
Avoid complex instructions (lods, stos, movs, cmps, scas, loop, xadd, enter, leave). Complex instructions are instructions that do multiple things. For instance, stosb writes a byte to memory and also increments EDI. Intel stopped making these fast with the original Pentium, when the design moved in a more RISC-like direction. The one exception is REP combined with the string instructions, which is still fast.
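For example, roughly (a minimal sketch; assumes the direction flag is clear):

    ; instead of a lone STOSB (writes AL to [EDI] and advances EDI)...
    stosb

    ; ...use the simple-instruction equivalent:
    mov     [edi],al
    add     edi,1           ; or INC EDI; see the INC/DEC note below

    ; the exception: REP + string instructions are still fast,
    ; e.g. filling ECX dwords with EAX:
    rep     stosd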

Don't use INC/DEC on P4
On the P4 use ADD/SUB in place of INC/DEC. They are generally faster: ADD/SUB runs in 0.5 cycles, while INC/DEC takes 1 cycle (INC/DEC only update part of the flags register, which creates a false dependency).
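In other words:

    add     ecx,1           ; prefer these on the P4...
    sub     edx,1
    ; inc   ecx             ; ...over INC/DEC, which update only part of EFLAGS
    ; dec   edx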

Rotating
Avoid rotating by a register, or by an immediate value other than 1.

Eliminate unnecessary compare instructions
Eliminate unnecessary compare instructions by doing the appropriate conditional jump instruction based on the flags that are already set from a previous arithmetic instruction.

        dec     ecx
        cmp     ecx,0
        jnz     loop_again

;gets changed to
        dec     ecx
        jnz     loop_again
LEA is still really cool, except on the P4, where it tends to be slow.
You can perform multiple math operations in one instruction, and it does not affect the flags register, so you can put it between an instruction that modifies a register and the conditional jump that tests the resulting flags on the next line.

top_of_loop:

    dec   eax
    lea   edx,[edx*4+3]     ; multiply by 4 and add 3. Does not affect flags
    jnz   top_of_loop       ; so the next instruction doesn't get hosed.
ADC and SBB.
Most compilers don't really make good use of ADC and SBB, so you can get good speed-ups with them, for example when adding two 64-bit numbers or other big numbers together. Keep in mind that on the P4, ADC and SBB are slow. As a workaround you can use PADDQ and do the addition in an MMX register. So the second optimization suggestion here is to use MMX to do the adding or subtracting. You just have to have a processor that supports PADDQ (it was introduced with SSE2, which every P4 has).

    add    eax,[fred]       ; 64-bit add of [fred] to EDX:EAX, low half...
    adc    edx,[fred+4]     ; ...then high half plus the carry

    ; the MMX equivalent keeps the running 64-bit value in MM0 instead of EDX:EAX

    movq   mm1,[fred]       ; Get the 64-bit value at fred into MM1
    paddq  mm0,mm1          ; Add it to MM0, no ADC needed. This is an unoptimized
                            ; way to do it. You would really pre-read MM1 a loop
                            ; in advance. I did it this way for ease of understanding.
ROL, ROR, RCL, RCR, and BSWAP.
It is a cool trick to switch from Big Endian to Little Endian using BSWAP. You can also use it for temporary storage of a 16-bit or 8-bit value in the upper half of the register. Likewise you can use ROL and ROR for storing 8-bit and 16-bit values. It's a way to get more "registers". If all you are dealing with are 16-bit values, you can turn your 8 32-bit registers into 16 16-bit registers, which gives you a lot more registers to use. RCL and RCR can also easily be used for counting the number of bits that are set in a register (see the sketch after the example below). Keep in mind that ROL, ROR, RCL, RCR and BSWAP are all slow on the P4. The rotate instructions are about twice as fast as BSWAP, so if you have to use one or the other on the P4, use the rotates.

    xor   edx,edx           ; set both 16-bit registers to 0
    mov   dx,234            ; set the first 16-bit register to 234
    bswap edx               ; swap it so the second one is ready
    mov   dx,345            ; set the second 16-bit register to 345
    bswap edx               ; swap to the first one
    add   dx,5              ; add 5 to the first one
    bswap edx               ; swap to the second one
    add   dx,7              ; add 7 to it
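And here is a rough sketch of the bit-counting trick with RCR and ADC (EAX ends up destroyed):

    ; count the set bits in EAX, result in ECX
    xor   ecx,ecx           ; count = 0 (also clears CF)
    mov   edx,32            ; 32 bits to examine
count_loop:
    rcr   eax,1             ; rotate the low bit into the carry flag
    adc   ecx,0             ; add the carry (0 or 1) to the count
    dec   edx               ; DEC doesn't touch CF
    jnz   count_loop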
String instructions.
Most compilers don't make good use of the string instructions (scas, cmps, stos, movs, and lods), so checking whether they are faster than some library routine can be a win. For instance, I was really surprised when I looked at strlen() in VC++: in the radix40 code it ran in 416 cycles for a 100-byte string!!! I thought that was absurdly slow.
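As a rough sketch of the string-instruction approach (not the radix40 code, just an illustration; assumes the string pointer is in ESI and the direction flag is clear), a strlen-style scan looks like this:

    mov   edi,esi           ; EDI = start of the zero-terminated string
    xor   al,al             ; scan for the terminating 0 byte
    mov   ecx,-1            ; maximum count
    repne scasb             ; advance EDI until [EDI] == AL
    not   ecx
    dec   ecx               ; ECX = string length, not counting the 0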

Multiply to divide.
If you need to divide a 32-bit number by a known value, you can instead multiply by a precomputed reciprocal (roughly 2^32 divided by the divisor) and take the top 32-bit half of the 64-bit product as the quotient. This is faster because multiplication is much faster than division. (Thanks to pdixon for the tip.)
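For example, an unsigned divide by 10 looks roughly like this (the standard reciprocal sequence; verify the constant or generate it with a tool like the one mentioned in the next item):

    ; EAX = dividend; the quotient (EAX / 10) ends up in EDX
    mov   edx,0CCCCCCCDh    ; ceil(2^35 / 10), the precomputed reciprocal
    mul   edx               ; EDX:EAX = dividend * reciprocal
    shr   edx,3             ; EDX = dividend / 10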

Dividing by a constant.
There's some nice information on how to divide by a constant in Agner Fog's pentopt.pdf document. I wrote a program that you pass the number you want to divide by, and it prints out the assembler code sequence. I will dig it up later and post it. Here is a link to Agner's document. Agner's Pentopt PDF (http://www.agner.org/assem/)

Unrolling.
This is a guideline. Unrolling falls in the General Optimization category, but I wanted to add a footnote. I always set up my unrolling with a macro that unrolls by an EQUATE value, which makes it easy to try different unroll amounts and see which is fastest. You want the unrolled loop to fit in the L1 code cache (or trace cache).

UNROLL_AMT       equ   16   ; # of times to unroll the loop
UNROLL_NUM_BYTES equ    4   ; # of bytes handled in 1 loop iteration

        mov     ecx,1024
looper:
offset2 = 0
REPEAT UNROLL_AMT
        add     eax,[edi+offset2]
offset2 = offset2 + UNROLL_NUM_BYTES
ENDM
        add     edi,UNROLL_AMT * UNROLL_NUM_BYTES   ; we dealt with 16*4 bytes.
        sub     ecx,UNROLL_AMT  ; subtract from loop counter the # of loops we unrolled.
        jnz     looper
MOVZX.
Use MOVZX to avoid partial register stalls. I use MOVZX a lot. A lot of people XOR the full 32-bit register first and then load the low byte or word, but MOVZX does the equivalent thing without needing the extra XOR instruction. You also had to do the XOR far enough in advance to give it time to complete; with MOVZX you don't have to worry about that.
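For example:

    ; the XOR approach: zero the full register, then load the low byte
    xor   eax,eax
    mov   al,[esi]

    ; the MOVZX approach: one instruction, no partial register access
    movzx eax,byte ptr [esi]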

Using MOVZX to avoid a SHIFT and AND instruction
I ran across this bit of C code I was trying to speed up using assembler. the_array is a dword array. The code extracts a different byte from a dword in the array depending on which pass this is over the data. "Pass" is a variable that goes from 0 to 3, one value for each byte in a particular dword.

        unsigned char c = ((the_array[i])>>(Pass<<3)) & 0xFF;

; I got rid of the "Pass" variable by unrolling the loop 4 times.
; So I had 4 of these, each one separated by lots of C code.
        unsigned char c = ((the_array[i])>>0) & 0xFF;
        unsigned char c = ((the_array[i])>>8) & 0xFF;
        unsigned char c = ((the_array[i])>>16) & 0xFF;
        unsigned char c = ((the_array[i])>>24) & 0xFF;
What if I could get rid of the SHIFT and the AND using assembler? That would save me 2 instructions, not to mention the fact that the P4 is very slow at SHIFT instructions (4 cycles!!!), so try to avoid shifts where possible. Taking just the second-to-last line, the one that shifts right by 16, as our example:
; esi points to the_array

        mov     eax,[esi]
        shr     eax,16
        and     eax,0FFh

; So how do we change that to get rid of the AND and SHR?
; We do a MOVZX with the 3rd byte in the dword.

        movzx   eax,byte ptr [esi+2]            ;unsigned char c = ((the_array[i])>>16) & 0xFF;

Align, align, align.
It is really important to align both your code and data to get a good speed-up. I generally align code on 4-byte boundaries. For data I align 2-byte data on 2-byte boundaries, 4-byte data on 4-byte boundaries, 8-byte data on 8-byte boundaries, and 16-byte data on 16-byte boundaries. In general, if you don't align your SSE or SSE2 data on a 16-byte boundary you will get an exception. You can align your data in VC++ if you have the processor pack; they added support for both static data and dynamic memory. For static data you use __declspec(align(4)), which aligns on a 4-byte boundary.
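Roughly like this (a sketch; the data name is made up, and the enclosing segment must itself be aligned at least that much):

    ; MASM: force the next item onto a 16-byte boundary for SSE/SSE2 loads
    align 16
my_vector   real4   1.0, 2.0, 3.0, 4.0

    ; VC++ static data, the equivalent idea in C:
    ; __declspec(align(16)) float my_vector[4];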

BSR for powers of 2.
You can use BSR to find the highest power of 2 that fits into a value: it returns the bit index of the most significant set bit, i.e. floor(log2(x)).
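For example (the label is just illustrative):

    ; EAX = floor(log2(ECX)), the exponent of the highest power of 2 <= ECX
    bsr   eax,ecx           ; bit index of the most significant set bit
    jz    value_was_zero    ; ZF is set (and EAX undefined) if ECX was 0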

XORing a register with itself to zero it.
This is an oldie, but I am including it anyway. It also has the side benefit of clearing dependencies on the register, which is why you will sometimes see people XOR a register with itself before doing a partial register access. I prefer MOVZX to doing it that way because getting the XOR right is trickier (read my comments about MOVZX above). On the P4 they also added support for PXOR to break dependencies in the same fashion; I think the P3 does the same thing.
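For example:

    xor   eax,eax           ; zero EAX and break the dependency on its old value
    pxor  mm0,mm0           ; same dependency-breaking idiom for an MMX register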

Use XOR and DIV.
If you know your data can be treated as unsigned for a division, use XOR EDX,EDX followed by DIV. It's faster than CDQ and IDIV.
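For example (assuming the dividend is in EAX and the divisor in ECX):

    xor   edx,edx           ; zero the high half for an unsigned divide
    div   ecx               ; EAX = EDX:EAX / ECX, remainder in EDX

    ; versus the signed version it replaces:
    ; cdq                   ; sign-extend EAX into EDX
    ; idiv  ecx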

Try to avoid obvious dependencies.
If you modify a register and then compare it to some value on the very next line, instead try to put some other, independent instruction in between. A dependency is any time you modify a register and then read or write it shortly afterwards.

   inc edi
   inc eax
   cmp eax,1    ; this line has a dependency with the previous line, so it will stall.
   jz  fred

; by shuffling the instructions around we can break up the dependency.
   inc eax
   inc edi
   cmp eax,1
   jz  fred
Instructions to avoid on P4.
On P4's, try to avoid the following instructions: ADC, SBB, the rotate instructions, the shift instructions, INC, DEC, LEA, and any instruction taking more than 4 uops. How do you know the processor running the code is a P4? CPUID.
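A rough sketch of the CPUID check (family 0Fh is the NetBurst/P4 family; the label is illustrative):

    mov   eax,1
    cpuid                   ; EAX[11:8] = family ID (EBX, ECX, EDX also clobbered)
    shr   eax,8
    and   eax,0Fh
    cmp   eax,0Fh           ; family 0Fh means NetBurst (Pentium 4)
    je    its_a_p4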

Using lookup tables.
On the P4 you can sometimes get around the long-latency instructions listed above by using lookup tables. Thankfully, P4 systems come with really fast memory, so a table lookup doesn't hurt performance as much even when the table isn't in the cache.
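As a rough sketch (the table name is made up and only covers shift counts 0-7):

shift_table dd 1,2,4,8,16,32,64,128     ; precomputed 2^0 .. 2^7

    ; instead of:  mov eax,1  /  shl eax,cl   (variable shift, slow on the P4)
    mov   eax,[shift_table+ecx*4]       ; table lookup, assumes ECX is 0..7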

Use pointers instead of calculating indexes.
A lot of times in loops in C there will be multiplications by non-power-of-2 numbers. You can easily get around this by adding instead. Here is an example that uses a structure.

typedef struct fred
{
   int fred;
   char bif;
} freddy_type;

freddy_type charmin[80];
The size of freddy_type is 5 bytes (assuming the structure is packed). If you access the elements in a loop, the compiler will generate code that multiplies by 5 for each array access!!!! (Ewwwwwwwwwwwww). So how do we do it properly?

for ( int t = 0; t < 80; t++)
{
   charmin[t].fred = rand(); // the compiler multiplies by 5 to get the offset, EWWWWWWWW!
   charmin[t].bif = (char)(rand() % 256);
}

; in assembler we start with an offset of 0, that points to the first data item.
; And then we add 5 to it each loop iteration to avoid the MUL.

   mov   esi,offset charmin
   mov   ecx,80
fred_loop:
   ;... perform operations on the FRED and BIF elements in freddy_type
   add   esi,5                     ;make it point to the next structure entry.
   dec   ecx
   jnz   fred_loop
The MUL removal applies to loops as well. I have seen people do multiplies in loops as part of incrementing a variable or computing the terminating condition. Try doing addition instead.

Conform to default branch predictions.
Try to set up your code so that backward conditional jumps are usually taken and forward conditional jumps are almost never taken. This has to do with branch prediction: the static branch predictor uses that simple rule to guess whether a conditional jump is taken. So structure a loop with a backward conditional jump at the end, and make any special exit from that loop a forward jump that is only taken on a condition that rarely occurs.
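Roughly like this (the labels and the rare condition are just illustrative):

my_loop:
    ; ... loop body ...
    cmp   eax,0             ; rare early-exit condition
    je    rare_exit         ; forward jump: statically predicted not taken
    dec   ecx
    jnz   my_loop           ; backward jump: statically predicted taken
rare_exit: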

Eliminate branches
Eliminate branches where possible. This might seem obvious, but I have seen some people use too many branches in their assembler code. Keep it simple. Use as few branches as possible.

Using CMOVcc to remove branches
I have yet to see the CMOVcc instructions actually be faster than a conditional jump, so I recommend conditional jumps over CMOVcc. CMOVcc might win when your jumps aren't easily guessable by the branch prediction logic, so if that is your case, benchmark it and see.
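For benchmarking, the two forms look like this (computing the minimum of EAX and EBX; CMOVcc needs a P6 or later):

    ; branch version
    cmp   eax,ebx
    jle   keep_eax
    mov   eax,ebx
keep_eax:

    ; branchless version with CMOVcc
    cmp   eax,ebx
    cmovg eax,ebx           ; if EAX > EBX, EAX = EBX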

Local vs. Global variables
Prefer local variables in a procedure over global variables. Locals live on the stack (or in registers), and the stack is almost always in the cache, so you'll get fewer cache misses.