This is a work-in-progress note pad of all the things I've found about gcc to make the best code possible.
Ever.
Generally, most other PC libraries are okay, including SDL and OpenGL, but ixemul will take awesome-performing code and make it run like it's wading through a tar pit -- especially on any stdio file operations. When porting "small and dirty" POSIX applications where performance doesn't matter, then who cares, use ixemul. For everything else, don't. Don't even use libnix. Wherever possible, use AmigaOS native functions like AllocVec() over C standard functions like malloc().
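A minimal sketch of the AllocVec() route. The make_buffer() helper is mine, for illustration; the non-Amiga branch stubs out AllocVec/FreeVec with malloc/calloc purely so the sketch compiles and can be tested off-Amiga -- on AmigaOS the real exec.library calls are used:

```c
#include <stdlib.h>

#ifdef __amigaos__
#include <exec/memory.h>
#include <proto/exec.h>
#else
/* Portable stand-ins so this sketch builds on any host; these are NOT
 * the real exec.library implementations. */
typedef void *APTR;
typedef unsigned long ULONG;
#define MEMF_CLEAR (1UL << 16)
static APTR AllocVec(ULONG byteSize, ULONG requirements)
{
    return (requirements & MEMF_CLEAR) ? calloc(1, byteSize)
                                       : malloc(byteSize);
}
static void FreeVec(APTR memoryBlock) { free(memoryBlock); }
#endif

/* AllocVec remembers the allocation size, so FreeVec needs no length
 * argument -- unlike the older AllocMem/FreeMem pair. */
APTR make_buffer(ULONG size)
{
    return AllocVec(size, MEMF_CLEAR); /* zeroed memory */
}
```

Pass the buffer to FreeVec() when done; no size bookkeeping is needed on the caller's side.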
Do not use regparam, as it doesn't work well with linker libraries (especially libnix or ixemul). The best option is to explicitly use register variables in your function prototypes. For example:
int sum(int *array __asm("a0"), short size __asm("d1"));
This avoids using the stack (in most cases) and can, again in most cases, significantly decrease the size of the code. Small stub functions are then more likely to get inlined efficiently. The general exception is when the function grows too large and needs more registers than the 68000 has, in which case the stack still has to be used.
This one is tricky, as GCC does not like emitting the decrement-and-branch instruction. To ensure that it does, you need to do three things: use a 16-bit short (signed or unsigned doesn't matter); decrement your counter by one before the loop; and then compare the exit term to -1 (signed) or 0xFFFFu (unsigned). For example:
int sum(int *array __asm("a0"), unsigned short size __asm("d1")) {
    int sum = 0;
    size -= 1;
    do
        sum += *array++;
    while (--size != 0xffffu);
    return sum;
}
The inner loop here is now two instructions (add.l and dbra). If we were to rewrite this as a more traditional C/C++ loop, it would be three or sometimes four instructions (gcc may at times add a superfluous cmp.l instruction).
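For reference, the same routine with the Amiga-specific register annotations removed compiles on any GCC and makes the pattern easy to verify off-target; the behaviour (including the size >= 1 requirement) is unchanged:

```c
/* Count-down loop shaped so GCC can emit dbra on the 68000.
 * Requires size >= 1, as in the annotated original. */
int sum_down(int *array, unsigned short size)
{
    int sum = 0;
    size -= 1;                  /* prime the counter for dbra */
    do
        sum += *array++;
    while (--size != 0xffffu);  /* exit when the counter wraps */
    return sum;
}
```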
While it's tempting to always compile for speed (-O3), this will often produce code that places more stress on the instruction cache. The 680x0 processors do not have the large caches that optimizing for speed depends on, so it's generally better to produce smaller, more cacheable code (-Os).
As a general rule, this also applies to how you write your algorithms -- on the 680x0 it is generally better to perform simple arithmetic instead of using huge lookup tables, for the same reason. Obviously, if the arithmetic is complex, or uses a lot of floating point (which is expensive on ANY 680x0 processor), then tables may still be the better option, but don't just assume that a table will be faster than the math.
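To make that concrete, here's a hypothetical case (the function name is mine): scaling a value by 5 is a shift and an add, which is cheaper than dragging a lookup table through the 680x0's small caches.

```c
#include <stdint.h>

/* Multiply by 5 as (x << 2) + x: two quick ALU instructions and no
 * memory traffic, versus a 256-entry table that competes with your
 * code and data for the 680x0's tiny caches. */
static uint32_t times5(uint32_t x)
{
    return (x << 2) + x;
}
```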
Compiling for the 68000 but tuning for the 060 (-mcpu=68000 -mtune=68060) will net the best performance under all circumstances. While some optimizations will have little to no impact on the 68000 itself, this ensures that the code takes advantage of features of the later 680x0 processors. Most notably, tuning for the 060 will try to REORDER instructions, which has no impact on the 68000 but greatly benefits users of the superscalar processors!
Omitting the frame pointer (-fomit-frame-pointer) frees up a valuable address register and helps take pressure off the stack.
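Putting the flags from the notes above together, an illustrative compile line might look like this -- the compiler name m68k-amigaos-gcc and the -mcpu/-mtune spellings assume a modern m68k GCC port (older 2.9x-era ports spell the CPU selection -m68000 and lack -mtune):

```sh
# Size-optimized, frame pointer omitted, 68000 code scheduled for the 060.
m68k-amigaos-gcc -Os -fomit-frame-pointer -mcpu=68000 -mtune=68060 -c main.c
```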
In some cases the compiler can infer bit-twiddling when using masks and shifts, but in most cases you'll generate better code more reliably by using a bitfield structure and accessing elements that way. While the bitfield instructions are not especially fast on the 68020/030, they're single-cycle operations on the 68040/060 (and the Vampire).
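A sketch of the bitfield-struct approach -- the struct layout and field names here are invented for illustration, not a real hardware register:

```c
/* A sprite-control word described as a bitfield struct.  On the
 * 68020 and up, the compiler can reach these fields with the
 * bitfield instructions instead of hand-rolled masks and shifts. */
struct sprite_ctrl {
    unsigned int hstart : 9;   /* horizontal start position */
    unsigned int vstart : 9;   /* vertical start position */
    unsigned int attach : 1;   /* attached-sprite flag */
};
```

Reading or writing `s.hstart` then compiles to a single field access rather than a mask-and-shift sequence you'd otherwise write by hand.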
On the 68000, only a 16-bit x 16-bit -> 32-bit multiply is supported in hardware; everything else triggers a call to a library routine that performs the 32-bit multiplication the slow way. GCC should use MULU/MULS on the 68000 when both multiplicands are shorts of the same signedness (e.g., both are shorts or both are unsigned shorts, but not one of each). The same applies to division -- if you perform a 32-bit / 16-bit -> 16-bit division, you'll also get the DIVU/DIVS instruction instead of a library call.
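A minimal sketch of both patterns; the function names are mine, and whether the single-instruction form is actually emitted depends on your GCC version and flags:

```c
#include <stdint.h>

static uint32_t mul16x16(uint16_t a, uint16_t b)
{
    /* Both operands are 16-bit and the result is 32-bit: eligible
     * for a single MULU on the 68000 instead of a library call. */
    return (uint32_t)a * b;
}

static uint16_t div32by16(uint32_t n, uint16_t d)
{
    /* 32-bit / 16-bit -> 16-bit quotient: maps onto a single DIVU.
     * The quotient must fit in 16 bits, or the 68000 sets the
     * overflow flag. */
    return (uint16_t)(n / d);
}
```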