@questor
Last active December 17, 2020 18:20
I want to show how you can optimize the rendering functions without making changes on the upper level (like removing round
stuff and so on), and I also want to teach the ideas behind the optimizations I'm proposing. I have no esp32 board to test my
changes, so I will try to explain the ideas and let others test the optimizations. And I don't want to write a fast software
texture mapper (again) without explaining why it's much faster than the current source code.
1.) optimizing color conversion
this should be done automatically by your compiler (hopefully, has to be checked at assembly level!), but some background:
multiplications and divisions by a power of 2 can be replaced by bit-shifts. 255 is not a power of 2, but a division by 255 can
be approximated by a division by 256, which is a bit-shift of 8 bits to the right (the result can be one lower than the exact
value, for example c=255 gives 30 instead of 31). so looking at the first color conversion routine:
return (((c * 31) / 255) << 11) |
(((c * 63) / 255) << 5) |
((c * 31) / 255);
is basically: (((c*31)>>8)<<11) for the first term, and some bit-shifts can be removed because you can also do
((c*31)<<3)&0xf800 (the and-operation is needed to clear the low bits that were originally cleared by the right shift). next
thing (should also be done by your compiler): on small cpus multiplications are expensive and can be replaced with additions and
shifts, for example a*31 can also be written as ((a<<5)-a), which could be faster depending on the timing of your multiply
instruction (a<<5 calculates a*32, and subtracting a once gives the correct result).
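A small sketch to make the shift tricks above checkable on a PC before looking at the esp32 assembly. All function names here are made up for illustration and are not part of softraster.h. It compares the exact divide with the shift approximation, the two ways to build the top 5-bit field, and the shift-and-subtract multiply:

```c
#include <assert.h>
#include <stdint.h>

/* exact 8-bit -> 5-bit scaling, as in the routine quoted above */
uint16_t scale5_exact(uint8_t c) { return (uint16_t)((c * 31) / 255); }

/* approximation: divide by 256 with a shift (result is exact or one too low) */
uint16_t scale5_shift(uint8_t c) { return (uint16_t)((c * 31) >> 8); }

/* a*31 written as a*32 - a */
uint32_t mul31(uint32_t a) { return (a << 5) - a; }

/* top 5-bit field of a 16-bit color: shift right, then shift left ... */
uint16_t field_two_shifts(uint8_t c) { return (uint16_t)(((c * 31) >> 8) << 11); }

/* ... or one shift plus a mask that clears the low 11 bits */
uint16_t field_masked(uint8_t c) { return (uint16_t)(((c * 31) << 3) & 0xF800); }

/* self-check over all 256 input values */
void check_color_helpers(void) {
    for (int c = 0; c < 256; ++c) {
        int diff = scale5_exact((uint8_t)c) - scale5_shift((uint8_t)c);
        assert(diff == 0 || diff == 1);
        assert(mul31((uint32_t)c) == (uint32_t)c * 31u);
        assert(field_two_shifts((uint8_t)c) == field_masked((uint8_t)c));
    }
}
```

If the one-off error of the shift version matters for you (white no longer maps to the maximum value), keep the exact divide and just check what your compiler generates for it.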
2.) optimize your textures
You currently make a separate allocation for every line in your texture (an extra column-memory each), which is overkill and
hinders other optimizations (https://github.com/LAK132/ImDuino/blob/master/softraster.h#L176). This should be changed to allocate
one big block for the complete texture (something like buffer = malloc(x*y*sizeof(one_pixel))). Allocating columns instead of
rows is especially bad for your cache (if the esp32 has a cache ;) because you will most likely thrash your cache on EVERY pixel
access. and if you have a contiguous block of pixel data you can step from line to line with a fixed offset (usually the
line-width) instead of jumping through fragmented memory (with memory-allocation-information in between).
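A minimal sketch of the one-big-block idea, assuming 8-bit pixels stored row by row. The Texture8 type and the texture8_* functions are hypothetical names for this example, not the types from softraster.h:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* hypothetical texture type: one contiguous block, row-major */
typedef struct {
    int w, h;
    uint8_t *pixels; /* w*h pixels, row y starts at pixels + y*w */
} Texture8;

/* one allocation for the whole texture; returns 1 on success, 0 on failure */
int texture8_init(Texture8 *t, int w, int h) {
    assert(w > 0 && h > 0);
    t->pixels = malloc((size_t)w * (size_t)h * sizeof *t->pixels);
    if (!t->pixels) return 0;
    t->w = w;
    t->h = h;
    return 1;
}

void texture8_free(Texture8 *t) {
    free(t->pixels);
    t->pixels = NULL;
    t->w = t->h = 0;
}

/* start of row y; the next row is always exactly t->w pixels further */
uint8_t *texture8_row(const Texture8 *t, int y) {
    assert(y >= 0 && y < t->h);
    return t->pixels + (size_t)y * (size_t)t->w;
}
```

With this layout neighbouring pixels of a row are neighbours in memory, and going one line down is a single addition of the width.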
3.) optimize renderPixel
by removing the function. if you use a renderPixel function to draw every single pixel, a lot of calculations are done per
pixel that are only needed one time, or one time per horizontal line (horizontal to please your cache). A little example:
for(int y=0; y<h; ++y) {
    for(int x=0; x<w; ++x) {
        setPixel(x,y,c);
    }
}
Hidden here is a lot of duplicated work, like the color conversion and the pixel-destination offset calculation for each pixel.
much better would be:
uint8_t c = col32to8(pixel->c); //color conversion one time, not per pixel
uint8_t *p = tex_memory; //start of the one big pixel block; here you can jump one time to the correct pixel location in case you have offsets
for(int y=0; y<h; ++y) {
    for(int x=0; x<w; ++x) {
        *(p++) = c;
    }
    //here you can also use an offset to jump to the next line in the case you don't want to fill the whole line
}
Here the color conversion is done one time, and the address you want to write to is calculated one time during setup (maybe with
offsets for x,y positions) instead of repeating more or less the same calculation for each pixel. calculating the destination
address usually involves a multiplication like yPos*width+x, which in this case is also only needed one time. and jumping from
one line to the next is also faster if you have one big block for your screen: it is only an addition instead of the
multiplication you get when doing the complete offset calculation for each line.
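The "offset" comments in the loop above can be sketched like this for a rectangle that does not cover the whole line. fill_rect8 and its parameters are made-up names, the buffer is assumed to be one contiguous block of 8-bit pixels with `stride` pixels per line, and the color is assumed to be converted already by the caller:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* fill a w x h rectangle at (x0, y0); no clipping, the caller must keep the
   rectangle inside the buffer */
void fill_rect8(uint8_t *buf, int stride, int x0, int y0, int w, int h, uint8_t c) {
    assert(x0 >= 0 && y0 >= 0 && w >= 0 && h >= 0 && x0 + w <= stride);

    /* the only multiplication: offset of the first pixel */
    size_t off = (size_t)y0 * (size_t)stride + (size_t)x0;

    for (int y = 0; y < h; ++y) {
        uint8_t *p = buf + off;       /* start of this line's span */
        for (int x = 0; x < w; ++x) {
            *p++ = c;                 /* inner loop: one store, one increment */
        }
        off += (size_t)stride;        /* next line is just an addition */
    }
}
```

Compared to calling setPixel(x,y,c) per pixel, the multiplication and the color conversion have moved out of both loops.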
4.) optimize renderLine
first thing: you first check if the line is outside the screen and then if it's outside of the clipRectangle. Combine that: if
the clipRectangle is not set, assign the screen coordinates to the clipRectangle, which removes one check. The same applies here
as for the setPixel function: inline setPixel directly here so the screen offset is calculated only one time, and increment the
pointer when you want to go to the next pixel in the line instead of recalculating the complete offset into the buffer.
and calculate the converted color only one time for the whole line instead of per pixel.
Another approach could be to use specialized renderLine versions for each pixel format you have, to get a fast inner loop.
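A sketch of how that could look, under the assumption that renderLine draws a horizontal span. ClipRect and hline8 are hypothetical names, the clip rectangle uses exclusive right/bottom edges, and when no clip rectangle is set the caller would simply pass {0, 0, screen_width, screen_height}:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical clip rectangle; x1/y1 are exclusive */
typedef struct { int x0, y0, x1, y1; } ClipRect;

/* draw the span [xa, xb) on line y with an already converted 8-bit color */
void hline8(uint8_t *buf, int stride, ClipRect clip, int xa, int xb, int y, uint8_t c) {
    /* the clip rectangle must lie inside the buffer, so one test covers
       both "outside the screen" and "outside the clip rectangle" */
    assert(clip.x0 >= 0 && clip.y0 >= 0 && clip.x1 <= stride);

    if (y < clip.y0 || y >= clip.y1) return;
    if (xa < clip.x0) xa = clip.x0;
    if (xb > clip.x1) xb = clip.x1;
    if (xa >= xb) return;

    uint8_t *p = buf + (size_t)y * (size_t)stride + (size_t)xa; /* offset one time */
    uint8_t *end = p + (xb - xa);
    while (p != end) {
        *p++ = c;
    }
}
```

A 16-bit version would look the same with uint16_t pointers, which is the "specialized version per pixel format" idea.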
5.) optimize renderTriangle
I'm not sure what exactly is done here, but to me it seems there are cases with 4 square-root calls per line, which is really
expensive! Usually you calculate your deltas two times for a normal triangle and only step these values per line (and per pixel
for textures), without using perspective correct textures (which are not needed in this more or less 2d case). some more
background in this very old but still useful article about texture mapping: http://www.multi.fi/~mbc/sources/fatmap.txt
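To show what "calculate the deltas and only step them per line" means, here is a rough flat-color sketch with 16.16 fixed-point edge stepping and no square roots. This is not the ImDuino renderTriangle; Pt, edge_step and fill_triangle8 are made-up names, there is no clipping (all vertices must be inside the buffer), and the half-open spans are my own choice. A texture mapped version would step u and v the same way:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int x, y; } Pt;

void swap_pt(Pt *a, Pt *b) { Pt t = *a; *a = *b; *b = t; }

/* x change per scanline along the edge a->b as 16.16 fixed point (a.y != b.y) */
int32_t edge_step(Pt a, Pt b) {
    return (int32_t)(((int64_t)(b.x - a.x) * 65536) / (b.y - a.y));
}

void fill_triangle8(uint8_t *buf, int stride, Pt a, Pt b, Pt c, uint8_t col) {
    /* sort so that a.y <= b.y <= c.y */
    if (b.y < a.y) swap_pt(&a, &b);
    if (c.y < a.y) swap_pt(&a, &c);
    if (c.y < b.y) swap_pt(&b, &c);
    assert(a.x >= 0 && b.x >= 0 && c.x >= 0 && a.y >= 0);
    assert(a.x < stride && b.x < stride && c.x < stride && stride <= 32767);
    if (a.y == c.y) return;                  /* zero height, nothing to draw */

    int32_t step_long = edge_step(a, c);     /* long edge: delta computed once */
    int32_t x_long = a.x * 65536;
    size_t off = (size_t)a.y * (size_t)stride;

    for (int half = 0; half < 2; ++half) {   /* upper part a->b, lower part b->c */
        Pt top = half ? b : a;
        Pt bot = half ? c : b;
        if (top.y == bot.y) continue;        /* flat top or flat bottom */
        int32_t step_short = edge_step(top, bot);
        int32_t x_short = top.x * 65536;
        for (int y = top.y; y < bot.y; ++y) {
            int xl = x_long >> 16, xr = x_short >> 16;
            if (xl > xr) { int t = xl; xl = xr; xr = t; }
            uint8_t *p = buf + off + xl;
            for (int n = xr - xl; n > 0; --n) *p++ = col;  /* span [xl, xr) */
            x_long += step_long;             /* per line: only additions */
            x_short += step_short;
            off += (size_t)stride;
        }
    }
}
```

The divisions happen at most three times per triangle (once for the long edge, once per short edge); everything per line and per pixel is additions and stores.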
In general: try to keep your inner loops as fast as possible. in this case your inner loops are (from most important to optimize
to not so important): the set-pixel routine, which should never be a separate function, the pixel inner loop in the
horizontal-line routine, the line routine and the triangle setup. And try to do each calculation only one time, when it's needed.
so maybe write extra triangle routines for fixed color versions, texture mapped versions and so on, so you don't have to convert
a pixel color again and again even if it's static for the complete triangle (otherwise you spend time doing the same work again
and again for a fixed value).
Hope that helps :)