questor · December 17, 2020 18:20
diff --git a/gistfile1.txt b/gistfile1.txt

 I want to show how you can optimize the rendering functions without going into changes on the upper level (like removing round 
 stuff and so on), and I also want to teach the ideas behind the optimizations I'm proposing. I have no esp32 board to test my 
 changes, so I will try to teach the ideas and let others test the optimizations. And I don't want to write a fast software 
 texture mapper (again) without explaining why it's much faster than the current source code.

 1.) optimizing color conversion
 this should be done automatically by your compiler (hopefully, has to be checked at assembly level!), but some background:
 multiplications and divisions by a power of 2 can be replaced by bit-shifts. 255 is not a power of 2, but dividing by 256 (a 
 bitshift of 8 bits to the right) is a close approximation: the result is at most one lower (for c=255 you get 30 instead of 31). 
 so looking at the first color conversion routine:
    return (((c * 31) / 255) << 11) |
           (((c * 63) / 255) << 5) |
            ((c * 31) / 255);
 is approximately: (((c*31)>>8)<<11) and some bitshifts can be removed because you can also do ((c*31)<<3)&0xf800 (the and-operation 
 is needed to clear the bits that were originally cleared by the shift-operation). next thing (should also be done by your compiler): on 
 lower-end cpus multiplications are expensive and can be substituted with additions and shifts, for example a*31 can also be written 
 as ((a<<5)-a) which could be faster depending on the timings of your multiplication instruction (a<<5 calculates a*32 and
 subtracting a once gives the correct result).
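
 A minimal sketch of the points above, for checking the arithmetic on a PC. The function names are made up for this example and 
 are not from softraster.h; only the red channel of the 565 format is shown, the other channels work the same way.

```c
#include <stdint.h>

/* Exact 8-bit -> 5-bit scaling, as in the original routine. */
uint16_t scale5_exact(uint8_t c) { return (uint16_t)((c * 31) / 255); }

/* Approximation: divide by 256 with a shift instead of dividing by 255. */
uint16_t scale5_shift(uint8_t c) { return (uint16_t)((c * 31) >> 8); }

/* Red channel of RGB565: scale, then move the result into bits 11..15. */
uint16_t red_shifted(uint8_t c) { return (uint16_t)(((c * 31) >> 8) << 11); }

/* Same value with a single shift and a mask that clears the low 11 bits. */
uint16_t red_masked(uint8_t c) { return (uint16_t)(((c * 31) << 3) & 0xF800); }

/* a*31 written as a*32 - a (shift and subtract instead of multiply). */
uint32_t mul31(uint32_t a) { return (a << 5) - a; }
```

 Whether any of this beats what the compiler already generates has to be checked in the assembly output for your target.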
 
 2.) optimize your textures
 You currently allocate a separate block of memory for every line (column) of your texture, which is overkill and hinders other optimizations
 (https://github.com/LAK132/ImDuino/blob/master/softraster.h#L176). This should be changed to allocate one big block for the 
 complete texture (something like buffer = malloc(x*y*sizeof(one_pixel))). Allocating columns instead of rows is especially bad 
 for your cache (if the esp32 has a cache ;) because you will most likely thrash your cache on EVERY pixel access. and if you have 
 a contiguous block of pixel data you can step from line to line with a fixed offset (usually the line-width) instead of jumping 
 through fragmented memory (with memory-allocation-information in between).
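
 A rough sketch of such a single-block texture, assuming 8-bit pixels stored row after row. Texture8 and its functions are 
 hypothetical names for this example, not the types used in softraster.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int width;
    int height;
    uint8_t *pixels;   /* width*height bytes in one block, row after row */
} Texture8;

/* Allocates the whole texture with a single call. Returns 1 on success. */
int texture8_init(Texture8 *t, int width, int height) {
    t->width = 0;
    t->height = 0;
    t->pixels = NULL;
    if (width <= 0 || height <= 0) return 0;
    t->pixels = calloc((size_t)width * (size_t)height, sizeof *t->pixels);
    if (t->pixels == NULL) return 0;
    t->width = width;
    t->height = height;
    return 1;
}

/* Start of row y; the next row is always exactly `width` bytes further. */
uint8_t *texture8_row(const Texture8 *t, int y) {
    return t->pixels + (size_t)y * (size_t)t->width;
}

void texture8_free(Texture8 *t) {
    free(t->pixels);
    t->pixels = NULL;
    t->width = 0;
    t->height = 0;
}
```

 With this layout one malloc/free pair replaces one per line, and stepping to the next row is a plain pointer addition.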

 3.) optimize renderPixel
 by removing the function. if you use a renderPixel function to draw every single pixel, a lot of calculations are done per 
 pixel which are only needed once, or once per horizontal line (horizontal to please your cache). A little example:
    for(int y=0; y<h; ++y) {
      for(int x=0; x<w; ++x) {
        setPixel(x,y,c);
      }
    }
 Hidden here is a lot of duplicated work, like the color conversion and the destination offset calculation for each pixel. much 
 better would be:
     uint8_t c = col32to8(pixel->c); //convert the color once, outside of both loops
     uint8_t *p = tex_memory;        //here you can jump one time to the correct pixel location in case you have offsets
     for(int y=0; y<h; ++y) {
       for(int x=0; x<w; ++x) {
         *(p++) = c;                 //one store and one pointer increment per pixel
       }
       //here you can also use an offset to jump to the next line in the case you don't want to fill the whole line
     }
 Here the color conversion happens only once, and the address you want to write to is also calculated only once during setup (maybe 
 with offsets for x,y positions) instead of repeating more or less the same calculation for each pixel. calculating the 
 destination address usually involves a multiplication like yPos*width+x, which in this case is also only needed once. and jumping
 from one line to the next is also faster if you have one big block for your screen: it is only an addition instead of the 
 multiplication you get when doing the complete offset calculation for each line.
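
 The same idea with the "offset to the next line" filled in, as a sketch. fill_rect8 and its parameters are made-up names, the 
 buffer is assumed to be 8-bit and contiguous with `stride` bytes per line, and the caller is assumed to have clipped the rectangle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills a w*h rectangle at (x0, y0) with the already converted color c. */
void fill_rect8(uint8_t *buf, int stride, int x0, int y0, int w, int h, uint8_t c) {
    assert(buf != NULL && stride > 0);

    /* The only multiplication: the address of the first pixel. */
    uint8_t *row = buf + (size_t)y0 * (size_t)stride + (size_t)x0;

    for (int y = 0; y < h; ++y) {
        uint8_t *p = row;
        for (int x = 0; x < w; ++x) {
            *p++ = c;        /* one store and one increment per pixel */
        }
        row += stride;       /* next line: a single addition */
    }
}
```

 For a full-width fill, row and p never diverge and the whole thing collapses into one linear run over the buffer.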

 4.) optimize renderLine
 first thing: you first check if the line is outside the screen and then if it's outside of the clipRectangle. Combine that: 
 if the clipRectangle is not set, assign the screen coords to the clipRectangle so only one check is needed. The same applies here 
 as for the setPixel function: inline the setPixel code directly here so the screen offset is calculated only once, and increment the 
 pointer to go to the next pixel in the line instead of recalculating the complete offset into the buffer. 
 and calculate the converted color only once for the whole line instead of per pixel.
 Another approach could be to write specialized renderLine versions for each pixel format you have, to get a fast inner loop.
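
 A sketch of what that could look like for a horizontal line in an 8-bit buffer. This assumes renderLine draws horizontal spans; 
 ClipRect and hline8 are hypothetical names, and the caller is expected to pass the screen rectangle as clip when no 
 clipRectangle is set, so there is only one clipping test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open rectangle: x1 and y1 are one past the last valid pixel. */
typedef struct { int x0, y0, x1, y1; } ClipRect;

/* Draws pixels xa..xb-1 on line y with the already converted color c. */
void hline8(uint8_t *buf, int stride, ClipRect clip, int xa, int xb, int y, uint8_t c) {
    assert(buf != NULL && stride > 0);

    /* One combined test against the clip rectangle. */
    if (y < clip.y0 || y >= clip.y1) return;
    if (xa < clip.x0) xa = clip.x0;
    if (xb > clip.x1) xb = clip.x1;
    if (xa >= xb) return;

    /* The offset is computed once, then only the pointer moves. */
    uint8_t *p = buf + (size_t)y * (size_t)stride + (size_t)xa;
    uint8_t *end = p + (xb - xa);
    while (p < end) {
        *p++ = c;
    }
}
```

 A version for 16-bit or 32-bit pixels would be a copy of this with a different pointer type, which is the "specialized 
 version per pixel format" idea.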

 5.) optimize renderTriangle
 I'm not sure what is done here, but to me it seems there are cases with 4 square-root calls per line, which is really 
 expensive! Usually you calculate your deltas two times for a normal triangle and then only step these values per line (and per pixel 
 for textures), without perspective-correct texturing (which is not needed in this more or less 2d case). some more 
 background in this very old but still useful article about texture mapping: http://www.multi.fi/~mbc/sources/fatmap.txt
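
 To make the "calculate the deltas once, then only step" idea concrete, here is a sketch for the simplest case: a single-color 
 triangle with a horizontal bottom edge, using 16.16 fixed point. fill_flat_bottom8 is a made-up name, coordinates are assumed to be 
 non-negative, below 32768 and already inside the buffer (no clipping), and this is not how softraster.h is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FP_ONE 65536   /* 1.0 in 16.16 fixed point */

/* Apex at (xt, yt); bottom edge on line yb from xl (left) to xr (right).
 * Fills lines yt..yb-1, pixels left..right-1 on each line. */
void fill_flat_bottom8(uint8_t *buf, int stride,
                       int xt, int yt, int xl, int xr, int yb, uint8_t c) {
    assert(buf != NULL && stride > 0);
    int h = yb - yt;
    if (h <= 0) return;

    /* Setup: the only divisions, done once per edge. */
    int32_t left = (int32_t)xt * FP_ONE;
    int32_t right = left;
    int32_t left_step = ((int32_t)(xl - xt) * FP_ONE) / h;
    int32_t right_step = ((int32_t)(xr - xt) * FP_ONE) / h;

    uint8_t *row = buf + (size_t)yt * (size_t)stride;
    for (int y = 0; y < h; ++y) {
        uint8_t *p = row + (left >> 16);
        uint8_t *end = row + (right >> 16);
        while (p < end) {
            *p++ = c;
        }
        /* Per line: three additions, no sqrt, no division. */
        left += left_step;
        right += right_step;
        row += stride;
    }
}
```

 A general triangle can be split into a flat-bottom and a flat-top part; for textures the u,v coordinates get their own deltas 
 and are stepped in the same way.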

 In general: try to keep your inner loops as fast as possible. in this case your inner loops are (from most important to optimize 
 to not so important): the set-pixel routine, which should never be a separate function, the pixel inner loop in the horizontal-line 
 routine, the line routine and the triangle setup. And try to do each calculation only once, when it's needed, so maybe write extra 
 triangle routines for fixed-color versions, texture-mapped versions and so on, so you don't have to convert a pixel color again and 
 again even if it's static for the complete triangle (otherwise you spend time doing the same work 
 again and again for a fixed value). 

 Hope that helps :)
