Whenever I write code, I habitually try to minimize the number of memory allocations the code needs to fulfill its purpose.
Memory allocations are very slow:
- They are complicated, so the CPU has to execute a bunch of instructions inside `malloc`.
- That mass of instructions occupies a bunch of memory, which must be pulled into the CPU's instruction cache (icache) whenever it executes. Loading the allocator will (probably) be a cache miss, and it will evict a bunch of "hot" code that will then miss again once the allocator is done. Cache misses are SLOW.
- If the allocator can't serve the request from its own pool, it has to make a system call and execute kernel code to retrieve more memory from the OS. Suddenly we've gone from idyllic single-threaded concerns to an "every thread on the host machine wants this thing that you're accessing" kind of big mess.