rcombs · April 6, 2015 22:42
diff --git a/gistfile1.txt b/gistfile1.txt
 This issue behaves as intended based on the following:

 REP STOSB is significantly faster than AVX stores when streaming through the LLC to memory, because it uses a special store semantic that avoids the need to read-for-ownership.  There’s no way to get this semantic other than using REP STOSB.  We can go faster than REP STOSB using VMOVAPS when hitting L1 cache, which is why we use it for small buffers only.
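For readers who want to time the instruction directly, here is a minimal sketch of issuing REP STOSB from C. It assumes x86-64 and GCC/Clang extended inline assembly; `zero_rep_stosb` is a hypothetical name, not a libc function, and this is not the actual bzero() implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: zero `len` bytes at `dst` with REP STOSB.
 * Assumes x86-64, GCC/Clang inline asm syntax, and that the direction
 * flag is clear (as the System V ABI guarantees at function entry). */
static void zero_rep_stosb(void *dst, size_t len)
{
    /* RDI = destination, RCX = byte count, AL = fill value (0).
     * Both RDI and RCX are modified by the instruction, hence "+". */
    __asm__ volatile ("rep stosb"
                      : "+D"(dst), "+c"(len)
                      : "a"(0)
                      : "memory");
}
```

Comparing this against a vector-store loop over several buffer sizes is one way to see where the crossover between the two approaches falls on a given machine.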

 Note that REP STOSB *is* slower than VMOVAPS when the buffer is in memory marked uncacheable.  Unfortunately, bzero() has no efficient way to check how the memory is mapped and switch to a different implementation, so when working with uncacheable memory you will want to use your own zeroing routine, built on either VMOVAPS or a non-temporal store such as VMOVNTDQ.  Note that you should be using your own zeroing implementation with noncacheable memory anyway, as the behavior of memset() is formally undefined by the C standard when using memory that didn’t come from “normal” C (stack or malloc) allocations.
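A custom zeroing routine of the kind suggested above might look like the following sketch. It is an illustration under stated assumptions, not a tuned implementation: `zero_nontemporal` is a hypothetical name, the buffer must be 16-byte aligned with a length that is a multiple of 16, and it uses the 128-bit SSE2 intrinsic `_mm_stream_si128` so that it builds without extra compiler flags. With AVX enabled, the 256-bit counterpart is `_mm256_stream_si256` from `<immintrin.h>`, which requires 32-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_setzero_si128, _mm_stream_si128 */

/* Hypothetical helper, not part of any libc. Zeroes `len` bytes at `dst`
 * using non-temporal 16-byte stores.
 * Assumptions: dst is 16-byte aligned and len is a multiple of 16. */
static void zero_nontemporal(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert((len & 15) == 0);

    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(p + i, zero);  /* non-temporal store */

    /* Non-temporal stores are weakly ordered; fence before other code
     * (or another agent) relies on the zeroed contents. */
    _mm_sfence();
}
```

Whether this beats a plain aligned-store loop on a particular uncacheable mapping is something to measure on the target hardware rather than assume.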

 Please update your bug report to let us know if this is still an issue for you.