Created April 6, 2015 22:42
This issue behaves as intended based on the following:

REP STOSB is significantly faster than AVX stores when streaming through the LLC to memory, because it uses a special store semantic that avoids the need to read-for-ownership. There’s no way to get this semantic other than using REP STOSB. We can go faster than REP STOSB using VMOVAPS when hitting L1 cache, which is why we use it for small buffers only.
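
For readers unfamiliar with the instruction, the following is a minimal sketch of what zeroing a buffer with REP STOSB looks like from C. It is not the system's bzero() implementation; the helper name `zero_rep_stosb` is hypothetical, the inline assembly assumes GCC or Clang targeting x86, and other targets fall back to memset().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a library function): zero `len` bytes at `dst`
   using REP STOSB. Assumes GCC/Clang inline-asm syntax on x86; any other
   target simply calls memset(). */
static void zero_rep_stosb(void *dst, size_t len)
{
    assert(dst != NULL || len == 0);
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* REP STOSB stores AL to [RDI] and advances RDI, repeating RCX times.
       The ABI guarantees the direction flag is clear on function entry. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(len)   /* both are modified */
                     : "a"(0)                 /* fill byte in AL   */
                     : "memory");
#else
    memset(dst, 0, len);
#endif
}
```

Calling `zero_rep_stosb(buf, n)` has the same visible effect as `memset(buf, 0, n)`; any speed difference depends on the CPU and on how the memory is mapped.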

Note that REP STOSB *is* slower than VMOVAPS when the buffer is memory marked uncacheable. Unfortunately, there’s no way for bzero() to efficiently check how the memory is mapped, and switch into a different implementation, so when working with uncacheable memory you will want to use your own zeroing implementation, implemented using either VMOVAPS or VMOVNTDQ (the non-temporal store; VMOVNTDQA is a load). Note that you should be using your own zeroing implementation with noncacheable memory anyway, as the behavior of memset() is formally undefined by the C standard when using memory that didn’t come from “normal” C (stack or malloc) allocations.
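
A custom zeroing routine of the kind suggested above might look like the sketch below. To compile without extra flags it uses the SSE2 non-temporal store (MOVNTDQ, via `_mm_stream_si128`) rather than the AVX forms named in the response; the AVX equivalent would use `_mm256_stream_si256` with 32-byte alignment. The helper name `zero_explicit` and its alignment requirements are assumptions made for illustration, and targets without SSE2 fall back to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not a library function): zero `len` bytes at `dst`
   with explicit stores instead of calling memset()/bzero().
   Assumption: `dst` is 16-byte aligned and `len` is a multiple of 16. */
static void zero_explicit(void *dst, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0 && (len % 16) == 0);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(p + i, zero);  /* MOVNTDQ: non-temporal 16-byte store */
    _mm_sfence();                       /* order the streaming stores */
#else
    /* Portable fallback: volatile keeps the compiler from turning the
       loop back into a memset() call. */
    volatile unsigned char *p = (volatile unsigned char *)dst;
    for (size_t i = 0; i < len; i++)
        p[i] = 0;
#endif
}
```

Whether this outperforms REP STOSB on a given mapping should be measured on the target hardware rather than assumed.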

Please update your bug report to let us know if this is still an issue for you.