@johnplatts
Contributor

Added the StreamLoad op. SSE4/AVX2/AVX3/PPC have non-temporal aligned load instructions for vectors that are 16 bytes or larger, and SVE has non-temporal load instructions for all vector sizes.

@jan-wassenberg
Member

I'm concerned about performance and correctness on x86. _mm_stream_load_si128 is super slow (hundreds of cycles) and is only really intended for WC (write-combining) memory, i.e. memory-mapped I/O. It does seem useful for drivers that actually do want to bulk-load from WC memory: https://community.intel.com/t5/Intel-ISA-Extensions/Do-Non-Temporal-Loads-Prefetch/m-p/1027104
Is that the intended use case?

If so, then we also have errata HSD162, BDM116 and SKL079 to deal with, concerning ordering with respect to LOCK and MFENCE. Possibly we can just document that.

If the hope is rather that, when we load from normal WB (write-back) memory, the cache line is marked as preferred for discarding, do we have evidence of a benefit? The past few times I've tried this and similar things, I was disappointed.

Possible options: rely on prefetches to set the hint we'd like before the actual load, and/or make the x86 StreamLoad equivalent to Load if you'd still like to target the SVE instruction. What do you think?
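
To make those two options concrete, here is a rough, standalone sketch, not code from this PR. It does not use the Highway API: `Vec128`, `PrefetchNonTemporal`, `StreamLoadAsLoad`, `SumBytes` and the prefetch distance are all hypothetical names and values chosen for illustration. The "stream load" is just a plain load, and the non-temporal hint is issued separately via `__builtin_prefetch` (GCC/Clang), which with locality 0 typically lowers to a non-temporal prefetch on x86.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte vector type standing in for a real SIMD vector.
struct Vec128 {
  std::uint8_t bytes[16];
};

// Hypothetical helper: hints that the cache line at `p` will be read once.
// The hint is optional; compilers without the builtin simply skip it.
inline void PrefetchNonTemporal(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
  // rw = 0 (read), locality = 0 (no temporal locality).
  __builtin_prefetch(p, 0, 0);
#else
  (void)p;
#endif
}

// Sketch of the "StreamLoad is equivalent to Load" option: a plain load.
inline Vec128 StreamLoadAsLoad(const std::uint8_t* from) {
  Vec128 v;
  std::memcpy(v.bytes, from, sizeof(v.bytes));
  return v;
}

// Sums all bytes in whole 16-byte blocks; any remainder is ignored.
// Prefetches ahead with the non-temporal hint, then loads normally.
inline std::uint64_t SumBytes(const std::uint8_t* data, std::size_t size) {
  // Assumed prefetch distance in bytes; would need tuning per platform.
  constexpr std::size_t kPrefetchDistance = 256;
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i + 16 <= size; i += 16) {
    if (i + kPrefetchDistance < size) {
      PrefetchNonTemporal(data + i + kPrefetchDistance);
    }
    const Vec128 v = StreamLoadAsLoad(data + i);
    for (std::uint8_t b : v.bytes) sum += b;
  }
  return sum;
}
```

Whether the hint helps at all on WB memory is exactly the open question above, so a sketch like this would only be useful as a starting point for benchmarking.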
