Description
In a number of game engines I've been optimizing and benchmarking, memcpy() shows up surprisingly high in profiles (~2%-5% of total execution time).
I recently optimized memcpy in Emscripten and added some benchmarks. See emscripten-core/emscripten#4127 (especially http://clb.demon.fi/emcc/memcpy_test/results.html).
The result there was that asm.js+SIMD memcpys are at worst ~3x slower than native at aligned copy sizes of around 4k - 16k bytes, and Wasm memcpys are at worst ~6x slower than native at aligned copies of around the same size.
It is not immediately obvious to me how much of this gap we can close, e.g. by further optimizing the memcpy implementation in Emscripten and by optimizing Wasm implementations in browsers. That raises the question: would it make sense to have a memcpy opcode (and probably, by extension, a memset opcode) directly in wasm?
Currently in Emscripten, very large memcpys round-trip to HEAP8.set(), but this has the drawback of generating temporary garbage.
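
To illustrate where the garbage comes from, here is a minimal sketch of what such a round trip might look like. It assumes the large-copy path is roughly a set() of a subarray() view. The HEAP8 array below is a stand-in for the real heap, and the name memcpyBig is made up for this example, so treat it as an approximation rather than the actual Emscripten code.

```javascript
// Stand-in for the Emscripten heap: one Int8Array view over linear memory.
// (Assumption: the real heap is set up by the runtime, not like this.)
const HEAP8 = new Int8Array(64 * 1024);

// Hypothetical helper: copies `num` bytes from offset `src` to offset `dest`.
function memcpyBig(dest, src, num) {
  // subarray() does not copy bytes, but it does allocate a new typed array
  // view object. That short-lived view is the temporary garbage.
  const sourceView = HEAP8.subarray(src, src + num);
  // set() performs the actual byte copy into the heap at `dest`.
  HEAP8.set(sourceView, dest);
  return dest;
}

// Usage: fill 4096 bytes with a pattern, then copy them to offset 8192.
for (let i = 0; i < 4096; i++) HEAP8[i] = i & 127;
memcpyBig(8192, 0, 4096);
```

Each call allocates one view object, so the cost is per copy rather than per byte, which is presumably why this path only pays off for very large copies.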