Vectorized xoshiro256++

There is some discussion on how to vectorize xoshiro256++ at https://github.com/JuliaLang/julia/issues/27614. The method relies on interleaving 4 xoshiro256++ generators. I implemented it and the results are impressive (see `gen_bytes_fill`):

```
test gen_bytes_chacha12           ... bench:     324,494 ns/iter (+/- 4,236) = 3155 MB/s
test gen_bytes_chacha20           ... bench:     490,442 ns/iter (+/- 16,214) = 2087 MB/s
test gen_bytes_chacha8            ... bench:     243,010 ns/iter (+/- 19,972) = 4213 MB/s
test gen_bytes_fill               ... bench:     105,350 ns/iter (+/- 1,456) = 9719 MB/s
test gen_bytes_pcg64mcg           ... bench:     321,665 ns/iter (+/- 7,854) = 3183 MB/s
test gen_bytes_splitmix64         ... bench:     233,973 ns/iter (+/- 1,859) = 4376 MB/s
test gen_bytes_xoshiro256plusplus ... bench:     343,911 ns/iter (+/- 6,580) = 2977 MB/s
```

The implementation is 3.3 times faster than the non-vectorized xoshiro256++ generator and more than 2.2 times faster than splitmix64 or chacha8. It is also faster than dSFMT. However, the state grows to 128 bytes, which is almost as large as chacha's state (136 bytes).
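
For readers who want to see what "interleaving 4 generators" can look like, here is a minimal sketch. It is not the code that was benchmarked above. The type name `Xoshiro256PlusPlusX4`, its methods, and the splitmix64-based seeding are all assumptions made for illustration. The state is stored word-major (`s[word][lane]`), so each step of the update applies the same operation to four adjacent `u64` values, which the compiler may turn into SIMD instructions. Four lanes of four 64-bit words also account for the 128-byte state.

```rust
/// Four xoshiro256++ generators in struct-of-arrays layout:
/// `s[word][lane]` is state word `word` of generator `lane`.
/// Hypothetical type for illustration only.
pub struct Xoshiro256PlusPlusX4 {
    s: [[u64; 4]; 4],
}

/// One splitmix64 step; used here only to expand a 64-bit seed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

impl Xoshiro256PlusPlusX4 {
    /// Build from four scalar states, `states[lane][word]` (transposed on entry).
    pub fn from_states(states: [[u64; 4]; 4]) -> Self {
        let mut s = [[0u64; 4]; 4];
        for lane in 0..4 {
            for word in 0..4 {
                s[word][lane] = states[lane][word];
            }
        }
        Self { s }
    }

    /// Assumption: seed every lane from consecutive splitmix64 outputs.
    pub fn seed_from_u64(seed: u64) -> Self {
        let mut sm = seed;
        let mut states = [[0u64; 4]; 4];
        for lane in 0..4 {
            for word in 0..4 {
                states[lane][word] = splitmix64(&mut sm);
            }
        }
        Self::from_states(states)
    }

    /// Advance all four generators by one step and return their outputs.
    #[inline]
    pub fn next_x4(&mut self) -> [u64; 4] {
        let s = &mut self.s;
        let mut out = [0u64; 4];
        // Each lane runs the scalar xoshiro256++ update; the lanes are
        // independent, so this loop is a candidate for auto-vectorization.
        for lane in 0..4 {
            out[lane] = s[0][lane]
                .wrapping_add(s[3][lane])
                .rotate_left(23)
                .wrapping_add(s[0][lane]);
            let t = s[1][lane] << 17;
            s[2][lane] ^= s[0][lane];
            s[3][lane] ^= s[1][lane];
            s[1][lane] ^= s[2][lane];
            s[0][lane] ^= s[3][lane];
            s[2][lane] ^= t;
            s[3][lane] = s[3][lane].rotate_left(45);
        }
        out
    }

    /// Fill `dest` 32 bytes at a time (little-endian, lane 0 first).
    pub fn fill_bytes(&mut self, dest: &mut [u8]) {
        let mut chunks = dest.chunks_exact_mut(32);
        for chunk in &mut chunks {
            let out = self.next_x4();
            for (bytes, word) in chunk.chunks_exact_mut(8).zip(out.iter()) {
                bytes.copy_from_slice(&word.to_le_bytes());
            }
        }
        // Partial tail: generate one more block and copy only what is needed.
        let rem = chunks.into_remainder();
        if !rem.is_empty() {
            let out = self.next_x4();
            let mut buf = [0u8; 32];
            for (bytes, word) in buf.chunks_exact_mut(8).zip(out.iter()) {
                bytes.copy_from_slice(&word.to_le_bytes());
            }
            let n = rem.len();
            rem.copy_from_slice(&buf[..n]);
        }
    }
}
```

Typical use would be `let mut rng = Xoshiro256PlusPlusX4::seed_from_u64(42); rng.fill_bytes(&mut buf);`. Whether the inner loop is actually vectorized depends on the target features and optimization level, so any speedup should be confirmed with a benchmark.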
