
Vectorized Array writing improves its speed #6


Merged · 10 commits · Jan 22, 2021
Conversation

@ehgus (Collaborator) commented Dec 13, 2020

Hello again!
I recently suffered from slow writing speed for a 1.8 GB 2D-stacked tomography MRC file.
I figured out that the iterative array writing was the bottleneck, so I resolved it by using vectorized operations.

For the benchmark, I added @time to every line of the write method and compared the results against my vectorized code.

function Base.write(io::IO, d::MRCData; compress = :none)
    @time newio = compressstream(io, compress)
    @time h = header(d)
    @time sz = write(newio, h)
    @time sz += write(newio, extendedheader(d))
    @time T = datatype(h)
    @time data = parent(d)
    @time fswap = bswapfromh(h.machst) 
    # @time sz += write(newio, fswap.(T.(data)))   ### this is my vectorized code that will replace the iteration
    @time for i in eachindex(data)
        @inbounds sz += write(newio, fswap(T(data[i])))
    end
    @time close(newio)
    return sz
end

This is the test code:

using MRC

file_name="HeLa.mrc"

for _ in 1:3
    # read and write
    orig_file = read(file_name,MRCData)
    write("copy_$(file_name)",orig_file)
    println("---- done writing ----")
    # compare orig and copied 
    copied_file = read("copy_$(file_name)",MRCData)
    println("writing correctness: $(copied_file == orig_file) \n")
end

Here is the benchmark for the 1.8 GB 2D-stacked tomography MRC file. I list the last of the three repeated test results:

original code

0.000068 seconds (7 allocations: 16.344 KiB)
0.000000 seconds
0.000396 seconds (138 allocations: 11.688 KiB)
0.000117 seconds (2 allocations: 32 bytes)
0.000004 seconds
0.000000 seconds
0.000000 seconds
327.057910 seconds (2.82 G allocations: 42.000 GiB, 4.02% gc time)
0.000108 seconds
---- done writing ----
writing correctness: true

vectorized code

0.000023 seconds (7 allocations: 16.344 KiB)
0.000000 seconds
0.000075 seconds (138 allocations: 11.688 KiB)
0.000064 seconds (2 allocations: 32 bytes)
0.000001 seconds
0.000000 seconds
0.000000 seconds
2.345554 seconds (8 allocations: 1.750 GiB, 12.91% gc time)
11.644342 seconds
---- done writing ----
writing correctness: true

I have also tested with http://ftp.rcsb.org/pub/emdb/structures/EMD-5778/map/emd_5778.map.gz:

original code

0.000229 seconds (27 allocations: 34.172 KiB)
0.000000 seconds
0.000311 seconds (134 allocations: 11.688 KiB)
0.000016 seconds (1 allocation: 16 bytes)
0.000006 seconds
0.000000 seconds
0.000000 seconds
8.186541 seconds (50.33 M allocations: 768.000 MiB, 2.05% gc time)
0.013294 seconds
---- done writing ----
writing correctness: true

vectorized code

0.000043 seconds (27 allocations: 34.172 KiB)
0.000000 seconds
0.000065 seconds (134 allocations: 11.688 KiB)
0.000003 seconds (1 allocation: 16 bytes)
0.000001 seconds
0.000000 seconds
0.000000 seconds
3.910012 seconds (8 allocations: 64.000 MiB, 0.49% gc time)
0.009102 seconds
---- done writing ----
writing correctness: true

@ehgus changed the title from "Vectorized writing" to "Vectorized Array writing improves its speed" on Dec 13, 2020
@sethaxen (Owner)

Thanks for the PR! I can confirm that I'm seeing roughly a 60-fold speed-up with this fix, which is quite surprising, since the old version did not copy the data, while this change does. I thought maybe there was a speed-up for AbstractArrays, but I tried MappedArrays and saw only a 10-fold speed-up.

The way we should probably handle this is to allocate a large but reasonably sized buffer array of type T, copy slices of the byte-swapped and converted output into it, and write each filled buffer to the file. Then we get the best of both worlds: fewer and smaller allocations, with vectorized IO. This will be critical for writing memmapped maps, where presumably the map would take up too much memory to load entirely. A keyword argument could then allow the user to change the buffer size.
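A rough sketch of that buffered idea, just to illustrate it (the helper name, the buffer_len keyword, and the dummy inputs are mine, not part of the package):

function write_buffered(io::IO, data, ::Type{T}, fswap; buffer_len = 4096) where {T}
    # Fixed-size buffer of element type T, so memory use stays bounded even for memmapped data.
    buffer = Vector{T}(undef, buffer_len)
    sz = 0
    for start in 1:buffer_len:length(data)
        stop = min(start + buffer_len - 1, length(data))
        n = stop - start + 1
        # Copy a converted, byte-swapped slice into the buffer ...
        @inbounds for j in 1:n
            buffer[j] = fswap(T(data[start + j - 1]))
        end
        # ... then write the filled part of the buffer in one call.
        sz += n == buffer_len ? write(io, buffer) : write(io, buffer[1:n])
    end
    return sz
end

# Example with dummy data; `identity` stands in for bswapfromh(h.machst).
write_buffered(IOBuffer(), rand(Float64, 10_000), Float32, identity)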

Would you like to tackle this? If not, I can open an alternate PR.

@ehgus (Collaborator, Author) commented Dec 17, 2020

I came up with splitting the big array into 4 kB (4096-byte) vectors.

Why 4 kB?

The reason the vectorized operation is faster is that the CPU offers SIMD instructions. As far as I know, these instructions handle 128–256 bytes of data at once, so I guess Julia will fully utilize them once the size is over 512 bytes.

Since mmap loads the file in 4 kB increments, which is the default memory page size, doing the vectorized operation on 4 kB vectors seems fair enough.
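A minimal standalone sketch of the chunking idea (in the actual PR, newio, T, data, and fswap come from the surrounding write method; the helper below and its dummy inputs are only for illustration):

function write_chunked(io::IO, data, ::Type{T}, fswap; chunk_bytes = 4096) where {T}
    sz = 0
    unit_vsize = chunk_bytes ÷ sizeof(T)  # elements per chunk (4 kB by default)
    for chunk in Iterators.partition(eachindex(data), unit_vsize)
        # One vectorized convert + byte swap + write per chunk instead of per element.
        sz += write(io, fswap.(T.(view(data, chunk))))
    end
    return sz
end

# Example with dummy data; `identity` stands in for bswapfromh(h.machst).
write_chunked(IOBuffer(), rand(Float64, 100_000), Float32, identity)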

The following is the benchmark of write for both a normal file and an mmapped file (the 1.8 GB file):

normal file

0.000011 seconds (7 allocations: 16.344 KiB)
0.000000 seconds
0.000088 seconds (138 allocations: 11.828 KiB)
0.000065 seconds (2 allocations: 32 bytes)
0.000001 seconds
0.000000 seconds
0.000000 seconds
4.080435 seconds (4.13 M allocations: 3.685 GiB, 11.04% gc time)
11.232843 seconds
---- done writing ----
writing correctness: true

mmapped file

0.000038 seconds (7 allocations: 16.344 KiB)
0.000000 seconds
0.000256 seconds (138 allocations: 11.828 KiB)
0.000059 seconds (2 allocations: 32 bytes)
0.000001 seconds
0.000000 seconds
0.000000 seconds
4.276065 seconds (4.13 M allocations: 3.685 GiB, 4.02% gc time)
10.974639 seconds
---- done writing ----
writing correctness: true

You can see that the allocation size is doubled compared to the simple vectorized operation. I believe this causes a slight performance drop for normal file writing, but it is still acceptable to me. I guess the array-to-subarray operation makes a copy to prevent the array from changing while it is being written.

@sethaxen (Owner) left a comment

Sorry for the delayed review! The approach looks right. I had a few suggestions.
I'm not sure about the logic for writing in 4 kB chunks. As suggested, can you make the chunk size a keyword argument? It would also be nice to have benchmarks for different values of the chunk size, ranging from small up to e.g. 8 MB, using BenchmarkTools; @time isn't that reliable for benchmarking. We don't need to benchmark each line, just the entire write call for some MRCData.

Lastly, this repo unfortunately doesn't have a good test suite yet, but could you add a test that, for different chunk sizes, the same data is written to the file? It could just be a roundtrip test that simulates a map, writes it to files with two different chunk sizes, and then checks that the files are the same when read back in.
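Something along these lines could work; this is only a sketch, and it assumes an MRCData(::AbstractArray) constructor and that the chunk size ends up exposed as a keyword (here called buffer_size, which may not be the final name):

using MRC, Test

@testset "write roundtrip with different chunk sizes" begin
    d = MRCData(rand(Float32, 64, 64, 8))  # simulated map; assumes this constructor exists
    mktempdir() do dir
        small = joinpath(dir, "small_chunks.mrc")
        large = joinpath(dir, "large_chunks.mrc")
        # Assumes the filename method forwards keywords to the IO method.
        write(small, d; buffer_size = 64)
        write(large, d; buffer_size = 4096)
        @test read(small, MRCData) == read(large, MRCData)
    end
end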

src/io.jl Outdated
@@ -99,8 +99,14 @@ function Base.write(io::IO, d::MRCData; compress = :none)
T = datatype(h)
data = parent(d)
fswap = bswapfromh(h.machst)
for i in eachindex(data)
@inbounds sz += write(newio, fswap(T(data[i])))
unit_vsize = 4096 ÷ sizeof(T)
@sethaxen (Owner):

It would be nice if the cache size (either 4096 or unit_vsize) was a keyword argument that a user could select. This would also be handy for testing.

@ehgus (Collaborator, Author):

Thank you for your feedback! It's a good idea to make the cache size a keyword argument, and it will also be better for maintenance.

ehgus and others added 3 commits January 14, 2021 14:17
Co-authored-by: Seth Axen <seth.axen@gmail.com>
Co-authored-by: Seth Axen <seth.axen@gmail.com>
Co-authored-by: Seth Axen <seth.axen@gmail.com>
@ehgus (Collaborator, Author) commented Jan 21, 2021

I have benchmarked the write performance as a function of unit_vsize.

Workstation spec

  • Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
  • more than 100 GB RAM
  • CentOS 7

Laptop spec

  • 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  • 8 GB RAM
  • Windows 10

[bench_workstation: write benchmark vs. unit_vsize on the workstation]

[bench_laptop: write benchmark vs. unit_vsize on the laptop]

The x axis is the log2 of unit_vsize, so 12 means 4 kB (2^12 bytes = 4 kB). Looking at the laptop case first, there is a peak at a unit_vsize of 2^14 bytes. Apart from that, the write speed increases as unit_vsize increases, and memory starvation may occur once the size goes over 1 MB.
Combining the results, a 4 kB unit_vsize looks like a reasonable default.
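For reference, the kind of sweep behind these plots could look roughly like this (this is not the actual script used; it again assumes the chunk size is exposed as a keyword, here called buffer_size, and that the filename write method forwards it):

using MRC, BenchmarkTools

d = read("HeLa.mrc", MRCData)
for p in 9:23  # 2^9 B = 512 B up to 2^23 B = 8 MB
    b = @benchmark write("copy_HeLa.mrc", $d; buffer_size = $(2^p)) samples=3 evals=1
    println("2^$p bytes: ", minimum(b.times) / 1e9, " s")  # times are in nanoseconds
end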

@codecov bot commented Jan 22, 2021

Codecov Report

Merging #6 (ad3308c) into master (75a4bfc) will decrease coverage by 0.67%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master       #6      +/-   ##
==========================================
- Coverage   37.17%   36.49%   -0.68%     
==========================================
  Files           6        6              
  Lines         269      274       +5     
==========================================
  Hits          100      100              
- Misses        169      174       +5     
Impacted Files Coverage Δ
src/io.jl 0.00% <0.00%> (ø)

