
Read performance with/without missing #264

Open
milankl opened this issue Sep 2, 2024 · 3 comments

Comments

milankl commented Sep 2, 2024

(Motivated from #227 (comment))

Creating a fake dataset with some compression like

using NCDatasets
A = rand(Float32, 5000, 5000)    # 100MB uncompressed
sort!(vec(A))                    # make it somewhat compressible

ds = NCDataset("test.nc", "c")
defVar(ds, "data", A, ("x", "y"), attrib = Dict("_FillValue"=>NaN32), deflatelevel=3)
close(ds)
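As an aside, the compression factor can be computed with a small helper (a sketch; filesize is from Julia Base and returns the size in bytes, and compression_factor is a hypothetical name):

```julia
# ratio of the in-memory array size to the file size on disk
compression_factor(path, nbytes_uncompressed) = nbytes_uncompressed / filesize(path)

# for the 5000×5000 Float32 array above:
# compression_factor("test.nc", 5000 * 5000 * sizeof(Float32))
```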

This file is now 24.8MB on disk so ~4x compression factor. Now benchmark the read + decompression

using NCDatasets, BenchmarkTools
ds = NCDataset("test.nc")
1. Read and uncompress the raw data, ignoring any missing values, via .var
julia> @btime A = $ds["data"].var[:];
  533.270 ms (54 allocations: 95.37 MiB)

So almost 200 MB/s, and it only allocates the ~100 MB that the uncompressed array requires.

2. Read and uncompress with the default behaviour, returning a Matrix{Union{Missing, Float32}}
julia> @btime A = $ds["data"][:];
  641.664 ms (55 allocations: 214.58 MiB)

Only a bit slower, but it requires more than twice the memory.

3. Read and uncompress via nomissing(::CFVariable)
julia> @btime A = nomissing($ds["data"])

Takes absolutely forever, don't do this. See #227 (comment) -- maybe add a warning or remove the nomissing(::CFVariable) method?

4. Read and uncompress via nomissing(::Array)
julia> @btime A = nomissing($ds["data"][:]);
  712.682 ms (57 allocations: 309.95 MiB)

A bit slower again, and ~3x the allocations.

5. Read and uncompress via Array(::CFVariable)
julia> @btime A = Array($ds["data"])
  496.846 ms (64 allocations: 214.58 MiB)

Same as (2) but faster?

6. Read and uncompress via Array{T}(::CFVariable), providing the target type T
julia> @btime A = Array{Float32}($ds["data"])

Don't do this either; it also takes forever, probably for the same reason as (3).

milankl (Author) commented Sep 2, 2024

So the question is: where do these additional allocations come from? Could one not write a nomissing function similar to

function ignoremissing(dsvar)
    # there's probably a more elegant way to write a shape-preserving raw data read?
    # this is just because one doesn't know whether to write .var[:, :] or .var[:, :, :], ... depending on ndims
    return reshape(dsvar.var[:], (Base.OneTo(size(dsvar, i)) for i in 1:ndims(dsvar))...)
end

function nomissing(dsvar)
    raw = ignoremissing(dsvar)    # read and allocate the array once

    # warn if the fill value occurs in the data
    missing_value = dsvar.attrib["_FillValue"]
    missing_value in raw && @warn "Missing value in data"
    return raw
end

Benchmarking this, it is as fast as the raw data read and only allocates the array once:

julia> @btime A = nomissing($ds["data"]);
  490.540 ms (58 allocations: 95.37 MiB)
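As an aside on the shape-preserving raw read above: reshape also accepts the size tuple directly, so the OneTo generator splat can be avoided (a sketch in plain Julia; shape_preserving is a hypothetical helper, with raw standing in for dsvar.var[:] and dims for size(dsvar)):

```julia
# reshape takes a Dims tuple directly, no need to splat OneTo ranges
shape_preserving(raw::AbstractVector, dims::Dims) = reshape(raw, dims)

# e.g. a flat read of twelve values back into a 3×4 array
A = shape_preserving(collect(1.0f0:12.0f0), (3, 4))
size(A)    # (3, 4)
```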

milankl (Author) commented Sep 2, 2024

Just realised that the in function cannot check for NaNs (see also JuliaLang/julia#37157), in which case we probably need something like

function value_in(val, collection)
    return !isnothing(findfirst(x -> x === val, collection))
end

(UPDATE: any(x -> x === val, collection) seems to be another option.)

which, however, is currently some 50% slower, but at least it does not allocate (which I find the higher priority when working with datasets):

julia> @btime A = nomissing2($ds["data"]);
  881.641 ms (59 allocations: 95.37 MiB)
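For reference, the difference between in (which compares with ==) and an identity check (===) shows up already in plain Julia (a minimal sketch, independent of NCDatasets):

```julia
# `in` uses ==, and NaN == NaN is false, so `in` never finds a NaN fill value
value_in(val, collection) = any(x -> x === val, collection)

data = [1.0f0, NaN32, 3.0f0]

NaN32 in data          # false: the == comparison never matches NaN
value_in(NaN32, data)  # true: === compares by identity (bit pattern for isbits types)
```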

Alexander-Barth (Owner) commented Sep 2, 2024

If I understand you well, your use case is to load an array of floats (with a _FillValue set) as efficiently as possible and replace the missing values with NaNs; the call ds["data"].var[:,:] is not sufficient because it does not work when the _FillValue attribute in the NetCDF file is set to something different than NaN?

Within the current API, one can use:

ncv2 = cfvariable(ds, "data", maskingvalue = NaN32)
@btime A = $ncv2[:];    # or Array(ncv2) to preserve the shape
# output: 341.959 ms (28 allocations: 190.74 MiB)

With a small specialization (JuliaGeo/CommonDataModel.jl@ba34d89) for the case where the raw data type equals the transformed data type, I can get this down to:

@btime A = $ncv2[:];
# output: 316.530 ms (26 allocations: 95.37 MiB)

@btime A = Array($ncv2);
# output: 320.732 ms (35 allocations: 95.37 MiB)

This is the same amount of memory as in your use case. Would that work for you?

If for some reason the element type in the NetCDF variable changes to Int32 with a scale factor of, say, 1f-3, the code cfvariable(ds, "data", maskingvalue=NaN32) would still work, but it would again need two large arrays (one Int32 array and one Float32 array).

The keywords of cfvariable allow you to selectively enable and disable the transformations according to the CF conventions (the attributes scale_factor, add_offset, _FillValue, missing_value (similar to _FillValue, but a list of values can be specified), and date-time conversion via the units attribute). They allow you to set any value (missing, NaN, NaN32, nothing, even regular values like 42, even if there is a scale factor or offset) as the special value corresponding to the _FillValue attribute.
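To make the maskingvalue behaviour concrete, a small self-contained sketch (assuming the same NCDatasets API as shown above; tiny.nc is a throwaway file created just for the example):

```julia
using NCDatasets

# create a tiny file with a _FillValue attribute
B = Float32[1 2; 3 4]
ds = NCDataset("tiny.nc", "c")
defVar(ds, "data", B, ("x", "y"), attrib = Dict("_FillValue" => NaN32))
close(ds)

ds = NCDataset("tiny.nc")
# maskingvalue controls what the _FillValue is mapped to on read,
# so the element type no longer needs to include Missing
ncv = cfvariable(ds, "data", maskingvalue = NaN32)
A = Array(ncv)    # Matrix{Float32}
close(ds)
```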

Concerning "2. ... only a bit slower but requires more than twice the memory": yes, there is one array for the raw data and one array for the scaled data following the CF convention. I agree that in this particular case the second array is not needed.
