
Read performance with/without missing #264

Open
milankl opened this issue Sep 2, 2024 · 3 comments

Comments

milankl commented Sep 2, 2024

(Motivated from #227 (comment))

Creating a fake dataset with some compression like

using NCDatasets
A = rand(Float32, 5000, 5000)    # 100MB uncompressed
sort!(vec(A))                    # make it somewhat compressible

ds = NCDataset("test.nc", "c")
defVar(ds, "data", A, ("x", "y"), attrib = Dict("_FillValue"=>NaN32), deflatelevel=3)
close(ds)
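As an aside, the compression factor can be computed with a small helper (a sketch; filesize is from Julia Base and returns the size in bytes, and compression_factor is a hypothetical name):

```julia
# ratio of the in-memory array size to the file size on disk
compression_factor(path, nbytes_uncompressed) = nbytes_uncompressed / filesize(path)

# for the 5000×5000 Float32 array above:
# compression_factor("test.nc", 5000 * 5000 * sizeof(Float32))
```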

This file is now 24.8MB on disk so ~4x compression factor. Now benchmark the read + decompression

using NCDatasets, BenchmarkTools
ds = NCDataset("test.nc")
1. Read and uncompress the raw data, ignoring any missing values, via .var
julia> @btime A = $ds["data"].var[:];
  533.270 ms (54 allocations: 95.37 MiB)

So almost 200 MB/s, and it only allocates the ~100 MB that the uncompressed array requires.

2. Read and uncompress with the default behaviour, returning a Matrix{Union{Missing, Float32}}
julia> @btime A = $ds["data"][:];
  641.664 ms (55 allocations: 214.58 MiB)

Only a bit slower, but it requires more than twice the memory.

3. Read and uncompress via nomissing(::CFVariable)
julia> @btime A = nomissing($ds["data"])

Takes absolutely forever, don't do this. See #227 (comment) -- maybe add a warning or remove the nomissing(::CFVariable) method?

4. Read and uncompress via nomissing(::Array)
julia> @btime A = nomissing($ds["data"][:]);
  712.682 ms (57 allocations: 309.95 MiB)

A bit slower again, and ~3x the allocations.

5. Read and uncompress via Array(::CFVariable)
julia> @btime A = Array($ds["data"])
  496.846 ms (64 allocations: 214.58 MiB)

Same as (2) but faster?

6. Read and uncompress via Array{T}(::CFVariable), providing the target type T
julia> @btime A = Array{Float32}($ds["data"])

Don't do this either; it also takes forever, probably for the same reason as (3).

milankl (Author) commented Sep 2, 2024

So the question is: where do these additional allocations come from? Could one not write a nomissing function similar to

function ignoremissing(dsvar)
    # there's probably a more elegant way to write a shape-preserving raw data read?
    # this is just because one doesn't know whether to write .var[:, :] or .var[:, :, :], ... depending on ndims
    return reshape(dsvar.var[:], (Base.OneTo(size(dsvar, i)) for i in 1:ndims(dsvar))...)
end

function nomissing(dsvar)
    raw = ignoremissing(dsvar)    # read and allocate the array once

    # warn if the fill value occurs in the data
    missing_value = dsvar.attrib["_FillValue"]
    missing_value in raw && @warn "Missing value in data"
    return raw
end

Benchmarking this, it is as fast as the raw data read and only allocates the array once:

julia> @btime A = nomissing($ds["data"]);
  490.540 ms (58 allocations: 95.37 MiB)
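As an aside on the shape-preserving raw read above: reshape also accepts the size tuple directly, so the OneTo generator splat can be avoided (a sketch in plain Julia; shape_preserving is a hypothetical helper, with raw standing in for dsvar.var[:] and dims for size(dsvar)):

```julia
# reshape takes a Dims tuple directly, no need to splat OneTo ranges
shape_preserving(raw::AbstractVector, dims::Dims) = reshape(raw, dims)

# e.g. a flat read of twelve values back into a 3×4 array
A = shape_preserving(collect(1.0f0:12.0f0), (3, 4))
size(A)    # (3, 4)
```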

milankl (Author) commented Sep 2, 2024

Just realised that the in function cannot check for NaNs (see also JuliaLang/julia#37157), in which case we probably need something like

function value_in(val, collection)
    return !isnothing(findfirst(x -> x === val, collection))
end

(UPDATE: any(x -> x === val, collection) seems to be another option.)

which, however, is currently some 50% slower, but at least it does not allocate (which I find the higher priority when working with datasets):

julia> @btime A = nomissing2($ds["data"]);
  881.641 ms (59 allocations: 95.37 MiB)
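For reference, the difference between in (which compares with ==) and an identity check (===) shows up already in plain Julia (a minimal sketch, independent of NCDatasets):

```julia
# `in` uses ==, and NaN == NaN is false, so `in` never finds a NaN fill value
value_in(val, collection) = any(x -> x === val, collection)

data = [1.0f0, NaN32, 3.0f0]

NaN32 in data          # false: the == comparison never matches NaN
value_in(NaN32, data)  # true: === compares by identity (bit pattern for isbits types)
```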

Alexander-Barth (Owner) commented Sep 2, 2024

If I understand you well, your use case is to load an array of floats (with a _FillValue set) as efficiently as possible and replace the missing values with NaNs; the call ds["data"].var[:,:] is not sufficient because it does not work when the _FillValue attribute in the NetCDF file is set to something different than NaN?

Within the current API, one can use:

ncv2 = cfvariable(ds, "data", maskingvalue = NaN32)
@btime A = $ncv2[:];    # or Array(ncv2) to preserve the shape
# output: 341.959 ms (28 allocations: 190.74 MiB)

With a small specialization (JuliaGeo/CommonDataModel.jl@ba34d89) for the case where the raw data type equals the transformed data type, I can get this down to:

@btime A = $ncv2[:];
# output: 316.530 ms (26 allocations: 95.37 MiB)

@btime A = Array($ncv2);
# output: 320.732 ms (35 allocations: 95.37 MiB)

This is the same amount of memory as in your use case. Would that work for you?

If for some reason the element type in the NetCDF variable changes to Int32 with a scale factor of, say, 1f-3, the code cfvariable(ds, "data", maskingvalue=NaN32) would still work, but it would again need two large arrays (one Int32 array and one Float32 array).

The keywords of cfvariable allow you to selectively enable and disable the transformations according to the CF conventions (the attributes scale_factor, add_offset, _FillValue, missing_value (similar to _FillValue, but a list of values can be specified), and date-time conversion via the units attribute). They allow you to set any value (missing, NaN, NaN32, nothing, even regular values like 42, even if there is a scale factor or offset) as the special value corresponding to the _FillValue attribute.
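To make the maskingvalue behaviour concrete, a small self-contained sketch (assuming the same NCDatasets API as shown above; tiny.nc is a throwaway file created just for the example):

```julia
using NCDatasets

# create a tiny file with a _FillValue attribute
B = Float32[1 2; 3 4]
ds = NCDataset("tiny.nc", "c")
defVar(ds, "data", B, ("x", "y"), attrib = Dict("_FillValue" => NaN32))
close(ds)

ds = NCDataset("tiny.nc")
# maskingvalue controls what the _FillValue is mapped to on read,
# so the element type no longer needs to include Missing
ncv = cfvariable(ds, "data", maskingvalue = NaN32)
A = Array(ncv)    # Matrix{Float32}
close(ds)
```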

Concerning "2. ... only a bit slower but requires more than twice the memory": yes, there is one array for the raw data and one array for the scaled data following the CF convention. I agree that in this particular case the second array is not needed.
