Materialize DimArray or DimStack From a Table #739
base: main
Co-authored-by: Rafael Schouten <rafaelschouten@gmail.com>
…a.jl into materialize
There are still a few questions that need resolving:
I think it's best to test that the geometry column's elements have Going forward to get the geometry column it's probably best to have
DimensionalData.jl does not depend on GeoInterface
Being able to interact with geometries in a generic fashion would help us interop with other packages. I also see that
It's not about deps or timing, it's about clean feature scope. The promise here is that Rasters (and YAX) have all the geo deps and features, so non-geo people don't have to worry about them. Some of the biggest contributors here use DD for unrelated fields. I would ignore point columns here entirely and instead write the code so it's easy for Rasters to handle them. Taking the underscores off a few functions and documenting them as a real interface will help that.
I've updated the docstrings for both the
This is good, but a few minor changes are needed for it to be correct. Constructed selectors should be allowed so At can have atol. We can just construct selector types at the outer level (they're just filled with nothing).
Then there is a bit more dispatch needed to make the fast paths correct for At/Near/Contains with their standard behaviour.
Yeah we probably need a tolerance check. We could allow specifying if some dimensions are not sorted, like
I've implemented the Here's an example of it in action:
julia> xdims = X(LinRange{Float64}(610000.0, 661180.0, 2560));
julia> ydims = Y(LinRange{Float64}(6.84142e6, 6.79024e6, 2560));
julia> bdims = Dim{:Band}([:B02, :B03, :B04]);
julia> d = DimArray(rand(UInt16, 2560, 2560, 3), (xdims, ydims, bdims));
julia> t = DataFrame(d);
julia> t_rand = Random.shuffle(t);
julia> dims(d)
↓ X Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered
julia> DD.guess_dims(t)
↓ X Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered
julia> DD.guess_dims(t, (X, Y, :Band))
↓ X Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered
julia> DD.guess_dims(t_rand, (X => DD.ForwardOrdered(), Y => DD.ReverseOrdered(), :Band => DD.ForwardOrdered()))
↓ X Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered
julia> DD.guess_dims(t_rand[1:10000,:], (X => DD.ForwardOrdered(), Y => DD.ReverseOrdered(), :Band => DD.ForwardOrdered()))
↓ X Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered
As can be seen in the last example,
This is awesome to have, thanks. It may be a bit before I can fully review and test it but it looks good from a quick skim.
src/array/array.jl
@@ -429,6 +430,13 @@ function DimArray(A::AbstractBasicDimArray;
    newdata = collect(data)
    DimArray(newdata, format(dims, newdata); refdims, name, metadata)
end
# Write a single column from a table with one or more coordinate columns to a DimArray
function DimArray(table, dims; name=NoName(), selector=DimensionalData.Near(), precision=6, kw...)
    data = restore_array(table, dims; selector=selector, missingval=missing, name=name, precision=precision)
We should probably do a Tables.istable check at this point for a nicer error and stack trace, rather than failing deep in the internals.
@@ -420,5 +421,12 @@ function DimStack(data::NamedTuple, dims::Tuple;
    all(map(d -> axes(d) == axes(first(data)), data)) || _stack_size_mismatch()
    DimStack(data, format(dims, first(data)), refdims, layerdims, metadata, layermetadata)
end
# Write each column from a table with one or more coordinate columns to a layer in a DimStack
function DimStack(table, dims::Tuple; selector=DimensionalData.Contains(), kw...)
    data_cols = _data_cols(table, dims)
Again we probably need a Tables.istable check here.
src/table_ops.jl
end

# Determine the ordinality of a set of coordinates
_coords_to_ords(coords::AbstractVector, dim::DD.Dimension, sel::DD.Selector) = _coords_to_ords(coords, dim, sel, DD.locus(dim), DD.span(dim))
Are all of these DD and DimensionalData qualifiers needed?
_guess_dims(coords::AbstractVector, dim::DD.Dimension, args...) = dim
_guess_dims(coords::AbstractVector, dim::Type{<:DD.Dimension}, args...) = _guess_dims(coords, DD.name(dim), args...)
_guess_dims(coords::AbstractVector, dim::Pair, args...) = _guess_dims(coords, first(dim), last(dim), args...)
function _guess_dims(coords::AbstractVector, dim::Symbol, precision::Int)
Should this be AbstractVector{<:Union{Number,DateTime}}?
I'm wondering what happens to strings, symbols and other objects that need to go in Categorical lookups.
It should work with almost any type. If the coordinates are non-numerical, then we will internally dispatch on the following methods:
# Extract all unique coordinates from the given vector
_unique_vals(coords::AbstractVector, precision::Int) = _round_dim_val.(coords, precision) |> unique
# Round dimension value within the specified precision
_round_dim_val(x, ::Int) = x
# Determine if the given coordinates are forward ordered, reverse ordered, or unordered
function _guess_dim_order(coords::AbstractVector)
    if issorted(coords)
        return DD.ForwardOrdered()
    elseif issorted(coords, rev=true)
        return DD.ReverseOrdered()
    else
        return DD.Unordered()
    end
end
# Estimate the span between consecutive coordinates
_guess_dim_span(::AbstractVector, ::DD.Order, ::Int) = DD.Irregular()
_unique_vals will just return all unique values, where _round_dim_val is just the identity function for non-numerical coordinates.
_guess_dim_order should work for anything that can be sorted, which is the case for both String and Symbol.
_guess_dim_span will return DD.Irregular() for non-numerical coordinates.
We used to need a try catch for issorted in case < is not defined for a type.
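A small check of the failure mode mentioned here may help. It assumes complex numbers as a stand-in for any element type without an ordering; the variable names are illustrative only and are not part of the PR.

```julia
# Strings and Symbols define `isless`, so `issorted` works on them.
sorted_strings = issorted(["a", "b", "c"])
sorted_symbols = issorted([:B02, :B03, :B04])

# Complex numbers have no ordering, so `issorted` throws a MethodError
# as soon as it compares two elements.
threw_method_error = try
    issorted([1 + 2im, 3 + 1im])
    false
catch err
    err isa MethodError
end
```

Any order-guessing helper that calls issorted directly would propagate this error for such element types unless it is guarded.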
We can also find regular spans for Dates?
This should let us handle data that can't be sorted:
function _guess_dim_order(coords::AbstractVector)
    try
        if issorted(coords)
            return DD.ForwardOrdered()
        elseif issorted(coords, rev=true)
            return DD.ReverseOrdered()
        else
            return DD.Unordered()
        end
    catch
        return DD.Unordered()
    end
end
And this should retrieve the span from Date and DateTime objects:
function _guess_dim_span(coords::AbstractVector{<:Dates.AbstractTime}, ::DD.Ordered, precision::Int)
    steps = (@view coords[2:end]) .- (@view coords[1:end-1])
    span = argmin(abs, steps)
    return all(isinteger, round.(steps ./ span, digits=precision)) ? DD.Regular(span) : DD.Irregular()
end
However, there seems to be a problem with constructing a LinRange from Date objects:
julia> vals = [Date("2022-11-16") + Day(i * 7) for i in 0:4];
julia> LinRange(first(vals), last(vals), 5)
5-element LinRange{Day, Int64}:
Error showing value of type LinRange{Day, Int64}:
ERROR: InexactError: Int64(553856.25)
Thus, I'm not sure how we should construct a Dimension with regularly spaced dates.
A StepRange works for Dates. Probably we should use StepRangeLen instead of LinRange where possible anyway.
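As a minimal sketch of that suggestion, using only Base and the Dates standard library, the weekly dates from the failing LinRange example can be expressed as a StepRange. The name step_guess is hypothetical and only stands in for whatever step the span detection would produce.

```julia
using Dates

# Weekly dates, as in the LinRange example above.
vals = [Date("2022-11-16") + Day(i * 7) for i in 0:4]

# Subtracting two Dates yields a Day period, which serves as the step.
step_guess = vals[2] - vals[1]

# A StepRange keeps Date endpoints and a Period step,
# so no conversion to floating point is needed.
r = first(vals):step_guess:last(vals)
```

Collecting r gives back the original vector of dates, so the range could stand in as a regular lookup for that dimension.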
That works. Do you want to use StepRangeLen for numerical coordinates or just dates?
Here's the result:
julia> xdims = X(LinRange{Float64}(610000.0, 661180.0, 2560));
julia> ydims = Y(LinRange{Float64}(6.84142e6, 6.79024e6, 2560));
julia> bdims = Dim{:Band}([:B02, :B03, :B04]);
julia> d1 = Date("2024-11-18");
julia> tdims = Dim{:Ti}([d1 + Day(i * 7) for i in 0:4]);
julia> d = DimArray(rand(UInt16, 2560, 2560, 3, 5), (xdims, ydims, bdims, tdims));
julia> t = DataFrame(d);
julia> DD.guess_dims(t)
↓ X Sampled{Float64} 610000.0:20.0:661180.0 ForwardOrdered Regular Points,
→ Y Sampled{Float64} 6.84142e6:-20.0:6.79024e6 ReverseOrdered Regular Points,
↗ Ti Sampled{Date} Date("2024-11-18"):Day(7):Date("2024-12-16") ForwardOrdered Regular Points,
⬔ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered
It's probably better for everything, except a bit slower. It didn't exist when I first wrote this package and uses of LinRange are just legacy from that.
Looks good, a bunch of minor fixes.
Probably the main question is how Number and DateTime are handled.
Any news on this PR @JoshuaBillson? It would be good to have this available.
Sorry, I've been busy with some other projects. I'll try to resolve your suggested fixes this week. Once that's done, I think we just need to update the docs for
No worries at all, whenever you have time. There are just some nice consequences of having this, like loading GeoJSON to a vector data cube, so I'm keen to have it.
# Write a single column from a table with one or more coordinate columns to a DimArray
function DimArray(table, dims; name=NoName(), selector=Near(), precision=6, missingval=missing, kw...)
    # Confirm that the Tables interface is implemented
    Tables.istable(table) || throw(ArgumentError("`table` must satisfy the `Tables.jl` interface."))
- Tables.istable(table) || throw(ArgumentError("`table` must satisfy the `Tables.jl` interface."))
+ Tables.istable(table) || throw(ArgumentError("`obj` must be an `AbstractArray` or satisfy the `Tables.jl` interface."))
People will also hit this method if they do something weird like pass a non-AbstractArray to DimArray.
Description:
resolves #335
This PR aims to let users construct either a DimStack or DimArray from a table with one or more coordinate columns. Unlike the existing constructor, rows may be out of order or even missing altogether.
Performance:
The algorithm is O(n), requiring two forward passes for each dimension to determine the correct order of rows.
1000x1000: 0.005 Seconds
2000x2000: 0.025 Seconds
4000x4000: 0.108 Seconds
8000x8000: 0.376 Seconds
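To illustrate why rows can be out of order or absent, here is a rough sketch of the general idea in plain Julia. It is not the PR's implementation, which resolves coordinates through selectors such as Near or Contains; a Dict lookup is swapped in here, and scatter_rows with all of its arguments is a hypothetical helper.

```julia
# Hypothetical helper: place each table row into a dense array by looking up
# the position of its coordinates, leaving `missing` where no row exists.
function scatter_rows(xs, ys, vals, xlookup, ylookup)
    # One pass per dimension to map each coordinate value to its index.
    xindex = Dict(x => i for (i, x) in enumerate(xlookup))
    yindex = Dict(y => j for (j, y) in enumerate(ylookup))
    # Start from an all-missing array so absent rows stay missing.
    out = Array{Union{Missing,eltype(vals)}}(missing, length(xlookup), length(ylookup))
    # One pass over the rows; their order in the table does not matter.
    for (x, y, v) in zip(xs, ys, vals)
        out[xindex[x], yindex[y]] = v
    end
    return out
end

# Three shuffled rows of a 2x2 grid; the row for (x=1, y=20) is absent.
xs = [2, 1, 2]
ys = [10, 10, 20]
vals = [20.0, 10.0, 40.0]
A = scatter_rows(xs, ys, vals, 1:2, [10, 20])
```

The cell with no matching row stays missing, which mirrors the missing default described for the constructor.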
Example:
Next Steps:
missing
by default).:geometry
column.