Temporal dimension in calc_<> functions #112

Open
eva0marques opened this issue Jul 26, 2024 · 9 comments

@eva0marques
Collaborator

eva0marques commented Jul 26, 2024

I am writing process and calc functions for other covariates that I need in my own project, and I would like to open a discussion on the spatio-temporal case.

Say I want to build an AI model that predicts temperature at several locations x timestamps. I need to extract spatial covariates (easy) but also spatio-temporal ones.

In my ideal world, to do so:

  1. I create a SpatVector, data.frame, or sf/sftime object with both geometry and time columns and pass it as the locs parameter.
  2. I use calc_ functions to add a column for each covariate (spatial or spatio-temporal). The calc_ functions for spatio-temporal covariates handle the time dimension according to the user's criteria. For example, if geophysical model outputs are available every 3 days and my predictions are daily, calc_ downscales the covariate to daily resolution. It can also do the opposite (aggregate) if the covariate is hourly.

It would look like this:

my_spacetime_sample |>
  calc_era5() |>
  calc_nlcd() |>
  calc_gmted() |>
  ...
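
For the hourly case mentioned in step 2, a minimal sketch of the aggregation that a calc_ function might perform internally could look like the following. The data frame, its column names, and its values are made up for illustration; this is base R only and not amadeus code.

```r
# Illustrative hourly covariate series at a single location (made-up values)
hourly <- data.frame(
  time = seq(as.POSIXct("2024-07-01 00:00", tz = "UTC"),
             by = "hour", length.out = 48),
  temp = rep(c(10, 20), each = 24)
)

# Derive the calendar day of each hourly record
hourly$date <- as.Date(hourly$time, tz = "UTC")

# Aggregate hourly values to one daily mean per date
daily <- aggregate(temp ~ date, data = hourly, FUN = mean)
print(daily)
```

Swapping `FUN = mean` for `median`, `min`, or `max` would give the other summary preferences discussed in this thread.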

For now, calc_ functions are not optimally designed for the temporal dimension. They assume that locs is a spatial data frame without a time column. When calculating spatio-temporal covariates, they extract the entire time series available in from. But if locs already has a time column (e.g., one created while calculating another spatio-temporal covariate), it becomes a mess.

In summary, I see the following limitations in the current version of calc_:

  • we cannot chain several calc_ functions (i.e., pass the output of one calc_ function as the input of another) once spatio-temporal covariates are involved
  • unlike the spatial dimension, the temporal dimension cannot be fine-tuned during extraction
  • the user still has a lot of work to do to merge all covariates into a single spatio-temporal table, especially when the covariates are not indexed in time in the same way.

It is not urgent of course, but I think it would be interesting to address this discussion in the future for a better use of amadeus.

@eva0marques
Collaborator Author

eva0marques commented Jul 26, 2024

My suggestions to improve this situation:

  • add a time_column parameter to the calc_ functions for spatio-temporal covariates (narr, geos, hms, gridmet, terraclimate). It would be a character string naming the time column in locs.

  • check that time_column exists in locs (I would also rename locs to sample or points, or something more general than an explicitly spatial name) and that its format is correct (e.g., POSIXct with date and time)

  • add a parameter for the time extraction preference (nearest, downscale, mean, median, precedent, following...)

  • create a function that extracts values at each timestamp using the chosen method:

    1. extract the full time series at each loc
    2. create and use a function find_time(time_pts, time_cov, method)
    3. for each loc x time, extract the covariate value at the matched covariate date.
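
To make step 2 concrete, here is a rough sketch of what find_time could look like. Only the name and the three arguments come from the proposal above; the body, the return value (an index into time_cov), and the restriction to the nearest, precedent, and following methods are assumptions for illustration, not existing amadeus code.

```r
# Hypothetical helper: for each sample timestamp, return the index of the
# covariate timestamp selected by `method` (NA when no such timestamp exists).
find_time <- function(time_pts, time_cov,
                      method = c("nearest", "precedent", "following")) {
  method <- match.arg(method)
  stopifnot(inherits(time_pts, "POSIXct"), inherits(time_cov, "POSIXct"))
  stopifnot(!is.unsorted(time_cov))  # covariate times must be sorted
  pts <- as.numeric(time_pts)
  cov <- as.numeric(time_cov)
  # Index of the last covariate time <= each sample time (0 if none)
  before <- findInterval(pts, cov)
  switch(method,
    precedent = ifelse(before == 0L, NA_integer_, before),
    following = {
      # An exact match counts as "following"; otherwise take the next index
      exact <- before > 0L & cov[pmax(before, 1L)] == pts
      after <- ifelse(exact, before, before + 1L)
      ifelse(after > length(cov), NA_integer_, after)
    },
    # Ties go to the earlier covariate time because which.min returns the first
    nearest = vapply(pts, function(p) which.min(abs(cov - p)), integer(1))
  )
}

# Covariate available every 3 days, samples on arbitrary days
time_cov <- as.POSIXct(c("2024-01-01", "2024-01-04", "2024-01-07"), tz = "UTC")
time_pts <- as.POSIXct(c("2024-01-02", "2024-01-04", "2024-01-06"), tz = "UTC")
idx <- find_time(time_pts, time_cov, method = "nearest")
```

Step 3 then reduces to indexing the extracted series at each loc with idx. Summary methods such as mean or median would need a window rather than a single index, so they are left out of this sketch.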

@mitchellmanware
Collaborator

  1. Temporal summaries + download inputs
  2. Data frame "inflation" for static spatial variables
  3. Synchronize calc_* functions so that the output of calc_1 is used as locs in calc_2:

calc_1() |>
  calc_2() |>
  calc_3()

@eva0marques
Collaborator Author

eva0marques commented Aug 1, 2024

In calc_ pipes, it would be easier to distinguish spatio-temporal points from spatial points 🤔 (and possibly include an inflate function to go from the spatial pipe to the spatio-temporal one):

If the goal is to create a data table to feed AI models:

my_spatial_sample |>
  calc_nlcd() |>
  calc_gmted() |>
  ... |>
  inflate_to_spatiotemporal(timestamps) |>
  calc_era5() |>
  calc_modis() |>
  ...

If the goal is to store the calculated points efficiently:

my_spatial_sample |>
  calc_nlcd() |>
  calc_gmted() |>
  ... |>
  saveRDS()

my_spatiotemporal_sample |>
  calc_era5() |>
  calc_modis() |>
  ... |>
  saveRDS()

@mitchellmanware
Collaborator

I think one option is to update the static calc functions to have an inflate parameter. If inflate = TRUE, the function automatically returns a spatio-temporal data frame (the "feed AI models" example); if inflate = FALSE, it returns a list with a vector of dates and a single spatial data frame (the "efficiency" example).

Either way, refactoring the calc_ functions to retain the columns from locs for use in a pipe should not be too difficult.

@mitchellmanware
Collaborator

mitchellmanware commented Aug 1, 2024

Something like this

if (inflate) {
  message("Returning a data.frame with ... because inflate = TRUE")
  # Cross every date with every row of the spatial data frame
  inflated <- merge(dates, data.frame, all = TRUE)
  return(inflated)
} else {
  message("Returning a list with ... because inflate = FALSE")
  # Keep the dates and the single spatial data frame separate
  return(list(dates, data.frame))
}

@eva0marques
Collaborator Author

eva0marques commented Aug 1, 2024

Yes, that is also an interesting solution. I would still make the inflate() function available to amadeus users because they might be interested in using it separately. For example, you store the non-inflated sample, reopen it, and use the inflate function without recalculating everything.
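
A minimal sketch of that store, reopen, and inflate workflow is below. The function name inflate_to_spatiotemporal is the one floated earlier in this thread, but the body, the time_column argument, and the toy data are assumptions; it handles a plain data.frame only, not SpatVector or sf objects.

```r
# Hypothetical helper (not an amadeus function): cross every row of a
# static covariate table with every timestamp.
inflate_to_spatiotemporal <- function(locs, timestamps, time_column = "time") {
  stopifnot(is.data.frame(locs), !time_column %in% names(locs))
  times <- data.frame(timestamps)
  names(times) <- time_column
  # by = NULL asks merge() for the Cartesian product of the two tables
  merge(locs, times, by = NULL)
}

# Made-up static covariates for two sites
static <- data.frame(site_id = c("a", "b"), elevation = c(120, 340))

# Store the compact, non-inflated table
path <- tempfile(fileext = ".rds")
saveRDS(static, path)

# Later: reopen and inflate without recalculating any covariate
days <- as.Date("2024-08-01") + 0:2
inflated <- inflate_to_spatiotemporal(readRDS(path), days)
```

With 2 sites and 3 dates, the inflated table has 6 rows, while the file on disk holds only the 2 static rows.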

@sigmafelix
Collaborator

@eva0marques

Sorry I am late to the discussion. As @mitchellmanware suggested, I think a hands-on solution is to add several lines to calc_return_locs, along with an inflate argument. One thing to consider is how the "full" space-time combinations are inferred or furnished. This can be implemented either by using a fixed set of field names (i.e., lon, lat, and time) or by adding an argument for a full space-time combination template (built with expand.grid, for example). I think the former is the more hands-on solution, since we can easily use set operations to detect the common field names and determine what to join and what to expand. I have already added some functions to do this in beethoven, so I'd be happy to make changes to whichever functions we agree to update to implement this functionality.
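
As a rough illustration of the fixed-field-name option, set operations on the column names can decide what to join on and what to expand along. The tables and values below are made up, and this is a base R sketch rather than the beethoven implementation.

```r
# Field names assumed to be fixed by convention
key_fields <- c("lon", "lat", "time")

# Made-up static (spatial) and dynamic (spatio-temporal) covariate tables
static <- data.frame(
  lon = c(-78.9, -79.0), lat = c(35.9, 36.0), elev = c(120, 95)
)
dynamic <- data.frame(
  lon = rep(c(-78.9, -79.0), each = 2),
  lat = rep(c(35.9, 36.0), each = 2),
  time = rep(as.Date(c("2024-08-01", "2024-08-02")), times = 2),
  temp = c(25.1, 26.3, 24.8, 25.9)
)

# Key fields shared by both tables determine the join columns
join_by <- Reduce(intersect, list(key_fields, names(static), names(dynamic)))

# Key fields present only in the dynamic table are the dimensions
# the static table gets expanded along
expand_by <- setdiff(intersect(key_fields, names(dynamic)), names(static))

# Joining on the shared keys repeats each static row once per time step
combined <- merge(static, dynamic, by = join_by)
```

Here join_by is lon and lat, expand_by is time, and combined has one row per location and date.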

@sigmafelix
Collaborator

As a side note, if we are aiming to make calc_* functions pipeable, the default value of inflate (or the equivalent argument) should be TRUE.

@eva0marques
Collaborator Author

I've implemented my idea (see my comment above) in my own project because it was the most optimized and flexible setup. It works pretty well, and I can share my feedback if you are interested.

@kyle-messier kyle-messier mentioned this issue Sep 22, 2024