Data API design questions #119

@ethanwhite

The package currently consumes two data objects:

  1. a document term table in cross-tab (wide) form, where the time component is implied by row position
  2. a document covariate table in long form, which contains both an explicit time series component and any associated covariates

Assembling these two objects from commonly formatted data requires some potentially fragile work, so I'm wondering if it's worth having a conversation about the data API, at least for the top-level LDA_TS function.

I'm envisioning most users having long data in the general form of year, species, count and year, covariate_value (often with a site variable in both, which could in principle be used for grouping). Using this with the current API requires cross-tabbing the first table, and if the two tables are not sorted the same way, the rows will be misaligned and this will produce the wrong answer (hence my concern about this being a bit fragile).
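To make the concern concrete, here is a minimal base R sketch. The data frames and column names (year, species, count, temp) are made up for illustration and are not taken from the package; the point is only that the cross-tab imposes its own row order, which need not match the covariate table.

```r
# Hypothetical long-format data; all names and values are illustrative only.
counts <- data.frame(
  year    = c(2001, 2001, 2000, 2000),
  species = c("a", "b", "a", "b"),
  count   = c(5, 1, 2, 7)
)
covariates <- data.frame(
  year = c(2001, 2000),   # deliberately not sorted by year
  temp = c(15.2, 14.1)
)

# xtabs() orders rows by the sorted levels of year: 2000, then 2001.
dtt <- xtabs(count ~ year + species, data = counts)

# Row 1 of dtt is year 2000, but row 1 of covariates is year 2001,
# so pairing the two tables by position would mix up the years.
rownames(dtt)
covariates$year

# Aligning explicitly on the key removes the dependence on sort order.
covariates_aligned <- covariates[match(rownames(dtt), covariates$year), ]
```

Nothing in the cross-tab itself flags the mismatch, so the alignment has to be done (or checked) by the user before the two objects are passed on.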

My first thought was that the data should be long in both cases. I can see why this wasn't the initial implementation, because it comes with its own set of issues: using long data would require passing the names of the "words" and "documents" columns so that the LDA step understands what it is supposed to work with. That said, I think this is more robust than assuming that the rows are the documents and the columns are the words, since it's easy enough to mess up the cross-tabbing and get the components switched around. Given that this package is explicitly temporal, we could also use the opportunity to codify that in the API and outputs by passing the timename directly rather than via control = TS_controls_list(timename = "time_column").
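As a rough sketch of what a long-data entry point might do internally, the helper below builds both tables from long inputs using explicitly named columns. The function long_to_tables and its arguments are hypothetical and not part of the package; it uses only base R and is meant as a starting point for discussion, not a proposed implementation.

```r
# Hypothetical helper, not part of the package: builds both tables from long
# data, with the time (document), word, and count columns named explicitly.
long_to_tables <- function(counts, covariates, timename, wordname, countname) {
  # Cross-tabulate: one row per time step (document), one column per word.
  f <- reformulate(c(timename, wordname), response = countname)
  dtt <- as.data.frame.matrix(xtabs(f, data = counts))

  # Match covariate rows to document rows by time value, not by position.
  idx <- match(rownames(dtt), as.character(covariates[[timename]]))
  if (anyNA(idx)) stop("some time steps have no covariate row")

  list(
    document_term_table      = dtt,
    document_covariate_table = covariates[idx, , drop = FALSE]
  )
}

# Illustrative data only.
counts <- data.frame(
  year    = c(2001, 2001, 2000, 2000),
  species = c("a", "b", "a", "b"),
  count   = c(5, 1, 2, 7)
)
covariates <- data.frame(year = c(2001, 2000), temp = c(15.2, 14.1))

out <- long_to_tables(counts, covariates,
                      timename = "year", wordname = "species",
                      countname = "count")
```

Because the column roles are named rather than inferred from layout, rows and columns cannot be swapped by accident, and a missing time step fails loudly instead of shifting rows.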

I definitely don't fully understand the under-the-hood details that might make this change more complicated, and I think this is probably a pretty in-depth discussion, so I'd be happy to set up some time with whoever is interested to talk through the optimal API design (which could end up being what's already here).
