
Spec for loading tables #66

Closed
smmaurer opened this issue Nov 26, 2018 · 2 comments

Comments


smmaurer commented Nov 26, 2018

This is a proposed v1 spec for loading tables and registering them with Orca. Incorporating this into the templates will allow us to replace datasources.py with a set of yaml settings files, generated from a notebook or GUI.

Table class

Parameters:

  • table_name
  • source_type: "csv", "hdf", database sources added as needed
  • path: local file path
  • url: remote URL
  • csv_index_cols: index column(s); required for csv sources
  • csv_settings: dict of additional params to pass to pd.read_csv(), optional
  • hdf_key: name of table to read from the HDFStore, if there are multiple
  • zipped: bool (default False)
  • path_in_archive: where to find the desired file if there are multiple in the archive, optional
  • filters: expressions used to filter the table before registering it with Orca
  • orca_test_spec: dict, assert data characteristics to be tested at runtime
  • cache: bool, passed to Orca
  • cache_scope: passed to Orca
  • copy_col: passed to Orca
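A minimal sketch of what a Table template constructor with these parameters might look like. This is an assumption, not the actual ModelManager implementation; the defaults (and the `__post_init__` validation) are illustrative choices.

```python
# Hypothetical sketch of the proposed Table template's constructor;
# parameter names mirror the spec above, defaults are assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Table:
    table_name: str
    source_type: str                       # 'csv' or 'hdf' for now
    path: Optional[str] = None             # local file path
    url: Optional[str] = None              # remote URL
    csv_index_cols: Optional[list] = None  # required when source_type == 'csv'
    csv_settings: dict = field(default_factory=dict)  # extra pd.read_csv() kwargs
    hdf_key: Optional[str] = None          # table name within the HDFStore
    zipped: bool = False
    path_in_archive: Optional[str] = None  # member name inside a multi-file archive
    filters: Optional[list] = None         # e.g. ["year == 2018"]
    orca_test_spec: Optional[dict] = None  # runtime data assertions
    cache: bool = False                    # passed through to Orca
    cache_scope: str = 'forever'           # passed through to Orca
    copy_col: bool = True                  # passed through to Orca

    def __post_init__(self):
        # Enforce the "required for csv" rule from the spec
        if self.source_type == 'csv' and self.csv_index_cols is None:
            raise ValueError('csv_index_cols is required for csv sources')

t = Table(table_name='jobs', source_type='csv',
          path='jobs.csv', csv_index_cols=['job_id'])
```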

Data sources

Here are some use cases we should support:

  1. Data files are stored locally. User provides a path only.

  2. Data files are available from remote URLs. This is useful for demos and for small models whose source data is fairly stable. User provides a url only.

  3. Data files are available from remote URLs, but are also stored locally on each user's machine for faster performance. This is a nice workflow for larger models. ModelManager would first check the path, and then fall back to the url if the data is not available locally. This would simplify the process of deploying models on new machines or running them from JupyterHub containers.

  4. Tables are stored in a database. In this case, I think we should load the tables from the database exclusively. (Or provide local caching that's invisible to the user.) It's too hard to maintain parity between a database and arbitrary local data files.

  5. Local data files are stored in different locations on different machines, e.g. local storage vs. a network directory. In this case, I think ModelManager should accept a session-based override of the default data path, similar to how we handle custom config locations.

    If you're using a non-standard data directory, you would include this line at the top of a notebook or script. The paths in the yaml files would be evaluated relative to the ModelManager data path.

    modelmanager.set_data_path('/data/urbansim_data')
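The path-then-URL fallback in use case 3 could be sketched like this. `resolve_source` is a hypothetical helper, not part of the actual ModelManager API:

```python
# Sketch of use case 3: prefer the local path, fall back to the remote URL.
import os

def resolve_source(path=None, url=None):
    """Return the location the data should be read from, preferring local files."""
    if path is not None and os.path.exists(path):
        return path
    if url is not None:
        return url
    raise FileNotFoundError('no local file found and no url provided')

# With no local copy on disk, the remote url wins
src = resolve_source(path='missing/jobs.csv',
                     url='https://example.com/jobs.csv')
```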

Data cleaning on the fly: not supported?

Our current data loading scripts often perform some data cleaning as well: renaming columns, dropping or resetting missing values, generating derived variables, etc. There can be good reasons to do this (lets us leave data files in their native format while documenting the steps needed to prepare them for analysis), but overall I don't like it. It's open-ended (hard to incorporate into the templates), can make the model logic less transparent, and can slow down simulations.

What I suggest instead is that we maintain a tidy set of data cleaning scripts for each model, and store on disk the exact data that's needed for the simulation.

But if we do want to support performing custom actions on a DataFrame before registering it with Orca, we could do it with named callables. The user could provide these in a file called custom_logic.py (or something), and reference them by name in the template definition. I'm sure there are cases where this flexibility would come in handy.
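The named-callable idea could work roughly like this. The dict registry and the `apply_named_callable` helper are assumptions; a real implementation might instead look functions up with `getattr()` on an imported `custom_logic` module.

```python
# Sketch: cleaning functions live in a user-maintained module and are
# referenced by name in the template definition.
import pandas as pd

def rename_job_columns(df):
    """Example cleaning step: rename a column to the model's convention."""
    return df.rename(columns={'sector': 'sector_id'})

# Stand-in for the namespace of a user-provided custom_logic.py
CUSTOM_LOGIC = {'rename_job_columns': rename_job_columns}

def apply_named_callable(df, name):
    """Apply the cleaning function registered under `name` before the
    table is registered with Orca."""
    return CUSTOM_LOGIC[name](df)

raw = pd.DataFrame({'sector': [1, 2, 3]})
cleaned = apply_named_callable(raw, 'rename_job_columns')
```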

Broadcast object

Parameters passed through to Orca:

  • cast
  • onto
  • cast_on
  • onto_on
  • cast_index
  • onto_index

Are there additional settings we could use to make the broadcasts easier to manage?
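For reference, the parameters above correspond to the join a broadcast produces when Orca merges tables; roughly, `cast` is broadcast onto `onto`, joined on `cast_on`/`onto_on` columns or on indexes when `cast_index`/`onto_index` are set. A rough pandas equivalent (the table contents are made up for illustration):

```python
# Rough equivalent of a broadcast like:
#   broadcast(cast='zones', onto='buildings', cast_index=True, onto_on='zone_id')
import pandas as pd

buildings = pd.DataFrame({'zone_id': [1, 1, 2]},
                         index=pd.Index([10, 11, 12], name='building_id'))
zones = pd.DataFrame({'area': [9.5, 3.2]},
                     index=pd.Index([1, 2], name='zone_id'))

# Each building row picks up its zone's columns
merged = buildings.merge(zones, left_on='zone_id', right_index=True)
```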

YAML files

modelmanager_version: 0.1.dev19

saved_object:
    template: Table
    table_name: jobs
    source_type: csv
    path: jobs.csv
    <etc>

We can support both (1) yaml files that contain a single table definition and (2) yaml files that contain multiple associated tables (and broadcasts). These would be loaded into a ModelManager Group containing an arbitrary set of objects.
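Once parsed, a yaml definition like the one above is just a nested dict, so rebuilding the saved object means dispatching on the `template` key. The dict below is what a YAML parser would return for the example; the `template_class` helper is an assumption, not actual ModelManager code:

```python
# Parsed form of the example yaml definition above
SAVED = {
    'modelmanager_version': '0.1.dev19',
    'saved_object': {
        'template': 'Table',
        'table_name': 'jobs',
        'source_type': 'csv',
        'path': 'jobs.csv',
    },
}

def template_class(parsed):
    """Return the template name a saved object should be rebuilt with."""
    return parsed['saved_object']['template']

name = template_class(SAVED)
```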

@smmaurer

Some more info:

Reading compressed files

pd.read_csv() has a compression parameter that can automatically read from a zip file -- but it doesn't work if there are multiple files in the archive. If we could rely on it, we wouldn't need to implement any unzipping parameters of our own, which would be nice.
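For the multi-file case, handling the archive ourselves is only a few lines. This is a sketch, assuming a `path_in_archive` parameter as proposed above; the `read_csv_from_zip` helper name is made up:

```python
# Sketch: read one named CSV member out of a zip archive that may
# contain several files (where compression='zip' would raise an error).
import io
import zipfile
import pandas as pd

def read_csv_from_zip(zip_path_or_buffer, path_in_archive, **csv_settings):
    """Open the archive and read only the requested member with pd.read_csv()."""
    with zipfile.ZipFile(zip_path_or_buffer) as archive:
        with archive.open(path_in_archive) as f:
            return pd.read_csv(f, **csv_settings)

# Demo with an in-memory archive containing two members
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('jobs.csv', 'job_id,sector\n1,retail\n2,office\n')
    z.writestr('readme.txt', 'not a table')
df = read_csv_from_zip(buf, 'jobs.csv', index_col='job_id')
```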

Broadcasts

If we implement auto join key detection as described in #78, we don't need templates for broadcasts (except possibly for backward compatibility).

@smmaurer

Most of this is implemented in PR #93, with the exception of some of the more complicated cases discussed in "Data sources" above. Further discussion in Issue #94.
