This is a proposed v1 spec for loading tables and registering them with Orca. Incorporating this into the templates will allow us to replace datasources.py with a set of yaml settings files, generated from a notebook or GUI.
Table class
Parameters:
table_name
source_type: "csv", "hdf", database sources added as needed
path: local file path
url: remote URL
csv_index_cols: required for CSV sources
csv_settings: dict of additional params to pass to pd.read_csv(), optional
hdf_key: name of the table to read from the HDFStore, if there are multiple
zipped: bool (default False)
path_in_archive: where to find the desired file if there are multiple in the archive, optional
filters: filter expressions applied to the table before registering it with Orca
orca_test_spec: dict of data characteristics to be asserted at runtime
cache: bool, passed to Orca
cache_scope: passed to Orca
copy_col: passed to Orca
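To make the parameter list above concrete, here is a hypothetical sketch of what a single-table yaml file might look like; the exact schema, field names, and values shown are placeholders, not a settled format.

```yaml
# Hypothetical single-table definition using the parameters above.
# Table names, paths, and URLs are made up for illustration.
table_name: households
source_type: csv
path: households.csv  # evaluated relative to the ModelManager data path
url: https://example.com/households.csv  # fallback if the path is missing
csv_index_cols: household_id
csv_settings:
  low_memory: false
filters:
  - income > 0
cache: true
cache_scope: iteration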
Data sources
Here are some use cases we should support:
Data files are stored locally. User provides a path only.
Data files are available from remote URLs. This is useful for demos and for small models whose source data is fairly stable. User provides a url only.
Data files are available from remote URLs, but are also stored locally on each user's machine for faster performance. This is a nice workflow for larger models. ModelManager would first check the path, and then fall back to the url if the data is not available locally. This would simplify the process of deploying models on new machines or running them from JupyterHub containers.
Tables are stored in a database. In this case, I think we should load the tables from the database exclusively. (Or provide local caching that's invisible to the user.) Too hard to maintain parity between a database and arbitrary local data files.
Local data files are stored in different locations on different machines, e.g. local storage vs. a network directory. In this case, I think ModelManager should accept a session-based override of the default data path, similar to how we handle custom config locations.
If you're using a non-standard data directory, you would include this line at the top of a notebook or script. The paths in the yaml files would be evaluated relative to the ModelManager data path.
modelmanager.set_data_path('/data/urbansim_data')
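The path-then-url fallback described above could be sketched roughly as follows. This is only an illustration: the function name, signature, and caching behavior are assumptions, not part of the spec.

```python
import os
from urllib.request import urlretrieve


def resolve_source(path, url, data_path=''):
    """Return a local file path for a table, downloading it if needed.

    Hypothetical sketch of the fallback logic: check the local path
    first (relative to the ModelManager data path), and only fetch
    from the remote url if the file is not available locally.
    """
    local = os.path.join(data_path, path)
    if os.path.exists(local):
        return local
    if url:
        urlretrieve(url, local)  # cache the download for later runs
        return local
    raise FileNotFoundError(
        "table not found at %s and no url provided" % local)
```

A nice property of caching the download to the same local path is that subsequent runs hit the fast branch automatically.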
Data cleaning on the fly: not supported?
Our current data loading scripts often perform some data cleaning as well: renaming columns, dropping or resetting missing values, generating derived variables, etc. There can be good reasons to do this (lets us leave data files in their native format while documenting the steps needed to prepare them for analysis), but overall I don't like it. It's open-ended (hard to incorporate into the templates), can make the model logic less transparent, and can slow down simulations.
What I suggest instead is that we try to maintain a tidy set of data cleaning scripts for each model, and store on disk the exact data that's needed for the simulation.
But if we do want to support performing custom actions on a DataFrame before registering it with Orca, we could do it with named callables. The user could provide these in a file called custom_logic.py (or something), and reference them by name in the template definition. I'm sure there are cases where this flexibility would come in handy.
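One way the named-callable lookup could work is a simple import-and-getattr, sketched below. The module name, function name, and lookup strategy are all placeholders for whatever the template spec settles on.

```python
import importlib


def apply_custom_logic(df, callable_name, module='custom_logic'):
    """Look up a named callable in the user's custom logic module and
    apply it to a table before registering it with Orca.

    Hypothetical sketch: the yaml template would reference the
    callable by name, and this resolves and runs it.
    """
    func = getattr(importlib.import_module(module), callable_name)
    return func(df)
```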
Broadcast object
Parameters passed through to Orca:
cast
onto
cast_on
onto_on
cast_index
onto_index
Are there additional settings we could use to make the broadcasts easier to manage?
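For reference, a broadcast entry in a yaml file might look something like this; the table and column names are made up, and only the parameter names come from the list above.

```yaml
# Hypothetical broadcast definition using the Orca parameters above.
broadcasts:
  - cast: buildings
    onto: households
    cast_index: true
    onto_on: building_id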
YAML files
We can support both (1) yaml files that contain a single table definition and (2) yaml files that contain multiple associated tables (and broadcasts). These would be loaded into a ModelManager Group containing an arbitrary set of objects.
pd.read_csv() has a compression parameter that you can use to automatically read from a zip file -- but it doesn't work if there are multiple files in the archive. If we relied on this we wouldn't need to implement any of our own parameters for unzipping, though, which would be nice.
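For multi-file archives, one alternative is to open the desired member explicitly and hand the file-like object to pd.read_csv(), which accepts any readable stream. A minimal sketch, assuming the helper name and its role in supporting path_in_archive:

```python
import io
import zipfile


def open_in_archive(zip_path, path_in_archive):
    """Open one member of a zip archive as a text file-like object.

    pd.read_csv() can read this directly, which is one way to support
    the path_in_archive parameter when the archive contains multiple
    files (where pandas' own compression='zip' handling fails).
    """
    archive = zipfile.ZipFile(zip_path)
    return io.TextIOWrapper(archive.open(path_in_archive))
```

Usage would be something like `pd.read_csv(open_in_archive('data.zip', 'households.csv'))`.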
Broadcasts
If we implement auto join key detection as described in #78, we don't need templates for broadcasts (except possibly for backward compatibility).
Most of this is implemented in PR #93, with the exception of some of the more complicated cases discussed in "Data sources" above. Further discussion in Issue #94.