- GeNSIT: Generating Neural Spatial Interaction Tables
- Introduction
- Installation
- Inputs
- Problem setup
- Functionality
- Conclusion
- Acknowledgments
Quick Start: We recommended going through sections on Installation and Run if you wish to run
GeNSIT
using default settings.
High-resolution complex simulators such as agent-based models (ABMs) are increasingly deployed to assist policymaking in transportation , social sciences, and epidemiology. They simulate individual agent interactions governed by stochastic dynamic systems, giving rise to an aggregate, in a mean field sense, continuous emergent structure. This is achieved by computationally expensive forward simulations, which hinders ABM parameter calibration and large-scale testing of multiple policy scenarios. Considering ABMs for the COVID-19 pandemic as an example, the continuous mean field process corresponds to the spatial intensity of the infections which is noisily observed at some spatial aggregation level, while the individual and discrete human contact interactions that give rise to that intensity are at best partially observed or fully latent. In transportation and mobility, running examples in this work, the continuous mean field process corresponds to the spatial intensity of trips arising from unobserved individual agent trips between discrete sets of origin and destination locations.
The formal object of interest that describes the discrete count of these spatial interactions, e.g. agent trips between locations, is the origin-destination matrix (ODM). It is an
This repository introduces a computational framework named GeNSIT
see for exploring the constrained discrete origin-destination matrices of agent trip location choices using closed-form or Gibbs Markov Basis sampling. The underlying continuous choice probability or intensity function (unnormalised probability function) is modelled by total and singly constrained spatial interaction models (SIMs) or gravity models embedded in the well-known Harris Wilson stochastic differential equations (SDEs). We employ Neural Networks to calibrate the SIM parameters. We include Markov Chain Monte Carlo (MCMC) schemes leveraged to learn the SIM parameters in previous works. For more details on the mathematical aspects of this repository please look at the Publications section.
Zachos, Ioannis, Theodoros Damoulas, et al. ‘Table Inference for Combinatorial Origin-Destination Choices in Agent-Based Population Synthesis’. Stat, vol. 13, no. 1, 2024, p. e656, https://doi.org/10.1002/sta4.656.
Zachos, Ioannis, Mark Girolami, et al. Generating Origin-Destination Matrices in Neural Spatial Interaction Models. no. arXiv:2410.07352, arXiv, Oct. 2024, https://doi.org/10.48550/arXiv.2410.07352. arXiv.
Assuming Python >=3.9.7 and git are installed, clone this repository by running
git clone git@github.com:[REPONAME]/GeNSIT.git
Once available locally, navigate to the main folder as follows:
cd GeNSIT
Tip: We recommended running
GeNSIT
on aDocker
container if you do not plan to make any code changes.
This section assumes Docker
has been installed on your machine. Please follow this guide if you wish to install Docker
.
Build the docker image image
docker build -t "gensit" .
Once installed, make sure everything is working by running
docker run gensit --help
This section assumes anaconda
or miniconda
has been installed on your machine. Please follow this or this guide if you wish to install either of them. Then, run:
conda create -y -n gensit python=3.9.7
conda activate gensit
conda install -y -c conda-forge --file requirements.txt
conda install -y conda-build
python3 setup.py develop
Otherwise, make sure you install the gensit
command line tool and its dependencies by running
pip3 install -e .
You can ensure that the dependencies have been successfully installed by running:
gensit --help
You should get a print statement like this:
Usage: gensit [OPTIONS] COMMAND [ARGS]...
Command line tool for Generating Neural Spatial Interaction Tables (origin-
destination matrices)
Options:
--help Show this message and exit.
Commands:
create Create synthetic data for spatial interaction table and...
plot Plot experimental outputs.
reproduce Reproduce figures in the paper.
run Sample discrete spatial interaction tables...
summarise Create tabular summary of metadata, metrics computed for...
Throughout the remainder of this readme we illustrate GeNSIT's
command line tool capabilities assuming that a docker
container has been installed.
Inputs to GeNSIT
are data and configuration files.
The minimum data requirements include:
- A set of origin and destination locations between which agents travel.
- A cost matrix
$\mathbf{C}$ reflecting inconvenience of travel from any origin to any destination. This can be distance and/or time dependent (e.g. Euclidean distance and/or travel times). - A measure of destination attractiveness
$\mathbf{z}$ . This depends on the types of trips agents make e.g. for work trips this would be number of jobs available at each destination. - The total number of agents/trips
$M$ . Each agent performs exactly one trip.
Optional datasets may be:
- Origin and/or destination demand.
- Partially observed trips between selected origin-destination pairs.
- Total distance and/or time agents have travelled by origin and/or destination location.
- A transportation network/graph.
- A ground truth agent trip table to validate your model.
We consider agent trips from residence to workplace locations in Cambridge, UK. We use the following datasets from the Census 2011 data provided by the Office of National Statistics:
- Lower super output areas (LSOAs), Middle super output areas (MSOAs) as origin, destination locations, respectively.
- Average shortest path in a transportation network between a random sample of 20 residences inside each LSOA and 20 workplaces inside each MSOA as a cost matrix.
- Number of jobs available at each MSOA as a destination attraction proxy used in the NN's loss function.
- Total distance travelled to work from each LSOA as an input to the NN's loss function.
- Ground truth agent trip table a validation dataset. Parts of this table such as origin/destination demand (row/colsums) and a random subset of trips (cells) are also conditioned upon acting as table constraint data.
We note the transportation network as well as the residence and workplace locations were extracted using Arup's genet
and osmox
, respectively. The geo-referenced map used as an input to these tools was downloaded from Open Street Maps.
Alternatively, synthetic data may be generated by running commands such as:
docker run gensit create ./data/inputs/configs/generic/synthetic_data_generation.toml \
-dim origin 100 -dim destination 30 -dim time 1 \
-sigma 0.0141421356 --synthesis_n_samples 1000 --synthesis_method sde_solver
The command above creates synthetic data based on the requirements in the section above using origin
and destination
aritificial locations (in our case 100 origins, 30 destinations). A cost matrix is randomly generated for every OD pair. Destination attraction data is generated by running the synthesis_method
for synthesis_n_samples
steps (in our case by running the Harris Wilson SDE solver for 1000 steps).
You noticed that we load a configuration file named synthetic_data_generation.toml
to achieve all this. We elaborate on the use of configs in the next section.
Configuration files contain all settings (key-value pairs) required to run
NN-based or MCMC-based algorithms for learning the discrete origin-destination table and/or underlying continuous SIM parameters. They are stored in a toml
format.
Each type of algorithm is associated with an experiment
type . We hereby refer to the process of running one algorirthm for a given set of configuration parameters as run
ning an experiment
. Examples of experiments include SIM_NN
, SIM_MCMC
, JointTableSIM_MCMC
, DisjointTableSIM_NN
, and JointTableSIM_NN
.
Most configuration keys can be sweep
ed for each type of experiment
being run. This means that a range of values over which the experiment will be run can be provided. For example, the sigma
parameter below
[harris_wilson_model.parameters.sigma.sweep]
default = 0.0141421356
range = [0.0141421356, 0.1414213562, nan]
means that each experiment in the experiments
section will run with sigma
= 0.0141421356 and sigma
= 0.141421356. A sweep
is therefore one run of an experiment
over a unique set of config values. Sweeps can be either isolated or coupled. The above example constitutes an isolated sweep. A coupled sweep is shown below:
[harris_wilson_model.parameters.sigma.sweep]
default = 0.0141421356
range = [0.0141421356, 0.1414213562, nan]
[training.to_learn.sweep]
default = ['alpha', 'beta']
range = [['alpha', 'beta'],['alpha', 'beta'],['alpha', 'beta', 'sigma']]
coupled = true
target_name = 'sigma'
Here sigma
is coupled with the to_learn
parameter, meaning the vary together. In this case each experiment will be run for three different sweep settings: (sigma = 0.0141421356
, to_learn
= ['alpha','beta']), (sigma
= 0.1414213562, to_learn
= ['alpha','beta']), and (sigma
= nan, to_learn
= ['alpha','beta','sigma']). We note that more than one sweep keys can be coupled.
Note: More information on each key-value pair found in Configs can be found here.
Consider
while the working population at each destination (column sums) is
We assume that the total origin and destination demand are both conserved:
The demand for destination zones depends on the destination's attractiveness denoted by
where
where the multipliers
Spatial interaction models are connected to physics models through the destination attractiveness term
where
We recommend you look at relevant publications for more information on the Harris Wilson model. Our first goal is to learn the parameters
We note that the discrete number of agents traveling to work is represented by
Although
The GeNSIT
package provides functionality for five different operations: create
, run
, plot
, reproduce
, summarise
.
⚠️ WARNING: Python tests have not been updated yet!
This command runs experiment
s using Markov Chain Monte Carlo and/or Neural Networks based on a Config
file. For example, we can run joint table and intensity inference using the following command
docker run gensit run ./data/inputs/configs/generic/joint_table_sim_inference.toml \
-et JointTableSIM_NN -nw 6 -nt 3
This config runs a JointTableSIM_NN
experiment using 6 number of workers and 3 number of threads per worker. A list of experiments and the types of algorithms they use to learn
Experiment | ||
---|---|---|
SIM_MCMC |
- | MCMC |
JointTableSIM_MCMC |
MCMC | MCMC |
SIM_NN |
- | NN |
DisjointTableSIM_NN |
MCMC | NN |
JointTableSIM_NN |
MCMC | NN |
The run
command can also be programmatically executed using the notebook Example 1 - Running experiments.
Once an experiment has been completed, we can use the following command to plot its data:
docker run gensit plot [PLOT_VIEW] [PLOT_TYPE] -x [X_DATA] -y [Y_DATA]
where PLOT_VIEW
defines the type of view the data should be shown. Views can be simple, tabular or spatial. PLOT_TYPE
can be either line or scatter. The X_DATA
or Y_DATA
are provided as names of experiment outputs or their evaluated expressions (see Config settings).
For example, the code below plots the log destination attraction predictions (x-axis) against the observed data (y-axis) for experiments JointTableSIM_MCMC
,JointTableSIM_NN
,NonJointTableSIM_NN
.
docker run gensit plot simple scatter \
-y log_destination_attraction_data -x mean_log_destination_attraction_predictions \
-dn cambridge_work_commuter_lsoas_to_msoas/exp1 \
-et JointTableSIM_MCMC -et JointTableSIM_NN -et NonJointTableSIM_NN \
-el np -el xr -el MathUtils \
-e mean_log_destination_attraction_predictions "signed_mean_func(log_destination_attraction,sign,dim=['id']).squeeze('time')" \
-e mean_log_destination_attraction_predictions "log_destination_attraction.mean('id').squeeze('time')" \
-e log_destination_attraction_data "np.log(destination_attraction_ts).squeeze('time')" \
-ea log_destination_attraction -ea sign \
-ea "destination_attraction_ts=outputs.inputs.data.destination_attraction_ts" \
-ea "signed_mean_func=MathUtils.signed_mean" \
-k sigma -k title \
-cs "da.loss_name.isin([str(['dest_attraction_ts_likelihood_loss']),str(['dest_attraction_ts_likelihood_loss', 'table_likelihood_loss'])])" \
-cs "~da.title.isin(['_unconstrained','_total_constrained','_total_intensity_row_table_constrained'])" \
-c title -op 1.0 -mrkr sigma -l title -l sigma -msz 20 \
-ft 'predictions_figure/destination_attraction_predictions_vs_observations' \
-xlab '$\mathbb{E}\left[\mathbf{x}^{(1:N)}\right]$' \
-ylab '$\mathbf{y}$'
The -e
,-ea
,-el
arguments define the evaluated expressions, the keyword arguments used as input to these expressions (also evaluated) and the necessary libraries that are used to perform the operations, respectively. The evaluation is performed using Python's eval
function. The first two types of argument allow for reading input and/or output data directly. For example, -ea "destination_attraction_ts=outputs.inputs.data.destination_attraction_ts"
loads the input (observed) destination attraction time series data while -ea log_destination_attraction -ea sign
loads log_destination_attraction
and sign
output datasets.
The output data is sliced using the coordinate values specified by the -cs
arguments. For instance, -cs "~da.title.isin(['_unconstrained','_total_constrained','_total_intensity_row_table_constrained'])"
only keeps the datasets whose title
variable is equal to any of the specified values. The sweep
data are gathered either from the output dataset itself or from the output config file (in this case we elicit sigma
,title
sweep
variables).
The scatter plot is colored by the title
variable and its markers are determined by the sigma
variable. Both of these variables are contained in each sweep
that was run. The exact mappings from say sigma values to marker types are contained in this file. Each point is labeled by both the title
and sigma
values. The resulting figure is shown below.
This command summarised the output data and creates a csv
file with each data summary from every sweep
. For example, if we wish to compute the Standardised Root Mean Square Error (SRMSE) for JointTableSIM_NN
we run
docker run gensit summarise \
-dn cambridge_work_commuter_lsoas_to_msoas/exp1 \
-et JointTableSIM_NN \
-el np -el MathUtils -el xr \
-e table_srmse "srmse_func(prediction=mean_table,ground_truth=ground_truth)" \
-e intensity_srmse "srmse_func(prediction=mean_intensity,ground_truth=ground_truth)" \
-ea table -ea intensity -ea sign \
-ea "srmse_func=MathUtils.srmse" \
-ea "signed_mean_func=MathUtils.signed_mean" \
-ea "ground_truth=outputs.inputs.data.ground_truth_table" \
-ea "mean_table=table.mean(['id'])" \
-ea "mean_intensity=signed_mean_func(intensity,'intensity','signedmean',dim=['id'])" \
-ea "mean_intensity=intensity.mean(['id'])" \
-cs "da.loss_name.isin([str(['dest_attraction_ts_likelihood_loss']),str(['dest_attraction_ts_likelihood_loss', 'table_likelihood_loss']),str(['table_likelihood_loss'])])" \
-btt 'iter' 10000 90 1000 \
-k sigma -k type -k name -k title -fe SRMSEs -nw 20
The arguments are similar to the plot
command. Here we also use -btt
refered to as burning, thinning and trimming to slice the iter
coordinate values based on their index. In this occasion, we discard the first 10000 samples and then only keep every 90th sample. Finally, we trim this data array to 1000 elements. A small part of the summarised table is shown below.
type | sigma | title | name | proposal | intensity_srmse | table_srmse |
---|---|---|---|---|---|---|
JointTableSIM_NN | 0.141 | doubly_20%_cell_constrained | TotallyConstrained | degree_higher | [1.98] | [0.38] |
JointTableSIM_NN | 0.0141 | unconstrained | TotallyConstrained | direct_sampling | [29.51] | [1.73] |
JointTableSIM_NN | 0.141 | unconstrained | TotallyConstrained | direct_sampling | [29.51] | [1.73] |
JointTableSIM_NN | 0.141 | doubly_constrained | TotallyConstrained | degree_higher | [2.02] | [0.46] |
JointTableSIM_NN | 0.0141 | doubly_10%_cell_constrained | TotallyConstrained | degree_higher | [0.89] | [0.42] |
JointTableSIM_NN | 0.141 | total_intensity_row_table_constrained | TotallyConstrained | direct_sampling | [5.93] | [2.16] |
JointTableSIM_NN | 0.0141 | doubly_20%_cell_constrained | TotallyConstrained | degree_higher | [0.94] | [0.38] |
JointTableSIM_NN | 0.141 | doubly_constrained | TotallyConstrained | degree_higher | [0.68] | [0.55] |
JointTableSIM_NN | 0.141 | unconstrained | TotallyConstrained | direct_sampling | [29.51] | [1.73] |
Processing experimental outputs for uses similar to the ones provided by plot
and summarise
commands can also be achieved by following the steps of notebook Example 2 - Reading outputs.
Finally, this command is run to reproduce the figures appearing in the papers. The commands are self-explanatory:
docker run gensit reproduce figure1;
docker run gensit reproduce figure2;
docker run gensit reproduce figure3;
docker run gensit reproduce figure4;
We have introduced GeNSIT
, an efficient framework for sampling jointly the discrete combinatorial space of agent trips (
Thank you for visiting our GitHub repository! We're thrilled to have you here. If you find our project useful or interesting, please consider showing your support by starring the repository and forking it to explore its features and contribute to its development. Your support means a lot to us and helps us grow the community around this project. If you have any questions or feedback, feel free to open an issue or reach out to us.