Trajectory representation learning for similarity search of AIS data
MoCo-AIS is a contrastive learning framework designed to produce discriminative, embedding-based representations of vessel trajectories for large-scale similarity search. The framework integrates domain-specific trajectory augmentations, dual-stream encoders, and momentum contrastive learning to capture both geometric and semantic structure in AIS data.
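The core of momentum contrastive learning is that the key encoder is never updated by backpropagation; it tracks the query encoder as an exponential moving average. A minimal PyTorch sketch of that update (illustrative only; see `model/moco.py` for the actual implementation):

```python
import copy

import torch
import torch.nn as nn

def momentum_update(query_encoder: nn.Module, key_encoder: nn.Module, m: float = 0.999) -> None:
    """EMA update: key <- m * key + (1 - m) * query, parameter by parameter."""
    with torch.no_grad():
        for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
            k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

# Usage: the key encoder starts as a copy of the query encoder,
# then drifts slowly toward it after every training step.
query_enc = nn.Linear(4, 8)
key_enc = copy.deepcopy(query_enc)
momentum_update(query_enc, key_enc, m=0.99)
```

With a large momentum (e.g. 0.999), the key encoder changes slowly, which keeps the negatives in the contrastive queue consistent across batches.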
```
moco-ais/
│
├── model/                    # Encoder, projection head, and model components
│   ├── encoder.py
│   └── moco.py
│
├── utils/                    # Shared utility functions (I/O, spatial tools, logging)
│   ├── utils.py
│   └── tool_funcs.py
│
├── base/                     # Baselines, including t2vec and TrajCL
│   ├── trajcl.py
│   ├── trajcl_utils.py
│   ├── config_trajcl.py
│   ├── t2vec.py
│   ├── t2vec_loss.py
│   └── similarity_metrics.py # Distance-based methods: Hausdorff and DTW
│
├── img/                      # Figures for README or paper
├── grid/                     # H3/grid tokenizer and spatial indexing utilities for the TrajCL baseline
│
├── preprocessing.ipynb       # Raw AIS → cleaned trajectories
├── preprocessing2.ipynb      # Additional preprocessing utilities / region-specific operations
│
├── config.py                 # Global configuration and hyperparameter definitions
├── data_loader.py            # Dataset loading, padding masks, batching logic
├── train.py                  # Main MoCo-AIS training script
├── test_.py                  # Test script: embeddings and distance matrices
│
├── evaluate.py               # Retrieval evaluation: Recall@K, ranking, metrics
├── compute_hit_rate.py       # Hit rate for top-K retrieval experiments
│
├── compute_dist_mat.py       # Precompute distance matrices with DTW and Hausdorff
├── t2vec_pipeline.ipynb      # t2vec baseline pipeline for comparison
├── trajcl_pipeline.ipynb     # TrajCL baseline pipeline for comparison
│
├── visualize_embeddings.py   # 2D UMAP/t-SNE embedding visualization
├── visualize_loss.py         # Training/validation loss visualization
│
├── requirements.txt          # Required packages
├── README.md                 # Project documentation
└── .gitignore
```
Since the AIS data used in our experiments contain restricted or proprietary vessel information, we are unable to release the original datasets. To support reproducibility, we provide a publicly available alternative sourced from the U.S. Marine Cadastre AIS archive, preprocessed into a compact SQLite format (download link). This public dataset can be used directly with the notebook preprocessing3.ipynb to reproduce our preprocessing and trajectory-generation pipeline.

MoCo-AIS requires Python 3.9–3.12.

```bash
python -m venv mocoais_env
source mocoais_env/bin/activate   # Linux/macOS
mocoais_env\Scripts\activate      # Windows
pip install --upgrade pip
pip install -r requirements.txt
```
With the Marine Cadastre data in AISdb SQLite format, first run preprocessing3.ipynb to produce paired .lat and .lon files for training, validation, and testing.
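For reference, pairing the .lat and .lon files back into trajectories can be as simple as the sketch below. It assumes each line holds one trajectory's space-separated coordinate values; check data_loader.py for the exact format MoCo-AIS expects.

```python
from pathlib import Path

def load_trajectories(lat_path: str, lon_path: str):
    """Pair a .lat and a .lon file line by line into [(lat, lon), ...] trajectories.

    Assumes one trajectory per line, coordinates space-separated
    (an assumption for illustration; see data_loader.py for the real format).
    """
    lat_lines = Path(lat_path).read_text().splitlines()
    lon_lines = Path(lon_path).read_text().splitlines()
    assert len(lat_lines) == len(lon_lines), "lat/lon files must align line-for-line"
    trajectories = []
    for lat_line, lon_line in zip(lat_lines, lon_lines):
        lats = [float(v) for v in lat_line.split()]
        lons = [float(v) for v in lon_line.split()]
        trajectories.append(list(zip(lats, lons)))
    return trajectories
```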
Once the .lat and .lon files are prepared, edit config.py so the data paths are correct, and define the directories for saving embedding distance matrices and model checkpoints:

```python
data = "<YOUR DATA DIRECTORY>"
savedir = "<DISTANCE MATRIX DIRECTORY>"
checkpoint = "<MODEL CHECKPOINT DIRECTORY>"
```
To select the encoder for MoCo-AIS, modify:

```python
encoder_type = "transformer"  # options: transformer, gru, lstm, tcn
```

Other hyperparameters are also customizable in the config file.
For model training, run:

```bash
python train.py
```
During training, the loss and time usage for each epoch are recorded. You may want to log this information to a file for later analysis.
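For example, a small helper (hypothetical, not part of the repo) that appends per-epoch statistics to a CSV file:

```python
import csv

def log_epoch(log_path, epoch, train_loss, val_loss, seconds):
    """Append one epoch's losses and wall-clock time to a CSV log (illustrative helper)."""
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if f.tell() == 0:  # new/empty file: write a header first
            writer.writerow(["epoch", "train_loss", "val_loss", "seconds"])
        writer.writerow([epoch, train_loss, val_loss, seconds])
```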
Upon completion, the best model checkpoint is saved. Run the test script to report the test loss and infer the embeddings used to build the trajectory distance matrix:

```bash
python test_.py
```
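Building the distance matrix from embeddings is a standard pairwise computation; a numpy sketch of the idea (test_.py may normalize or batch the computation differently):

```python
import numpy as np

def embedding_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between L2-normalized trajectory embeddings.

    embeddings: (n, d) array, one row per trajectory.
    Returns an (n, n) symmetric distance matrix with a zero diagonal.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sq = np.sum(z ** 2, axis=1)
    # ||zi - zj||^2 = ||zi||^2 + ||zj||^2 - 2 zi.zj, clipped at 0 for numerical safety
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    return np.sqrt(np.maximum(d2, 0.0))
```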
Evaluation requires the trajectory distance matrix produced above. To compute the mean rank of similarity retrievals, run:

```bash
python evaluate.py
```
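One common way to compute mean rank from a query-by-database distance matrix is sketched below, assuming query i's ground-truth match is database item i (evaluate.py may define ground truth differently):

```python
import numpy as np

def mean_rank(dist_matrix: np.ndarray) -> float:
    """Mean rank of the ground-truth match, assumed to sit on the diagonal."""
    ranks = []
    for i, row in enumerate(dist_matrix):
        # rank 1 means the true match is the single nearest item
        rank = 1 + np.sum(row < row[i])
        ranks.append(rank)
    return float(np.mean(ranks))
```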
The hit-rate evaluation requires both the embedding distance matrix and a distance matrix computed with one of the distance-based metrics. Specify which metric to compare against in config.py, then run:

```bash
python compute_hit_rate.py
```
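A top-K hit rate here typically measures the overlap between each query's K nearest neighbours under the embedding distances and under the distance-based metric. A sketch of that definition, assuming zero self-distances on the diagonal (compute_hit_rate.py may differ in details such as self-exclusion):

```python
import numpy as np

def top_k_hit_rate(embed_dist: np.ndarray, metric_dist: np.ndarray, k: int = 10) -> float:
    """Average fraction of the metric's top-k neighbours recovered by the embedding top-k."""
    n = embed_dist.shape[0]
    hits = 0.0
    for i in range(n):
        # index 0 after argsort is the query itself (self-distance 0), so skip it
        emb_top = set(np.argsort(embed_dist[i])[1:k + 1])
        met_top = set(np.argsort(metric_dist[i])[1:k + 1])
        hits += len(emb_top & met_top) / k
    return hits / n
```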
Mean rank and rank-percentage retrieval performance with MoCo-AIS encoders (Transformer, GRU, LSTM, TCN) and distance-based metrics (Hausdorff, DTW):

| Metric | Transformer | GRU | LSTM | TCN | Hausdorff | DTW |
|---|---|---|---|---|---|---|
| Mean Rank | 3.041 | 9.414 | 2.110 | 6.502 | 1.293 | 1.019 |
| Rank Percentage (%) | 0.052 | 0.161 | 0.036 | 0.111 | 0.110 | 0.087 |
| Time | 1.73 s | 11.03 s | 10.65 s | 1.81 s | 0.3183 h | 5.650 h |

Note that times are in seconds for the MoCo-AIS encoders but in hours for the distance-based metrics.
To compute the distance-based metrics, set the metric name in compute_dist_mat.py, then run:

```bash
python compute_dist_mat.py
```
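For reference, the two metrics used here are dynamic time warping (DTW) and the symmetric Hausdorff distance. Minimal numpy implementations of the textbook definitions are sketched below; base/similarity_metrics.py is the authoritative version in this repo.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) DTW distance between two (n, 2) trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def hausdorff_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two point sets of shape (n, 2)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # all pairwise distances
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```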
Two baseline pipeline notebooks, t2vec_pipeline.ipynb and trajcl_pipeline.ipynb, are provided in the main directory. After adjusting the data paths as needed, run each notebook to generate trajectory embeddings and the corresponding similarity matrices.
For t2vec, evaluation is carried out directly within t2vec_pipeline.ipynb because its embedding file format differs.
For TrajCL, follow the same performance-evaluation steps as MoCo-AIS (see above) after completing trajcl_pipeline.ipynb.
(to be presented soon)