enifeder/seq2vec

seq2vec

Seq2Vec model code + analysis utilities, with an emphasis on model–neuronal comparison (RSA, regression, CCA, UMAP, time-resolved metrics, etc.).

The Python source lives under src/python/. Most scripts are intended to be run from the repo root (see “Running” below).

Repo layout (high level)

  • src/: source code
    • src/python/: Python modules and scripts (primary code)
    • src/ref_matlab/: MATLAB reference extraction scripts + a small reference .mat
  • data/: datasets and intermediate data products (mostly gitignored; one example dataset is tracked)
  • models/: model checkpoints (mostly gitignored; one example checkpoint is tracked)
  • views/: analysis figures and inference cache (gitignored; generated)
  • MLspike/: external/embedded code folder (present locally; ignored by git per .gitignore)
  • .cursor/, .venv/, __pycache__/: editor/venv/python cache (not part of runtime artifacts)

For a more detailed breakdown of src/python/, see src/python/README.md.

Folder-by-folder details (what goes where)

src/python/

Python code is organized by workflow:

  • src/python/paths.py: the single source of truth for paths. DATA_HOME points at the directory that contains data/, models/, and views/.
  • src/python/model/: model training + model–neural comparison.
    • src/python/model/train/: Seq2Vec/autoencoder training, dataset construction, probes, and tests.
    • src/python/model/analysis/: model–neuronal comparison analyses and plotting.
      • run_all.py: orchestrates RSA/CCA/regression/UMAP/etc. from a single config block.
      • preprocess.py: loads dataset + checkpoint, aligns trials, extracts representations, writes inference cache.
  • src/python/analysis/: “neural-only” analyses and plots (correlation-over-time, dynamics GIF utilities, block direction significance, etc.).
  • src/python/behavior/: behavioral metrics and plots (success rate, expert days, movement-event plots).
  • src/python/data_subject/ and src/python/subject_data/: utilities for working with subject/session data and trial extraction/alignment (loading pickles, neuron selection, time normalization).
  • src/python/deconvolution/: scripts for examining spike deconvolution quality.

src/ref_matlab/

Reference MATLAB extraction code and a small reference dataset:

  • extract_data_seq2vec_v5.m: produces data_seq2vec_v5_m06.mat (event-code normalization + time compression + train/val selection).
  • data_seq2vec_v5_m06.mat: reference .mat output (see “Data/file formats” below).

data/ (under DATA_HOME)

Intended storage for datasets and subject data.

  • Tracked example: data/dataset/seq2vec_dataset_6_8_r0.8_s42_rew.npz
  • Common (not tracked): data/subject_mark/ (subject pickles and/or data_mark.mat), data/raw/, etc. (see .gitignore)

models/ (under DATA_HOME)

Intended storage for checkpoints.

  • Tracked example: models/autoenc_6_8_conv_lr3e-3_a0_pk0_nll_pmse_pmm01_flip.pth

views/ (under DATA_HOME)

Generated outputs only.

  • Figures typically go under views/model_neuronal_comparison/ when running src/python/model/analysis/run_all.py.
  • Cached inference (model forward pass outputs) goes under views/model_neuronal_comparison/inference_cache/.

Requirements

  • Python: 3.10+ (see pyproject.toml)
  • Core deps: see requirements.txt

Setup (virtualenv)

From the repo root:

PowerShell

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -U pip
pip install -r requirements.txt

bash

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -r requirements.txt

Configure DATA_HOME (required)

This repo uses a single source of truth for disk locations in src/python/paths.py.

Edit DATA_HOME so it points at the directory that contains your:

  • data/
  • models/
  • views/

The default is a hard-coded absolute path to the repo (a Windows path, so it will not exist on most other machines):

  • DATA_HOME = Path(r"D:\Projects\seq2vec")

If you move the project or keep large artifacts elsewhere, update DATA_HOME accordingly.
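
As a rough illustration of how code can build on DATA_HOME, the sketch below derives the three top-level directories and resolves a path written relative to DATA_HOME. The names DATA_DIR, MODELS_DIR, VIEWS_DIR, resolve() and missing_dirs() are assumptions made for this example; paths.py may define different helpers.

```python
from pathlib import Path

# Default shown in this README; edit it to match your machine.
DATA_HOME = Path(r"D:\Projects\seq2vec")

# Hypothetical convenience constants (not confirmed to exist in paths.py).
DATA_DIR = DATA_HOME / "data"
MODELS_DIR = DATA_HOME / "models"
VIEWS_DIR = DATA_HOME / "views"


def resolve(relative: str) -> Path:
    """Turn a path written relative to DATA_HOME into a full path."""
    return DATA_HOME / relative


def missing_dirs() -> list[str]:
    """Return the names of expected top-level directories that do not exist."""
    return [d.name for d in (DATA_DIR, MODELS_DIR, VIEWS_DIR) if not d.is_dir()]
```

Calling missing_dirs() right after editing DATA_HOME is a quick way to confirm that the directory layout matches what the scripts expect.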

Running

Run commands from the repo root with PYTHONPATH=. so src.python... is importable.

PowerShell

$env:PYTHONPATH="."
python -m src.python.model.analysis.run_all

bash

export PYTHONPATH=.
python -m src.python.model.analysis.run_all

Alternative (direct script invocation)

run_all.py also supports being executed as a file path (it inserts the project root into sys.path):

python src/python/model/analysis/run_all.py
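
The sys.path insertion mentioned above can be sketched as follows. This is a minimal illustration under the assumption that the repo root sits four folders above the script's own folder (analysis/ -> model/ -> python/ -> src/ -> root); repo_root_for() and ensure_importable() are hypothetical names, and the actual code in run_all.py may differ.

```python
import sys
from pathlib import Path


def repo_root_for(script: Path, levels_up: int = 4) -> Path:
    """Return the directory `levels_up` folders above the script's folder.

    For src/python/model/analysis/run_all.py the script's folder is
    analysis/, and four levels above that is the repo root.
    """
    return script.parents[levels_up]


def ensure_importable(script: Path) -> Path:
    """Put the repo root at the front of sys.path if it is not already listed."""
    root = repo_root_for(script.resolve())
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# Inside a script one would typically call: ensure_importable(Path(__file__))
```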

What run_all.py expects

Open src/python/model/analysis/run_all.py and set the config block near the top, especially:

  • CHECKPOINT_PATH: relative to DATA_HOME (example in file: models/autoenc_... .pth)
  • DATASET_PATH: relative to DATA_HOME (example in file: data/dataset/... .npz)
  • SUBJECT, DAY, REGION (e.g. "cbl" or "ctx")
  • OUTPUT_DIR: relative to DATA_HOME (e.g. views/model_neuronal_comparison) or None to show plots without saving
  • RUN_ANALYSES: list of analysis IDs to run (e.g. [4] for RSA only)
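
For orientation, a filled-in config block might look like the sketch below. The checkpoint and dataset paths are the tracked examples named in this README; the SUBJECT and DAY values are placeholders whose real form depends on your data, and config_problems() is a hypothetical helper (it also assumes analysis IDs are integers) rather than something run_all.py is known to provide.

```python
from pathlib import Path

# Values taken from the tracked examples in this README.
CHECKPOINT_PATH = "models/autoenc_6_8_conv_lr3e-3_a0_pk0_nll_pmse_pmm01_flip.pth"
DATASET_PATH = "data/dataset/seq2vec_dataset_6_8_r0.8_s42_rew.npz"
REGION = "cbl"                                  # or "ctx"
OUTPUT_DIR = "views/model_neuronal_comparison"  # None shows plots without saving
RUN_ANALYSES = [4]                              # RSA only

# Placeholders: the real identifiers and their types depend on your data.
SUBJECT = "subject_id"
DAY = 1


def config_problems(data_home: Path) -> list[str]:
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    for label, rel in (("checkpoint", CHECKPOINT_PATH), ("dataset", DATASET_PATH)):
        if not (data_home / rel).is_file():
            problems.append(f"{label} not found: {data_home / rel}")
    # Assumption: analysis IDs are plain integers, as in the [4] example.
    if not all(isinstance(a, int) for a in RUN_ANALYSES):
        problems.append("RUN_ANALYSES should be a list of integer analysis IDs")
    return problems
```

Running config_problems(DATA_HOME) before a long analysis run surfaces a missing checkpoint or dataset immediately instead of partway through.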

Outputs and caching

  • Figures / results: typically under views/model_neuronal_comparison/ when OUTPUT_DIR is set in run_all.py.
  • Inference cache: model forward-pass results are cached as .npz files under:
    • views/model_neuronal_comparison/inference_cache/

Re-running the same analysis configuration should reuse cached inference and skip expensive forward passes.
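
One common way to get this kind of reuse is to derive the cache file name from the settings that affect the forward pass. The sketch below shows that idea only; cache_file() and the inference_<digest>.npz naming are assumptions, and preprocess.py may key its cache differently.

```python
import hashlib
import json
from pathlib import Path


def cache_file(cache_dir: Path, **settings) -> Path:
    """Map a set of settings to a deterministic .npz path inside cache_dir."""
    # sort_keys makes the serialisation independent of keyword order;
    # default=str lets non-JSON values such as Path objects be included.
    payload = json.dumps(settings, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"inference_{digest}.npz"


def needs_forward_pass(path: Path) -> bool:
    """A forward pass is needed only when no cached file exists yet."""
    return not path.is_file()
```

With this scheme, changing any setting (for example the region) yields a different file name, so stale results are never picked up by accident.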

Data/file formats (what’s inside)

Dataset .npz (under data/dataset/)

Datasets are stored as NumPy .npz archives. Code paths that read them include:

  • src/python/model/train/seq2vec_data.py (training/validation splits)
  • src/python/model/analysis/preprocess.py (analysis bundles; also uses “full trial” arrays for some analyses)

Common keys you will see (depending on how the dataset was generated):

  • Windowed / shifted-window data:
    • X_train, E_train, X_val, E_val (and optionally X_test, E_test)
    • train_idx, val_idx (and optionally test_idx)
  • Full-trial strips (used for “full trial” and shifted-window analysis bundles):
    • X_full_train, E_full_train, X_full_val, E_full_val, X_full_test, E_full_test
  • Offsets:
    • per_trial_offsets (optional; otherwise code may derive default offsets)
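
Because the available keys vary between datasets, it can help to inspect an archive before using it. The sketch below lists every array with its shape and dtype and reports which of the windowed keys are absent; summarize_npz() and absent_keys() are illustrative helpers, not functions from this repo.

```python
import numpy as np

# Keys this README lists for windowed data; a given dataset may omit some.
WINDOWED_KEYS = ("X_train", "E_train", "X_val", "E_val")


def summarize_npz(path):
    """Return {key: (shape, dtype_name)} for every array in the archive."""
    summary = {}
    # allow_pickle=False is the safe default; it raises ValueError if the
    # archive stores object arrays, in which case you may need to relax it.
    with np.load(path, allow_pickle=False) as archive:
        for key in archive.files:
            array = archive[key]
            summary[key] = (array.shape, array.dtype.name)
    return summary


def absent_keys(path, expected=WINDOWED_KEYS):
    """Return the expected keys that are not present in the archive."""
    with np.load(path, allow_pickle=False) as archive:
        present = set(archive.files)
    return [key for key in expected if key not in present]
```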

Inference cache .npz (under views/.../inference_cache/)

Cached forward-pass outputs produced by src/python/model/analysis/preprocess.py contain:

  • vec: (n_samples, hidden_dim)
  • rnn_out: (n_samples, T, hidden_dim)
  • slot_logits: (n_samples, T, 6)
  • logits: (n_samples, T, 6)
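
The shapes above imply some consistency constraints: all four arrays share n_samples, the three sequence arrays share T, and rnn_out shares hidden_dim with vec. A minimal check, written as a hypothetical helper that accepts either a loaded .npz archive or a plain dict of arrays, could look like this:

```python
import numpy as np


def check_cache_shapes(cache):
    """Validate the four cached arrays and return (n_samples, T, hidden_dim).

    Raises ValueError when the shapes disagree with the documented layout.
    """
    vec, rnn_out = cache["vec"], cache["rnn_out"]
    if vec.ndim != 2 or rnn_out.ndim != 3:
        raise ValueError("vec must be 2-D and rnn_out must be 3-D")
    n_samples, hidden_dim = vec.shape
    n_steps = rnn_out.shape[1]
    expected = {
        "rnn_out": (n_samples, n_steps, hidden_dim),
        "slot_logits": (n_samples, n_steps, 6),
        "logits": (n_samples, n_steps, 6),
    }
    for key, shape in expected.items():
        if cache[key].shape != shape:
            raise ValueError(f"{key}: expected {shape}, got {cache[key].shape}")
    return n_samples, n_steps, hidden_dim
```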

Checkpoint .pth (under models/)

PyTorch checkpoint used to load model weights for inference/training (e.g. autoencoder checkpoints used by run_all.py).

MATLAB reference .mat (under src/ref_matlab/)

src/ref_matlab/data_seq2vec_v5_m06.mat (generated by the MATLAB scripts here) is saved with variables like:

  • ses, cbl_vec, seq_b, seq_btype, seq_b2, seq_b2_2, mask, sbj_m, learning

Reference and data artifacts tracked in git (not an exhaustive list of outputs)

This repo intentionally gitignores most large artifacts, but does include a few small/representative references:

  • Dataset example: data/dataset/seq2vec_dataset_6_8_r0.8_s42_rew.npz
  • Checkpoint example: models/autoenc_6_8_conv_lr3e-3_a0_pk0_nll_pmse_pmm01_flip.pth
  • MATLAB reference: src/ref_matlab/data_seq2vec_v5_m06.mat + extraction scripts in src/ref_matlab/
  • Probe/params JSON:
    • glm_params_probe.json, glm_params_probe_quick.json (repo root)
    • src/python/model/analysis/regression_params.json, src/python/model/analysis/glm_params.json
  • Misc reference text: email_from_Mark.txt

Common gotchas (internal)

  • Run from the repo root and set PYTHONPATH=. (otherwise imports of the form from src.python... will fail).
  • Many large artifacts are intentionally not tracked by git (see .gitignore). You’re expected to have the relevant datasets/checkpoints present under your configured DATA_HOME.
