sign-language-processing · AmitMY · Jan 6, 2025
diff --git a/.github/workflows/lint.yaml b/.github/workflows/lint.yaml
@@ -17,10 +17,10 @@ jobs:
       - uses: actions/checkout@v3
       - uses: actions/setup-python@v4
         with:
-          python-version: '3.10'
+          python-version: '3.12'
 
       - name: Install Requirements
-        run: pip install .[dev]
+        run: pip install ".[dev]"
 
       - name: Lint Code
         run: pylint sign_language_segmentation
diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml
@@ -17,10 +17,10 @@ jobs:
       - uses: actions/checkout@v3
       - uses: actions/setup-python@v4
         with:
-          python-version: '3.10'
+          python-version: '3.12'
 
       - name: Install Requirements
-        run: pip install .[dev]
+        run: pip install ".[dev]"
 
       - name: Test Code
         run: pytest sign_language_segmentation
diff --git a/.gitignore b/.gitignore
@@ -1 +1,5 @@
-.idea/
+.idea/
+**/__pycache__/
+.pytest_cache/
+build/
+.env
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
diff --git a/README.md b/README.md
@@ -2,130 +2,86 @@
 
 Pose segmentation model on both the sentence and sign level
 
-Code for the paper [Linguistically Motivated Sign Language Segmentation](https://aclanthology.org/2023.findings-emnlp.846).
-
 ## Usage
 
-
 ```bash
+# Install the package
 pip install git+https://github.com/sign-language-processing/segmentation
-```
 
-To create an ELAN file with sign and sentence segments:
-(To demo this on a longer file, you can download a large pose file from [here](https://firebasestorage.googleapis.com/v0/b/sign-language-datasets/o/poses%2Fholistic%2Fdgs_corpus%2F1413451-11105600-11163240_a.pose?alt=media&token=432f0b57-3fb9-45ad-a9a4-0b6fae4ffcf7))
+# Acquire a MediaPipe Holistic pose file
+wget -O example.pose https://sign-lanugage-datasets.sign-mt.cloud/poses/holistic/dgs_corpus/1413451-11105600-11163240_a.pose
 
-```bash
-pose_to_segments --pose="sign.pose" --elan="sign.eaf" [--video="sign.mp4"]
+# Run the model!
+pose_to_segments --pose="example.pose" --elan="example.eaf" [--video="example.mp4"]
 ```
 
----
-
-## Main Idea
+## 2025 Version
 
-We tag pose sequences with BIO (beginning/in/out) and try to classify each frame. 
-Due to huge sequence sizes intended to work on (full videos), this is not done using a transformer.
-Loss is heavily weighted in favor of "B" as it is a "rare" prediction compared to I and O.
-
-
-#### Pseudo code:
-
-```python
-pose_embedding = embed_pose(pose)
-pose_encoding = encoder(pose_embedding)
-sign_bio = sign_bio_tagger(pose_encoding)
-sentence_bio = sentence_bio_tagger(pose_encoding)
-```
+The original version of the code supported many experimental model architectures.
+In the current version, we simplify the code base to allow continued support.
 
-## Extra details
+### Summary of Improvements
 
-- Model tests, including overfitting, and continuous integration
-- We remove the legs because they are not informative
-- For experiment management we use WANDB
-- Training works on CPU and GPU (90% util)
-- Multiple-GPUs not tested
+| Category              | Original (2023)              | Current (2025)          |
+|-----------------------|------------------------------|-------------------------|
+| Reliability           | Unreliable in the first 2-3s | Should be more reliable |
+| Inference Performance | Slow LSTM-based              | Fast CNN-based          |
+| Training Efficiency   | Used wasteful padding        | Using packed sequences  |
 
-## Motivation
+### Development
 
-### Optical flow 
-Optical flow is highly correlative to phrase boundaries. 
-
-![Optical flow](sign_language_segmentation/figures/optical_fow/optical_flow_sentence_example.png)
-
-### 3D Hand Normalization
-3D hand normalization may assist the model with learning hand shape changes.
-
-Watch [this video](https://youtu.be/pCKRWSNIaNQ?t=191) to see how it's done.
-
-## Reproducing Experiments
-
-### E0: Moryossef et al. (2020)
-This is an attempt to reproduce the methodology of Moryossef et al. (2020) on the DGS corpus.
-Since they used a different document split, and do not filter out wrong data, our results are not directly comparable. This model processes optical flow as input and outputs I (is signing) and O (not signing) tags.
+<details>
+<summary>Create the environment</summary>
 
 ```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=64 --encoder_depth=1 --encoder_bidirectional=false --optical_flow=true --only_optical_flow=true --weighted_loss=false --classes=io
-```
+conda create --name segmentation python=3.12 -y
+conda activate segmentation
+pip install ".[dev]"
 
-### E1: Bidirectional BIO Tagger
-We replace the IO tagging heads in E0 with BIO heads to form our baseline. Our preliminary experiments indicate that inputting only the 75 hand and body keypoints and making the LSTM layer bidirectional yields optimal results.
-```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true
-```
-Or for the mediapi-skel dataset (only phrase segmentation)
-```bash
-# FPS is not relevant for mediapi-skel
-export MEDIAPI_PATH=/shares/volk.cl.uzh/amoryo/datasets/mediapi/mediapi-skel.zip
-export MEDIAPI_POSE_PATH=/shares/volk.cl.uzh/amoryo/datasets/mediapi/mediapipe_zips.zip
-python -m sign_language_segmentation.src.train --dataset=mediapi_skel --pose=holistic --fps=0 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true
-```
-
-### E2: Adding Reduced Face Keypoints
-
-Although the 75 hand and body keypoints serve as an efficient minimal set for sign language detection/segmentation models, we investigate the impact of other nonmanual sign language articulators, namely, the face. We introduce a reduced set of 128 face keypoints that signify the signer's face contour.
-```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --pose_components POSE_LANDMARKS LEFT_HAND_LANDMARKS RIGHT_HAND_LANDMARKS FACE_LANDMARKS --pose_reduce_face=true
-```
-
-### E3: Adding Optical Flow
-
-At every time step $t$ we append the optical flow between $t$ and $t-1$ to the current pose frame as an additional dimension after the $XYZ$ axes.
-```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --optical_flow=true
+# Confirm the environment
+pylint sign_language_segmentation
+pytest sign_language_segmentation
 ```
 
-### E4: Adding 3D Hand Normalization
+</details>
 
-At every time step, we normalize the hand poses and concatenate them to the current pose frame.
-```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --optical_flow=true --hand_normalization=true
-```
+<details>
+<summary>Prepare the dataset</summary>
 
-### E5: Autoregressive Encoder
+Requires access to the annotations database.
 
-We add autoregressive connections between time steps to encourage consistent output labels. The logits at time step $t$ are concatenated to the input of the next time step, $t+1$. This modification is implemented bidirectionally by stacking two autoregressive encoders and adding their output up before the Softmax operation. However, this approach is inherently slow, as we have to fully wait for the previous time step predictions before we can feed them to the next time step.
 ```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=4 --encoder_bidirectional=true --encoder_autoregressive=true --optical_flow=true --hand_normalization=true --epochs=50 --patience=10
+# Prepare the entire dataset
+python -m sign_language_segmentation.data.create_dataset \
+  --poses="/Volumes/Echo/GCS/sign-mt-poses/" \
+  --output="/tmp/segmentation/"
+
+# Make sure you can load the dataset
+python -m sign_language_segmentation.data.dataset \
+  --dataset="/tmp/segmentation/"
 ```
 
-CAUTION: this experiment does not improve the model as expected and runs very slowly.
-
-## Test and Evaluation
+</details>
 
-To test and evaluate a model, add the `train=false` and `--checkpoint` flag. Take E1 as an example:
+<details>
+<summary>Train the model</summary>
 
 ```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --train=false --checkpoint=./models/E1-1/best.ckpt
+python -m sign_language_segmentation.src.train \
+    --dataset=dgs_corpus \
+    --pose=holistic \
+    --fps=25 \
+    --hidden_dim=256 \
+    --encoder_depth=1 \
+    --encoder_bidirectional=true
 ```
 
-It's also possible to adjust the decoding algorithm by setting the `b_threshold` and the `o_threshold`:
+</details>
 
-```bash
-python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --train=false --checkpoint=./models/E1-1/best.ckpt --b_threshold=50 --o_threshold=50
-```
+## 2023 Version ([v2023](https://github.com/sign-language-processing/segmentation/tree/v2023))
 
-To test on an external dataset, see [evaluate_mediapi.py](https://github.com/sign-language-processing/transcription/blob/main/sign_language_segmentation/src/evaluate_mediapi.py) for an example.
-
-## Cite
+Exact code for the
+paper [Linguistically Motivated Sign Language Segmentation](https://aclanthology.org/2023.findings-emnlp.846).
 
 ```bibtex
 @inproceedings{moryossef-etal-2023-linguistically,
@@ -141,4 +97,4 @@ To test on an external dataset, see [evaluate_mediapi.py](https://github.com/sig
     doi = "10.18653/v1/2023.findings-emnlp.846",
     pages = "12703--12724",
 }
-```
+```
diff --git a/pyproject.toml b/pyproject.toml
@@ -8,22 +8,22 @@ authors = [
 ]
 readme = "README.md"
 dependencies = [
-    "pose-format>=0.3.2",
+    "pose-format>=0.8.1",
     "numpy",
     "pympi-ling", # Working with ELAN files in CLI
     "torch",
+    "pose_anonymization @ git+https://github.com/sign-language-processing/pose-anonymization", # Used for normalization
 ]
 
 [project.optional-dependencies]
 dev = [
     "pytest",
     "pylint",
     "pytorch-lightning",
-    "sign-language-datasets",
     "wandb",
     "matplotlib",
     "scikit-learn",
-    "pandas"
+    "psycopg2-binary" # to fetch the annotations from the database
 ]
 
 [tool.yapf]

diff --git a/sign_language_segmentation/args.py b/sign_language_segmentation/args.py
@@ -0,0 +1,51 @@
+import random
+from argparse import ArgumentParser
+
+import numpy as np
+import torch
+
+parser = ArgumentParser()
+
+# wandb
+parser.add_argument('--no_wandb', action='store_true', default=True, help='ignore wandb?')
+parser.add_argument('--run_name', type=str, default=None, help='name of wandb run')
+parser.add_argument('--wandb_dir', type=str, default='.', help='where to store wandb data')
+
+# Training Arguments
+parser.add_argument('--seed', type=int, default=42, help='random seed')
+parser.add_argument('--device', type=str, default='gpu', help='device to use, cpu or gpu')
+parser.add_argument('--gpus', type=int, default=1, help='how many gpus')
+parser.add_argument('--epochs', type=int, default=100, help='how many epochs')
+parser.add_argument('--patience', type=int, default=20, help='how many epochs as the patience for early stopping')
+parser.add_argument('--batch_size', type=int, default=1, help='batch size')
+parser.add_argument('--num_frames_per_item', type=int, default=2 ** 10, help='number of frames per item')
+parser.add_argument('--batch_size_devtest', type=int, default=20,
+                    help='batch size for dev and test (by default run all in one batch)')
+parser.add_argument('--learning_rate', type=float, default=1e-3, help='optimizer learning rate')
+parser.add_argument('--steps_per_epoch', type=int, default=100, help='steps per epoch')
+
+# Data Arguments
+parser.add_argument('--dataset', default='/tmp/segmentation', help='which dataset to use?')
+
+# Model Arguments
+parser.add_argument('--hidden_dim', type=int, default=256, help='encoder hidden dimension')
+
+parser.add_argument('--save_jit', action="store_true", default=False, help='whether to save model without code?')
+
+# Prediction args
+parser.add_argument('--checkpoint', type=str, default=None, metavar='PATH', help="Checkpoint path for prediction")
+parser.add_argument('--pred_output', type=str, default=None, metavar='PATH', help="Path for saving prediction files")
+
+args = parser.parse_args()
+
+print('Arguments:', args)
+
+# ---------------------
+# Set Seed
+# ---------------------
+if args.seed == 0:  # Make seed random if 0
+    args.seed = random.randint(0, 1000)
+torch.manual_seed(args.seed)
+np.random.seed(args.seed)
+random.seed(args.seed)
+
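Review note on `--no_wandb`: as declared in `args.py`, `action='store_true'` combined with `default=True` means the parsed value is `True` whether or not the flag is passed. The sketch below reproduces that behavior and shows one possible alternative. It assumes Python 3.9+ (for `argparse.BooleanOptionalAction`), and the `--wandb` flag name is hypothetical, not part of this PR.

```python
from argparse import ArgumentParser, BooleanOptionalAction

# As in the diff: store_true with default=True can only ever yield True.
current = ArgumentParser()
current.add_argument('--no_wandb', action='store_true', default=True, help='ignore wandb?')
always_true = [
    current.parse_args([]).no_wandb,
    current.parse_args(['--no_wandb']).no_wandb,
]

# Hypothetical alternative (flag name is an assumption): a single --wandb
# switch that is off by default and gets an automatic --no-wandb negation.
sketch = ArgumentParser()
sketch.add_argument('--wandb', action=BooleanOptionalAction, default=False,
                    help='log the run to wandb?')
default_value = sketch.parse_args([]).wandb
enabled = sketch.parse_args(['--wandb']).wandb
disabled = sketch.parse_args(['--no-wandb']).wandb

print(always_true, default_value, enabled, disabled)
```

Keeping wandb off by default preserves what the current default appears to intend, while still letting a training run opt in from the command line.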
diff --git a/sign_language_segmentation/data/README.md b/sign_language_segmentation/data/README.md
@@ -0,0 +1,26 @@
+# Data
+
+We rely on https://tube.sign.mt to provide annotations of sign and sentence spans.
+All annotations that start immediately, at time=0, are removed, as these are artifacts of the data creation process.
+
+Captions with the language codes `Sgnw` (SignWriting) and `hns` (HamNoSys) are considered signs. 
+Captions with the code `gloss` (Glosses) are considered signs only if the annotation contains no whitespace.
+All other captions are considered sentences.
+
+## Split
+
+In [split.json](./split.json), we extend the data split from 
+[split.3.0.0-uzh-document](https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/splits/split.3.0.0-uzh-document.json)
+to remove the hard-coded train set.
+
+
+## Storage
+
+- Poses are pre-processed: they are normalized so that the shoulders have size 1,
+the face mesh coordinates are reduced to only include the face contour,
+and the legs are removed.
+
+- BIO is stored as `uint8`, with the following mapping `{"O": 0, "B": 1, "I": 2}`.
+
+We pack the data by introducing 100 empty frames between consecutive pose files.
+This should be equivalent to having CNN padding at inference.
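Review note: the caption rules, the BIO mapping, and the packing described in `data/README.md` can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `create_dataset`: the function names, the half-open `(start, end)` frame spans, and tagging the 100 gap frames as `O` are all hypothetical choices made for the example.

```python
import numpy as np

BIO = {"O": 0, "B": 1, "I": 2}   # mapping stated in the README
GAP_FRAMES = 100                 # empty frames between pose files, per the README
SIGN_CODES = {"Sgnw", "hns"}     # SignWriting and HamNoSys


def caption_level(language_code: str, text: str) -> str:
    """Return "sign" or "sentence" following the caption rules in the README."""
    if language_code in SIGN_CODES:
        return "sign"
    if language_code == "gloss" and not any(ch.isspace() for ch in text):
        return "sign"
    return "sentence"


def bio_tags(num_frames: int, spans) -> np.ndarray:
    """Tag one file: B on the first frame of each span, I on the rest, O elsewhere.

    `spans` holds hypothetical half-open frame ranges (start, end).
    """
    tags = np.full(num_frames, BIO["O"], dtype=np.uint8)
    for start, end in spans:
        if end <= start:
            continue  # skip empty spans
        tags[start:end] = BIO["I"]
        tags[start] = BIO["B"]
    return tags


def pack(tag_arrays) -> np.ndarray:
    """Concatenate per-file tags with GAP_FRAMES of O between files (assumption)."""
    if not tag_arrays:
        return np.zeros(0, dtype=np.uint8)
    gap = np.full(GAP_FRAMES, BIO["O"], dtype=np.uint8)
    parts = []
    for i, tags in enumerate(tag_arrays):
        if i > 0:
            parts.append(gap)
        parts.append(tags)
    return np.concatenate(parts)


first = bio_tags(5, [(1, 3)])
second = bio_tags(4, [(0, 2)])
packed = pack([first, second])
print(first.tolist(), second.tolist(), packed.shape)
```

With a gap between files, a convolution centered near the end of one file reads empty frames rather than frames from the next file, which is the sense in which packing resembles padding at inference.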
diff --git a/...language_segmentation/figures/__init__.py → sign_language_segmentation/data/__init__.py b/...language_segmentation/figures/__init__.py → sign_language_segmentation/data/__init__.py