WIP feat(2025): major refactor to train directly from database, new model architecture #7

4 changes: 2 additions & 2 deletions .github/workflows/lint.yaml
@@ -17,10 +17,10 @@ jobs:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: '3.12'

- name: Install Requirements
run: pip install .[dev]
run: pip install ".[dev]"

- name: Lint Code
run: pylint sign_language_segmentation
4 changes: 2 additions & 2 deletions .github/workflows/test.yaml
@@ -17,10 +17,10 @@ jobs:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: '3.12'

- name: Install Requirements
run: pip install .[dev]
run: pip install ".[dev]"

- name: Test Code
run: pytest sign_language_segmentation
6 changes: 5 additions & 1 deletion .gitignore
@@ -1 +1,5 @@
.idea/
.idea/
**/__pycache__/
.pytest_cache/
build/
.env
15 changes: 0 additions & 15 deletions .pre-commit-config.yaml

This file was deleted.

144 changes: 50 additions & 94 deletions README.md
@@ -2,130 +2,86 @@

Pose segmentation model on both the sentence and sign level

Code for the paper [Linguistically Motivated Sign Language Segmentation](https://aclanthology.org/2023.findings-emnlp.846).

## Usage


```bash
# Install the package
pip install git+https://github.com/sign-language-processing/segmentation
```

To create an ELAN file with sign and sentence segments:
(To demo this on a longer file, you can download a large pose file from [here](https://firebasestorage.googleapis.com/v0/b/sign-language-datasets/o/poses%2Fholistic%2Fdgs_corpus%2F1413451-11105600-11163240_a.pose?alt=media&token=432f0b57-3fb9-45ad-a9a4-0b6fae4ffcf7))
# Acquire a MediaPipe Holistic pose file
wget -O example.pose https://sign-lanugage-datasets.sign-mt.cloud/poses/holistic/dgs_corpus/1413451-11105600-11163240_a.pose

```bash
pose_to_segments --pose="sign.pose" --elan="sign.eaf" [--video="sign.mp4"]
# Run the model!
pose_to_segments --pose="example.pose" --elan="example.eaf" [--video="example.mp4"]
```

---

## Main Idea
## 2025 Version

We tag pose sequences with BIO (beginning/in/out) and try to classify each frame.
Due to huge sequence sizes intended to work on (full videos), this is not done using a transformer.
Loss is heavily weighted in favor of "B" as it is a "rare" prediction compared to I and O.


#### Pseudo code:

```python
pose_embedding = embed_pose(pose)
pose_encoding = encoder(pose_embedding)
sign_bio = sign_bio_tagger(pose_encoding)
sentence_bio = sentence_bio_tagger(pose_encoding)
```
The original version of the code supported many experimental model architectures.
In the current version, we simplify the code base to allow continued support.

## Extra details
### Summary of Improvements

- Model tests, including overfitting, and continuous integration
- We remove the legs because they are not informative
- For experiment management we use WANDB
- Training works on CPU and GPU (90% util)
- Multiple-GPUs not tested
| Category | Original (2023) | Current (2025) |
|-----------------------|------------------------------|-------------------------|
| Reliability | Unreliable in the first 2-3s | Should be more reliable |
| Inference Performance | Slow LSTM-based | Fast CNN-based |
| Training Efficiency | Used wasteful padding | Using packed sequences |

## Motivation
### Development

### Optical flow
Optical flow is highly correlative to phrase boundaries.

![Optical flow](sign_language_segmentation/figures/optical_fow/optical_flow_sentence_example.png)

### 3D Hand Normalization
3D hand normalization may assist the model with learning hand shape changes.

Watch [this video](https://youtu.be/pCKRWSNIaNQ?t=191) to see how it's done.

## Reproducing Experiments

### E0: Moryossef et al. (2020)
This is an attempt to reproduce the methodology of Moryossef et al. (2020) on the DGS corpus.
Since they used a different document split, and do not filter out wrong data, our results are not directly comparable. This model processes optical flow as input and outputs I (is signing) and O (not signing) tags.
<details>
<summary>Create the environment</summary>

```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=64 --encoder_depth=1 --encoder_bidirectional=false --optical_flow=true --only_optical_flow=true --weighted_loss=false --classes=io
```
conda create --name segmentation python=3.12 -y
conda activate segmentation
pip install ".[dev]"

### E1: Bidirectional BIO Tagger
We replace the IO tagging heads in E0 with BIO heads to form our baseline. Our preliminary experiments indicate that inputting only the 75 hand and body keypoints and making the LSTM layer bidirectional yields optimal results.
```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true
```
Or for the mediapi-skel dataset (only phrase segmentation)
```bash
# FPS is not relevant for mediapi-skel
export MEDIAPI_PATH=/shares/volk.cl.uzh/amoryo/datasets/mediapi/mediapi-skel.zip
export MEDIAPI_POSE_PATH=/shares/volk.cl.uzh/amoryo/datasets/mediapi/mediapipe_zips.zip
python -m sign_language_segmentation.src.train --dataset=mediapi_skel --pose=holistic --fps=0 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true
```

### E2: Adding Reduced Face Keypoints

Although the 75 hand and body keypoints serve as an efficient minimal set for sign language detection/segmentation models, we investigate the impact of other nonmanual sign language articulators, namely, the face. We introduce a reduced set of 128 face keypoints that signify the signer's face contour.
```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --pose_components POSE_LANDMARKS LEFT_HAND_LANDMARKS RIGHT_HAND_LANDMARKS FACE_LANDMARKS --pose_reduce_face=true
```

### E3: Adding Optical Flow

At every time step $t$ we append the optical flow between $t$ and $t-1$ to the current pose frame as an additional dimension after the $XYZ$ axes.
```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --optical_flow=true
# Confirm the environment
pylint sign_language_segmentation
pytest sign_language_segmentation
```

### E4: Adding 3D Hand Normalization
</details>

At every time step, we normalize the hand poses and concatenate them to the current pose frame.
```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --optical_flow=true --hand_normalization=true
```
<details>
<summary>Prepare the dataset</summary>

### E5: Autoregressive Encoder
Requires access to the annotations database.

We add autoregressive connections between time steps to encourage consistent output labels. The logits at time step $t$ are concatenated to the input of the next time step, $t+1$. This modification is implemented bidirectionally by stacking two autoregressive encoders and adding their output up before the Softmax operation. However, this approach is inherently slow, as we have to fully wait for the previous time step predictions before we can feed them to the next time step.
```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=4 --encoder_bidirectional=true --encoder_autoregressive=true --optical_flow=true --hand_normalization=true --epochs=50 --patience=10
# Prepare the entire dataset
python -m sign_language_segmentation.data.create_dataset \
--poses="/Volumes/Echo/GCS/sign-mt-poses/" \
--output="/tmp/segmentation/"

# Make sure you can load the dataset
python -m sign_language_segmentation.data.dataset \
--dataset="/tmp/segmentation/"
```

CAUTION: this experiment does not improve the model as expected and runs very slowly.

## Test and Evaluation
</details>

To test and evaluate a model, add the `--train=false` and `--checkpoint` flags. Take E1 as an example:
<details>
<summary>Train the model</summary>

```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --train=false --checkpoint=./models/E1-1/best.ckpt
python -m sign_language_segmentation.src.train \
--dataset=dgs_corpus \
--pose=holistic \
--fps=25 \
--hidden_dim=256 \
--encoder_depth=1 \
--encoder_bidirectional=true
```

It's also possible to adjust the decoding algorithm by setting the `b_threshold` and the `o_threshold`:
</details>

```bash
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --train=false --checkpoint=./models/E1-1/best.ckpt --b_threshold=50 --o_threshold=50
```
## 2023 Version ([v2023](https://github.com/sign-language-processing/segmentation/tree/v2023))

To test on an external dataset, see [evaluate_mediapi.py](https://github.com/sign-language-processing/transcription/blob/main/sign_language_segmentation/src/evaluate_mediapi.py) for an example.

## Cite
Exact code for the
paper [Linguistically Motivated Sign Language Segmentation](https://aclanthology.org/2023.findings-emnlp.846).

```bibtex
@inproceedings{moryossef-etal-2023-linguistically,
@@ -141,4 +97,4 @@ To test on an external dataset, see [evaluate_mediapi.py](https://github.com/sig
doi = "10.18653/v1/2023.findings-emnlp.846",
pages = "12703--12724",
}
```
```
6 changes: 3 additions & 3 deletions pyproject.toml
@@ -8,22 +8,22 @@ authors = [
]
readme = "README.md"
dependencies = [
"pose-format>=0.3.2",
"pose-format>=0.8.1",
"numpy",
"pympi-ling", # Working with ELAN files in CLI
"torch",
"pose_anonymization @ git+https://github.com/sign-language-processing/pose-anonymization", # Used for normalization
]

[project.optional-dependencies]
dev = [
"pytest",
"pylint",
"pytorch-lightning",
"sign-language-datasets",
"wandb",
"matplotlib",
"scikit-learn",
"pandas"
"psycopg2-binary" # to fetch the annotations from the database
]

[tool.yapf]
51 changes: 51 additions & 0 deletions sign_language_segmentation/args.py
@@ -0,0 +1,51 @@
import random
from argparse import ArgumentParser

import numpy as np
import torch

parser = ArgumentParser()

# wandb
parser.add_argument('--no_wandb', action='store_true', default=True, help='ignore wandb?')
parser.add_argument('--run_name', type=str, default=None, help='name of wandb run')
parser.add_argument('--wandb_dir', type=str, default='.', help='where to store wandb data')

# Training Arguments
parser.add_argument('--seed', type=int, default=42, help='random seed')
parser.add_argument('--device', type=str, default='gpu', help='device to use, cpu or gpu')
parser.add_argument('--gpus', type=int, default=1, help='how many gpus')
parser.add_argument('--epochs', type=int, default=100, help='how many epochs')
parser.add_argument('--patience', type=int, default=20, help='how many epochs as the patience for early stopping')
parser.add_argument('--batch_size', type=int, default=1, help='batch size')
parser.add_argument('--num_frames_per_item', type=int, default=2 ** 10, help='number of frames per item')
parser.add_argument('--batch_size_devtest', type=int, default=20,
help='batch size for dev and test (by default run all in one batch)')
parser.add_argument('--learning_rate', type=float, default=1e-3, help='optimizer learning rate')
parser.add_argument('--steps_per_epoch', type=int, default=100, help='steps per epoch')

# Data Arguments
parser.add_argument('--dataset', default='/tmp/segmentation', help='which dataset to use?')

# Model Arguments
parser.add_argument('--hidden_dim', type=int, default=256, help='encoder hidden dimension')

parser.add_argument('--save_jit', action="store_true", default=False, help='whether to save model without code?')

# Prediction args
parser.add_argument('--checkpoint', type=str, default=None, metavar='PATH', help="Checkpoint path for prediction")
parser.add_argument('--pred_output', type=str, default=None, metavar='PATH', help="Path for saving prediction files")

args = parser.parse_args()

print('Arguments:', args)

# ---------------------
# Set Seed
# ---------------------
if args.seed == 0:  # Make seed random if 0
    args.seed = random.randint(0, 1000)
torch.manual_seed(args.seed)
np.random.seed(args.seed)
random.seed(args.seed)

26 changes: 26 additions & 0 deletions sign_language_segmentation/data/README.md
@@ -0,0 +1,26 @@
# Data

We rely on https://tube.sign.mt to provide annotations of sign and sentence spans.
All annotations that start immediately, at time=0, are removed, as these are artifacts of the data creation process.

Captions with the language codes `Sgnw` (SignWriting) and `hns` (HamNoSys) are considered signs.
`gloss` (Glosses) are considered signs only if there is no whitespace in the annotation.
All other captions are considered sentences.
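
The caption rules above can be sketched as a small function. This is only an illustration: `caption_kind` and `keep_annotation` are hypothetical names, and the actual code in this PR may structure the check differently.

```python
SIGN_LANGUAGE_CODES = {"Sgnw", "hns"}  # SignWriting and HamNoSys


def keep_annotation(start_time: float) -> bool:
    """Drop annotations that start at time=0 (data creation artifacts)."""
    return start_time > 0


def caption_kind(language: str, text: str) -> str:
    """Return "sign" or "sentence" for a single caption (hypothetical helper)."""
    if language in SIGN_LANGUAGE_CODES:
        return "sign"
    # A gloss counts as a sign only if the annotation has no whitespace.
    if language == "gloss" and not any(ch.isspace() for ch in text):
        return "sign"
    return "sentence"
```

For instance, a `gloss` caption `HOUSE` would be treated as a sign, while `MY HOUSE` would be treated as a sentence.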

## Split

In [split.json](./split.json), we extend the data split from
[split.3.0.0-uzh-document](https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/splits/split.3.0.0-uzh-document.json)
to remove the hard-coded train set.


## Storage

- Poses are pre-processed: the shoulders are normalized to be of size 1,
the face mesh coordinates are reduced to only include the face contour,
and the legs are removed.

- BIO is stored as `uint8`, with the following mapping `{"O": 0, "B": 1, "I": 2}`.
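
As a minimal sketch of the BIO mapping above, the following converts frame-level segments into a `uint8` tag array. The helper name `segments_to_bio`, the half-open `(start, end)` frame convention, and marking only the first frame of each segment as "B" are assumptions for illustration, not necessarily what this PR implements.

```python
import numpy as np

BIO = {"O": 0, "B": 1, "I": 2}  # mapping stated in the text


def segments_to_bio(num_frames: int, segments: list[tuple[int, int]]) -> np.ndarray:
    """Tag frames for segments given as half-open (start, end) frame ranges."""
    tags = np.full(num_frames, BIO["O"], dtype=np.uint8)
    for start, end in segments:
        tags[start:end] = BIO["I"]  # frames inside the segment
        tags[start] = BIO["B"]      # first frame marks the beginning
    return tags
```

Storing the tags as `uint8` keeps the label array at one byte per frame.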

We pack the data by introducing 100 empty frames between each pose file.
This should be the same as having CNN padding in inference.
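
The packing step could look roughly like the sketch below. It assumes that "empty" frames are all zeros, that gap frames are labelled "O" (0), and that all pose arrays share the same per-frame shape; `pack_poses` is a hypothetical name.

```python
import numpy as np

GAP_FRAMES = 100  # number of empty frames between pose files, per the text


def pack_poses(poses: list[np.ndarray], bios: list[np.ndarray]):
    """Concatenate poses of shape (frames, ...) with zero-filled gaps between them."""
    pose_parts, bio_parts = [], []
    for i, (pose, bio) in enumerate(zip(poses, bios)):
        if i > 0:
            # Assumption: gap frames are zeros and tagged "O" (0).
            pose_parts.append(np.zeros((GAP_FRAMES, *pose.shape[1:]), dtype=pose.dtype))
            bio_parts.append(np.zeros(GAP_FRAMES, dtype=np.uint8))
        pose_parts.append(pose)
        bio_parts.append(bio.astype(np.uint8))
    return np.concatenate(pose_parts), np.concatenate(bio_parts)
```

Packing this way avoids per-item padding in a batch, while the gaps keep convolution windows from mixing frames of neighbouring files.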