The goal of this project is to create a multi-modal implementation of the Transformer architecture in Swift. It's a learning exercise for me, so I've taken it slowly, starting from a simple image classifier and building it up.
It's also an attempt to answer the question of whether Swift for TensorFlow is ready for non-trivial work.
The use case is based on the paper "Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks" by Matthias Plappert. He created a nice dataset of a few thousand motions, "The KIT Motion-Language Dataset" (paper, website).
The Motion2Language Transformer, which kind of works, is already there. I'm working towards completing the language2motion solution.
I'm using a modified Swift Transformer implementation by Andre Carrera.
- something 2 label
  - image 2 label
    - build image2label dataset with images representing motions
    - assign 5 dummy(ish) classes with PCA and k-means on motion annotations
    - classify motion images (+in fastai, +in swift)
  - language 2 label
    - Transformer encoder on annotation + classifier (pattern sketched after this list)
    - batched prediction
    - Use BERT classifier to assign better labels - didn't work
    - manually assign better labels
  - motion 2 label
    - 1-channel ResNet on motion + classifier
    - ResNet feature extractor + Transformer encoder on motion features + classifier - didn't work
    - Transformer encoder on motion + classifier
    - image 2 label
- language 2 language
  - Transformer seq2seq from annotation to label text
  - Transformer seq2seq from annotation to (same) annotation
- motion 2 language
  - Transformer from motion to annotation (decoding loop sketched after this list)
- language 2 motion
  - Transformer encoder on annotation
  - Transformer decoder on motion (in progress)
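Several of the "X 2 label" steps above share one pattern: a feature extractor (a Transformer encoder over annotation tokens, or a ResNet over a motion rendered as a 2D array) followed by a small classification head. Below is a minimal Swift for TensorFlow sketch of that pattern, not the project's actual code; the `FeatureClassifier` type, the mean-pooling step and the commented-out training step are illustrative assumptions.

```swift
import TensorFlow

/// Illustrative sketch (not the project's code): a generic "feature extractor +
/// classifier head" model, as used in the language2label and motion2label steps.
/// The extractor is assumed to return per-timestep features [batch, time, featureSize].
struct FeatureClassifier<Extractor: Layer>: Layer
where Extractor.Input == Tensor<Float>, Extractor.Output == Tensor<Float> {
    var extractor: Extractor
    var head: Dense<Float>

    init(extractor: Extractor, featureSize: Int, classCount: Int) {
        self.extractor = extractor
        self.head = Dense<Float>(inputSize: featureSize, outputSize: classCount)
    }

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let features = extractor(input)              // [batch, time, featureSize]
        let pooled = features.mean(squeezingAxes: 1) // mean-pool over the time axis
        return head(pooled)                          // class logits: [batch, classCount]
    }
}

// Hypothetical training step, assuming `model`, `optimizer` and a `batch` with
// `features` and integer `labels` are defined elsewhere:
//
//   let (loss, grad) = valueWithGradient(at: model) { model -> Tensor<Float> in
//       softmaxCrossEntropy(logits: model(batch.features), labels: batch.labels)
//   }
//   optimizer.update(&model, along: grad)
```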
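The seq2seq steps (language 2 language, motion 2 language, language 2 motion) all use an encoder-decoder Transformer, and at inference time the decoder generates the target sequence autoregressively. The toy loop below sketches greedy decoding only; the `step` closure stands in for a call into the real Transformer decoder and is an assumption, not the project's API.

```swift
import TensorFlow

/// Toy greedy decoder illustrating autoregressive generation in the seq2seq
/// steps (e.g. motion 2 language). `step` stands in for "run the Transformer
/// decoder over the target tokens generated so far (plus the encoded source)
/// and return logits over the vocabulary for the next token".
func greedyDecode(
    startToken: Int32,
    endToken: Int32,
    maxLength: Int,
    step: ([Int32]) -> Tensor<Float>
) -> [Int32] {
    var tokens: [Int32] = [startToken]
    for _ in 0..<maxLength {
        let logits = step(tokens)                                      // shape: [vocabularySize]
        let next: Int32 = logits.argmax(squeezingAxis: 0).scalarized() // most likely next token
        tokens.append(next)
        if next == endToken { break }                                  // stop at end-of-sentence
    }
    return tokens
}
```

For language 2 motion the loop would presumably emit continuous motion frames rather than vocabulary indices, which is the decoder part still marked as in progress above.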
Dataset files:

- original: 2017-06-22.zip
- processed:
- annotations and labels:
- vocabulary: