Commit 0597290

update reference links
1 parent 35ae069 commit 0597290

2 files changed: +26 -15 lines changed


src/index.md

Lines changed: 17 additions & 11 deletions
@@ -538,7 +538,10 @@ Though some previous works have referred to this as "sign language translation,"
 without handling the syntax and morphology of the signed language [@padden1988interaction] to create a spoken language output.
 Instead, SLR has often been used as an intermediate step during translation to produce glosses from signed language videos.
 
-@jiang2021sign propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction. @jiao2023cosign explore co-occurence signals in skeleton data to better exploit the knowledge of each signal for continuous SLR. Specifically, they use Group-specific GCN to abstract skeleton features from co-occurence signals (Body, Hand, Mouth and Hand) and introduce complementary regularization to ensure consistency between predictions based on two complementary subsets of signals. Additionally, they propose a two-stream framework to fuse static and dynamic information. The model demonstrates competitive performance cpmpared to video-to-gloss methods on the RWTH-PHOENIX-Weather-2014 [@koller2015ContinuousSLR], RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets.
+@jiang2021sign propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction.
+@jiao2023cosign explore co-occurrence signals in skeleton data to better exploit the knowledge of each signal for continuous SLR. Specifically, they use a Group-specific GCN to abstract skeleton features from co-occurrence signals (Body, Hand, Mouth and Hand) and introduce complementary regularization to ensure consistency between predictions based on two complementary subsets of signals.
+Additionally, they propose a two-stream framework to fuse static and dynamic information.
+The model demonstrates competitive performance compared to video-to-gloss methods on the RWTH-PHOENIX-Weather-2014 [@koller2015ContinuousSLR], RWTH-PHOENIX-Weather-2014T [@cihan2018neural], and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets.
 
 @dafnis2022bidirectional work on the same modified WLASL dataset as @jiang2021sign, but do not require multimodal data input. Instead, they propose a bidirectional skeleton-based graph convolutional network framework with linguistically motivated parameters and attention to the start and end frames of signs. They cooperatively use forward and backward data streams, including various sub-streams, as input. They also use pre-training to leverage transfer learning.
 
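The GEM described in the added paragraph is, at its core, a late-fusion ensemble over per-modality predictions. Below is a minimal PyTorch sketch of that idea, assuming each branch (skeleton, RGB, depth) already emits class logits; the learnable softmax weights and all shapes are illustrative assumptions, not the SAM-SLR-v2 implementation.

```python
import torch
import torch.nn as nn

class LateFusionEnsemble(nn.Module):
    """Weighted late fusion of per-modality class logits (GEM-style sketch)."""
    def __init__(self, num_modalities: int):
        super().__init__()
        # One learnable scalar per modality, softmax-normalized at fusion time.
        self.weights = nn.Parameter(torch.ones(num_modalities))

    def forward(self, logits_per_modality: list) -> torch.Tensor:
        # Each entry: (batch, num_classes) logits from one branch,
        # e.g. skeleton (SL-GCN), RGB, and depth.
        stacked = torch.stack(logits_per_modality, dim=0)      # (M, B, C)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1)  # (M, 1, 1)
        return (w * stacked).sum(dim=0)                        # (B, C)

# Hypothetical usage: three modality branches, 2000 sign classes.
fusion = LateFusionEnsemble(num_modalities=3)
fused_logits = fusion([torch.randn(8, 2000) for _ in range(3)])
```

Fusing logits rather than intermediate features keeps each branch independently trainable, which is what makes the ensemble "late".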
@@ -586,10 +589,10 @@ For this recognition, @cui2017recurrent constructs a three-step optimization mod
 First, they train a video-to-gloss end-to-end model, where they encode the video using a spatio-temporal CNN encoder
 and predict the gloss using a Connectionist Temporal Classification (CTC) [@graves2006connectionist].
 Then, from the CTC alignment and category proposal, they encode each gloss-level segment independently, trained to predict the gloss category,
-and use this gloss video segments encoding to optimize the sequence learning model. @cheng2020fully propose a fully convolutional networks for continuous SLR,
-moving away from LSTM-based methods to achieve end-to-end learning. They introduce a gloss feature enhancement (GFE) module to provide additional rectified supervision and
-accelerate the training process. @min2021visual attribute the success of iterative training to its ability to reduce overfitting. They propose visual enhancement
-constraint (VEC) and visual alignment constraint (VAC) to strengthen the visual extractor and align long- and short-term predictions, enabling LSTM-based methods to be trained in an end-to-end manner.
+and use this encoding of the gloss video segments to optimize the sequence learning model.
+@cheng2020fully propose a fully convolutional network for continuous SLR, moving away from LSTM-based methods to achieve end-to-end learning.
+They introduce a Gloss Feature Enhancement (GFE) module to provide additional rectified supervision and accelerate the training process.
+@min2021visual attribute the success of iterative training to its ability to reduce overfitting. They propose a Visual Enhancement Constraint (VEC) and a Visual Alignment Constraint (VAC) to strengthen the visual extractor and align long- and short-term predictions, enabling LSTM-based methods to be trained in an end-to-end manner.
 They provide a [code implementation](https://github.com/VIPL-SLP/VAC_CSLR).
 
 @cihan2018neural fundamentally differ from that approach and formulate this problem as if it is a natural-language translation problem.
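Several systems in this hunk supervise the video-to-gloss stage with CTC [@graves2006connectionist], which marginalizes over all monotonic frame-to-gloss alignments and therefore needs no frame-level labels. A minimal PyTorch sketch with placeholder shapes, where the random tensor stands in for any of the encoders discussed:

```python
import torch
import torch.nn as nn

# Toy shapes: T frames per clip, B clips, V gloss classes (+ CTC blank at index 0),
# S glosses per clip. All values are illustrative.
T, B, V, S = 120, 4, 1000, 8

# Stand-in for per-frame encoder output; CTC expects log-probs of shape (T, B, V).
frame_logits = torch.randn(T, B, V, requires_grad=True)
log_probs = frame_logits.log_softmax(dim=-1)

targets = torch.randint(1, V, (B, S))                    # gloss IDs, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)    # frames per clip
target_lengths = torch.full((B,), S, dtype=torch.long)   # glosses per clip

# The loss sums over all valid frame-to-gloss alignments via dynamic programming.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```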
@@ -745,9 +748,10 @@ The model features shared representations for different modalities such as text
 on several tasks such as video-to-gloss, gloss-to-text, and video-to-text.
 The approach allows leveraging external data such as parallel data for spoken language machine translation.
 
-@zhou2023gloss propose the GFSLT-VLP framework for gloss-free sign language translation, which improves SLT performance through visual-alignment pretraining. In the pretraining stage, they design a pretext task that aligns visual and textual
-representations within a joint multimodal semantic space, enabling the Visual Encoder to learn language-indicated visual representations. Additionally, they incorporate masked self-supervised learning into the pre-training
-process to help the text decoder capture the syntactic and semantic properties of sign language sentences more effectively. The approach achieves competitive results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets. They provide a [code implementation](https://github.com/zhoubenjia/GFSLT-VLP).
+@zhou2023gloss propose Gloss-Free Sign Language Translation with Visual-Language Pretraining (GFSLT-VLP), which improves SLT performance through visual-alignment pretraining.
+In the pretraining stage, they design a pretext task that aligns visual and textual representations within a joint multimodal semantic space, enabling the Visual Encoder to learn language-indicated visual representations.
+Additionally, they incorporate masked self-supervised learning into the pre-training process to help the text decoder capture the syntactic and semantic properties of sign language sentences more effectively.
+The approach achieves competitive results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets. They provide a [code implementation](https://github.com/zhoubenjia/GFSLT-VLP).
 
 @Zhao_Zhang_Fu_Hu_Su_Chen_2024 introduce CV-SLT, employing conditional variational autoencoders to address the modality gap between video and text.
 Their approach involves guiding the model to encode visual and textual data similarly through two paths: one with visual data alone and one with both modalities.
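The pretext task added for GFSLT-VLP aligns visual and textual representations in a joint space; a common way to realize such alignment is a CLIP-style symmetric contrastive loss. The sketch below assumes paired video/text embeddings of equal dimension and is only an approximation of the paper's objective, not its actual code:

```python
import torch
import torch.nn.functional as F

def alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings."""
    v = F.normalize(video_emb, dim=-1)   # (B, D), e.g. from the Visual Encoder
    t = F.normalize(text_emb, dim=-1)    # (B, D), e.g. from a text encoder
    logits = v @ t.T / temperature       # (B, B) pairwise cosine similarities
    labels = torch.arange(v.size(0), device=v.device)
    # Matched pairs lie on the diagonal: pull them together, push the rest apart.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss = alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
```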
@@ -798,9 +802,11 @@ and showed similar performance, with the transformer underperforming on the vali
 They experimented with various normalization schemes, mainly subtracting the mean and dividing by the standard deviation of every individual keypoint
 either concerning the entire frame or the relevant "object" (Body, Face, and Hand).
 
-@jiao2024visual propose a visual alignment pre-training framework for gloss-free sign language translation. Specifically, they adopt Cosign-1s [@jiao2023cosign] to obtain skeleton features from estimated pose sequences
-and a pretrained text encoder to obtain corresponding textual features. During pretraining, these visual and textual features are aligned in a greedy manner. In the finetuning stage, they replace the shallow translation module
-used in pretraining with a pretrained translation module. This skeleton-based approach achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural], CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily], OpenASL [@shi-etal-2022-open], and How2Sign[@dataset:duarte2020how2sign] datasets without relying on gloss annotations.
+@jiao2024visual propose a visual alignment pre-training framework for gloss-free sign language translation.
+Specifically, they adopt CoSign-1s [@jiao2023cosign] to obtain skeleton features from estimated pose sequences and a pretrained text encoder to obtain corresponding textual features.
+During pretraining, these visual and textual features are aligned in a greedy manner.
+In the finetuning stage, they replace the shallow translation module used in pretraining with a pretrained translation module.
+This skeleton-based approach achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural], CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily], OpenASL [@shi-etal-2022-open], and How2Sign [@dataset:duarte2020how2sign] datasets without relying on gloss annotations.
 
 #### Text-to-Pose
 Text-to-Pose, also known as sign language production, is the task of producing a sequence of poses that adequately represent
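The normalization scheme mentioned in this hunk's context lines (z-normalizing every keypoint with statistics taken from the whole frame or from its "object") can be written compactly. In the sketch below, the group index ranges are hypothetical and pooling the statistics over time is one plausible reading of the scheme, not the authors' exact procedure:

```python
import numpy as np

def normalize_keypoints(pose, groups):
    """Z-normalize keypoints per 'object' group (Body/Face/Hand), pooling over time."""
    out = pose.copy()                    # (T, K, 2): frames x keypoints x (x, y)
    for idx in groups.values():
        part = pose[:, idx, :]           # all frames for this group's keypoints
        mean = part.mean(axis=(0, 1), keepdims=True)
        std = part.std(axis=(0, 1), keepdims=True) + 1e-6  # guard against zero variance
        out[:, idx, :] = (part - mean) / std
    return out

# Hypothetical keypoint layout; real index ranges depend on the pose estimator used.
groups = {"body": slice(0, 25), "face": slice(25, 95), "hands": slice(95, 137)}
normalized = normalize_keypoints(np.random.randn(120, 137, 2), groups)
```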

src/references.bib

Lines changed: 9 additions & 4 deletions
@@ -1551,7 +1551,8 @@ @inproceedings{jiao2023cosign
 author={Jiao, Peiqi and Min, Yuecong and Li, Yanan and Wang, Xiaotao and Lei, Lei and Chen, Xilin},
 booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
 pages={20676--20686},
-year={2023}
+year={2023},
+url={https://openaccess.thecvf.com/content/ICCV2023/html/Jiao_CoSign_Exploring_Co-occurrence_Signals_in_Skeleton-based_Continuous_Sign_Language_Recognition_ICCV_2023_paper.html}
 }
 
 @inproceedings{dafnis2022bidirectional,
@@ -1640,15 +1641,17 @@ @inproceedings{cheng2020fully
 booktitle={Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXIV 16},
 pages={697--714},
 year={2020},
-organization={Springer}
+organization={Springer},
+url={https://www.ecva.net/papers/eccv_2020/papers_ECCV/html/4763_ECCV_2020_paper.php}
 }
 
 @inproceedings{min2021visual,
 title={Visual alignment constraint for continuous sign language recognition},
 author={Min, Yuecong and Hao, Aiming and Chai, Xiujuan and Chen, Xilin},
 booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
 pages={11542--11551},
-year={2021}
+year={2021},
+url={https://openaccess.thecvf.com/content/ICCV2021/html/Min_Visual_Alignment_Constraint_for_Continuous_Sign_Language_Recognition_ICCV_2021_paper.html}
 }
 
 @article{carreira2017quo,
@@ -3074,7 +3077,8 @@ @inproceedings{zhou2023gloss
 author={Zhou, Benjia and Chen, Zhigang and Clap{\'e}s, Albert and Wan, Jun and Liang, Yanyan and Escalera, Sergio and Lei, Zhen and Zhang, Du},
 booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
 pages={20871--20881},
-year={2023}
+year={2023},
+url={https://openaccess.thecvf.com/content/ICCV2023/html/Zhou_Gloss-Free_Sign_Language_Translation_Improving_from_Visual-Language_Pretraining_ICCV_2023_paper.html}
 }
 
 @inproceedings{jiao2024visual,
@inproceedings{jiao2024visual,
@@ -3083,6 +3087,7 @@ @inproceedings{jiao2024visual
30833087
booktitle={European Conference on Computer Vision},
30843088
pages={349--367},
30853089
year={2024},
3090+
url = {https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/5894_ECCV_2024_paper.php},
30863091
organization={Springer}
30873092
}
30883093
