Merged
145 commits
2d53950
add VITS model
hollance Jun 6, 2023
a0160b1
let's vits
hollance Jun 7, 2023
03a9a6e
finish TextEncoder (mostly)
hollance Jun 12, 2023
3b4a42e
rename VITS to Vits
hollance Jun 12, 2023
f3d5db3
add StochasticDurationPredictor
hollance Jun 12, 2023
ac4f51a
ads flow model
hollance Jun 13, 2023
e235ec7
add generator
hollance Jun 13, 2023
da79eb0
correctly set vocab size
hollance Jun 13, 2023
4c3429c
add tokenizer
hollance Jun 13, 2023
19a4d1b
remove processor & feature extractor
hollance Jun 13, 2023
4e2d98c
add PosteriorEncoder
hollance Jun 14, 2023
fb1d546
add missing weights to SDP
hollance Jun 14, 2023
eef58ee
also convert LJSpeech and VCTK checkpoints
hollance Jun 14, 2023
d0669c8
add training stuff in forward
hollance Jun 14, 2023
8aa791b
add placeholder tests for tokenizer
hollance Jun 15, 2023
ea980d0
add placeholder tests for model
hollance Jun 15, 2023
a648dfb
starting cleanup
hollance Jun 15, 2023
c2a5478
let the great renaming begin!
hollance Jun 15, 2023
bba3c88
use config
hollance Jun 15, 2023
3dd078e
global_conditioning
hollance Jun 15, 2023
71bcb43
more cleaning
hollance Jun 15, 2023
72b2df2
renaming variables
hollance Jun 21, 2023
ba7cd6f
more renaming
hollance Jun 21, 2023
5d0577a
more renaming
hollance Jun 21, 2023
e8ebd23
it never ends
hollance Jun 21, 2023
2ad7b5e
reticulating the splines
hollance Jun 21, 2023
6bbd8a7
more renaming
hollance Jun 22, 2023
7fc673c
HiFi-GAN
hollance Jun 22, 2023
a04a905
doc strings for main model
hollance Jun 22, 2023
0afb73e
fixup
hollance Jun 22, 2023
fc3d765
fix-copies
hollance Jun 22, 2023
b979458
don't make it a PreTrainedModel
hollance Jun 22, 2023
1140cc4
fixup
hollance Jun 22, 2023
f5825f8
rename config options
hollance Jun 22, 2023
1054dc7
remove training logic from forward pass
hollance Jun 22, 2023
0cd0ff2
simplify relative position
hollance Jun 22, 2023
c67696e
use actual checkpoint
hollance Jun 22, 2023
fe71af5
style
hollance Jun 22, 2023
770800d
PR review fixes
hollance Jun 26, 2023
4a35af0
more review changes
hollance Jun 26, 2023
9c8d84c
fixup
hollance Jun 26, 2023
fd2bba0
more unit tests
hollance Jun 26, 2023
e6e747a
fixup
hollance Jun 26, 2023
11b20bc
fix doc test
hollance Jun 27, 2023
21c5052
add integration test
hollance Jun 27, 2023
455a46b
improve tokenizer tests
hollance Jun 27, 2023
65bba35
add tokenizer integration test
hollance Jun 27, 2023
3176439
fix tests on GPU (gave OOM)
hollance Jun 27, 2023
41e8f33
conversion script can handle repos from hub
hollance Jun 27, 2023
f10a1c4
add conversion script for all MMS-TTS checkpoints
hollance Jun 27, 2023
25b4cb9
automatically create a README for the converted checkpoint
hollance Jun 27, 2023
e784121
small changes to config
hollance Jun 27, 2023
c7220bf
push README to hub
hollance Jun 28, 2023
2f641d4
only show uroman note for checkpoints that need it
hollance Jun 28, 2023
c3fec01
remove conversion script because code formatting breaks the readme
hollance Jun 28, 2023
66ad4df
make WaveNet layers configurable
hollance Jun 28, 2023
be87976
rename variables
hollance Jun 28, 2023
da62c1f
simplifying the math
hollance Jun 28, 2023
e469155
output attentions and hidden states
hollance Jun 28, 2023
85c5a57
remove VitsFlip in flow model
hollance Jun 28, 2023
f0e3d8f
also got rid of the other flip
hollance Jun 28, 2023
8f0b214
fix tests
hollance Jun 28, 2023
15f01bc
rename more variables
hollance Jun 28, 2023
b64b42f
rename tokenizer, add phonemization
hollance Jun 29, 2023
dbe6cec
raise error when phonemizer missing
hollance Jun 29, 2023
96de12c
re-order config docstrings to match method
Jul 5, 2023
091925e
change config naming
Jul 5, 2023
111a20f
remove redundant str -> list
Jul 5, 2023
7c5df8b
fix copyright: vits authors -> kakao enterprise
Jul 5, 2023
9a32d2a
(mean, log_variances) -> (prior_mean, prior_log_variances)
Jul 5, 2023
3cfe9a6
if return dict -> if not return dict
Jul 5, 2023
407a5d2
speed -> speaking rate
Jul 5, 2023
1eb4297
Apply suggestions from code review
sanchit-gandhi Jul 5, 2023
f64bd01
update fused tanh sigmoid
Jul 5, 2023
c70d8c8
reduce dims in tester
Jul 5, 2023
7620928
audio -> output_values
Jul 5, 2023
2871209
audio -> output_values in tuple out
Jul 5, 2023
d974893
fix return type
Jul 5, 2023
57fc378
fix return type
Jul 5, 2023
71e8202
make _unconstrained_rational_quadratic_spline a function
Jul 5, 2023
fb64261
all nn's to accept a config
Jul 5, 2023
9f6a649
add spectro to output
Jul 5, 2023
b159a7d
move {speaking rate, noise scale, noise scale duration} to config
Jul 5, 2023
0095f7f
path -> attn_path
Jul 5, 2023
39765ed
idxs -> valid idxs -> padded idxs
Jul 5, 2023
788b71c
output values -> waveform
Jul 5, 2023
836a182
use config for attention
Jul 5, 2023
85d1f88
make generation work
Jul 10, 2023
57150cf
harden integration test
Jul 10, 2023
75a4cc2
add spectrogram to dict output
Jul 10, 2023
775337f
tokenizer refactor
Jul 10, 2023
2fef806
make style
Jul 10, 2023
35ea7d1
remove 'fake' padding token
Jul 10, 2023
d5b1f5a
harden tokenizer tests
Jul 10, 2023
38e901b
ron norm test
Jul 10, 2023
a4d8cf6
fprop / save tests deterministic
Jul 26, 2023
5089817
move uroman to tokenizer as much as possible
Jul 26, 2023
4885e0b
better logger message
Jul 26, 2023
c8ead9b
fix vivit imports
Jul 26, 2023
24b2743
add uroman integration test
Jul 26, 2023
2633355
make style
Jul 26, 2023
a6c8060
up
Jul 26, 2023
36ad9eb
matthijs -> sanchit-gandhi
Jul 26, 2023
dc6767d
fix tokenizer test
Jul 26, 2023
b65a3ec
make fix-copies
Jul 26, 2023
964ca32
fix dict comprehension
Jul 27, 2023
1b816ad
fix config tests
Jul 27, 2023
5608220
fix model tests
Jul 27, 2023
7465fb8
make outputs consistent with reverse/not reverse
Aug 18, 2023
2e13470
fix key concat
Aug 18, 2023
bfa3574
more model details
Aug 23, 2023
fda2632
add author
Aug 23, 2023
cfa52ce
return dict
Aug 23, 2023
38f2caa
speaker error
Aug 23, 2023
9cbd689
labels error
Aug 23, 2023
7c6805c
Apply suggestions from code review
sanchit-gandhi Aug 23, 2023
a2513d1
Update src/transformers/models/vits/convert_original_checkpoint.py
sanchit-gandhi Aug 23, 2023
e77c6b0
remove uromanize
Aug 23, 2023
c1561ff
add docstrings
Aug 23, 2023
0966943
add docstrings for tokenizer
Aug 23, 2023
46df4b1
upper-case skip messages
Aug 24, 2023
1a92ca7
fix return dict
Aug 24, 2023
e4a7303
style
Aug 24, 2023
1a1edbc
finish tests
Aug 24, 2023
36d3758
update checkpoints
Aug 24, 2023
9289fe4
make style
Aug 24, 2023
3f97286
remove doctest file
Aug 24, 2023
e6e80d0
revert
Aug 24, 2023
6c2ec8e
fix docstring
Aug 24, 2023
03ba786
fix tokenizer
Aug 24, 2023
d990056
remove uroman integration test
Aug 31, 2023
2e9238e
add sampling rate
Aug 31, 2023
72aa49c
fix docs / docstrings
Aug 31, 2023
054df38
style
Aug 31, 2023
ce52b98
Merge branch 'main' into vits
sanchit-gandhi Aug 31, 2023
acacd4b
add sr to model output
Aug 31, 2023
8f3e5eb
Merge remote-tracking branch 'origin/vits' into vits
Aug 31, 2023
6b63f1a
fix outputs
Aug 31, 2023
df374f8
style / copies
Aug 31, 2023
6a36784
fix docstring
Aug 31, 2023
d2414e7
fix copies
Aug 31, 2023
54261a6
remove sr from model outputs
Aug 31, 2023
ff3b08c
Update utils/documentation_tests.txt
sanchit-gandhi Aug 31, 2023
8b01633
add sr as allowed attr
Sep 1, 2023
5004f42
Merge remote-tracking branch 'origin/vits' into vits
Sep 1, 2023
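The commit history above assembles the main VITS building blocks (text encoder, stochastic duration predictor, normalizing flow, posterior encoder, HiFi-GAN generator, tokenizer) and adds a conversion script for the MMS-TTS checkpoints. For orientation only, a minimal inference sketch with the resulting classes might look as follows; the checkpoint name `facebook/mms-tts-eng` and the `config.sampling_rate` attribute are assumptions inferred from the commit messages, not code taken from this diff.

```python
# Sketch only: assumes the VitsTokenizer / VitsModel classes added in this PR
# and an MMS-TTS English checkpoint produced by the conversion script.
import torch
from transformers import VitsModel, VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")  # assumed checkpoint name
model = VitsModel.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer(text="Hello from VITS", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

waveform = outputs.waveform[0]               # mono audio as a 1-D float tensor
sampling_rate = model.config.sampling_rate   # e.g. 16 kHz for the MMS-TTS checkpoints
```

As the uroman-related commits suggest, checkpoints for scripts the tokenizer cannot phonemize directly may additionally require romanizing the input text with `uroman` before tokenization.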
1 change: 1 addition & 0 deletions README.md
@@ -489,6 +489,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1 change: 1 addition & 0 deletions README_es.md
@@ -466,6 +466,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1 change: 1 addition & 0 deletions README_hd.md
@@ -438,6 +438,7 @@ conda install -c huggingface transformers
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI से) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. द्वाराअनुसंधान पत्र [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) के साथ जारी किया गया
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/ एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा।
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv. org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा।
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise से) Jaehyeon Kim, Jungil Kong, Juhee Son. द्वाराअनुसंधान पत्र [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) के साथ जारी किया गया
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (फेसबुक एआई से) साथ में पेपर [wav2vec 2.0: ए फ्रेमवर्क फॉर सेल्फ-सुपरवाइज्ड लर्निंग ऑफ स्पीच रिप्रेजेंटेशन] (https://arxiv.org/abs/2006.11477) एलेक्सी बेवस्की, हेनरी झोउ, अब्देलरहमान मोहम्मद, माइकल औली द्वारा।
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI से) साथ वाला पेपर [FAIRSEQ S2T: FAIRSEQ के साथ फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग ](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, सरव्या पोपुरी, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया।
1 change: 1 addition & 0 deletions README_ja.md
@@ -500,6 +500,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI から) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. から公開された研究論文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI から) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick から公開された研究論文: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI から) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas から公開された研究論文: [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141)
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise から) Jaehyeon Kim, Jungil Kong, Juhee Son. から公開された研究論文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI から) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli から公開された研究論文: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI から) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino から公開された研究論文: [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171)
1 change: 1 addition & 0 deletions README_ko.md
@@ -415,6 +415,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI 에서 제공)은 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.의 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)논문과 함께 발표했습니다.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI 에서) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 의 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 논문과 함께 발표했습니다.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI 에서) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 의 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 논문과 함께 발표했습니다.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise 에서 제공)은 Jaehyeon Kim, Jungil Kong, Juhee Son.의 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)논문과 함께 발표했습니다.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI 에서) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 의 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 논문과 함께 발표했습니다.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI 에서) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 의 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 논문과 함께 발표했습니다.
1 change: 1 addition & 0 deletions README_zh-hans.md
@@ -439,6 +439,7 @@ conda install -c huggingface transformers
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (来自 Meta AI) 伴随论文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 由 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He 发布。
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (来自 Meta AI) 伴随论文 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 发布.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (来自 Kakao Enterprise) 伴随论文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) 由 Jaehyeon Kim, Jungil Kong, Juhee Son 发布。
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (来自 Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) 由 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (来自 Facebook AI) 伴随论文 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 由 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 发布。
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。
1 change: 1 addition & 0 deletions README_zh-hant.md
@@ -451,6 +451,7 @@ conda install -c huggingface transformers
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -603,6 +603,8 @@
title: UniSpeech
- local: model_doc/unispeech-sat
title: UniSpeech-SAT
- local: model_doc/vits
title: VITS
- local: model_doc/wav2vec2
title: Wav2Vec2
- local: model_doc/wav2vec2-conformer
2 changes: 2 additions & 0 deletions docs/source/en/index.md
@@ -255,6 +255,7 @@ The documentation is organized into five sections:
1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
@@ -475,6 +476,7 @@ Flax), PyTorch, and/or TensorFlow.
| VitDet | ✅ | ❌ | ❌ |
| ViTMAE | ✅ | ✅ | ❌ |
| ViTMSN | ✅ | ❌ | ❌ |
| VITS | ✅ | ❌ | ❌ |
| ViViT | ✅ | ❌ | ❌ |
| Wav2Vec2 | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ✅ | ❌ | ❌ |