Add DeBERTa model (huggingface#5929)
* Add DeBERTa model
* Remove dependency of deberta
* Address comments
* Patch DeBERTa Documentation Style
* Add final tests
* Style
* Enable tests + nitpicks
* position IDs
* BERT -> DeBERTa
* Quality
* Style
* Tokenization
* Last updates.
* @patrickvonplaten's comments
* Not everything can be a copy
* Apply most of @sgugger's review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Last reviews
* DeBERTa -> Deberta

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
1 parent 44a93c9, commit 7a0cf0e
Showing 16 changed files with 2,350 additions and 3 deletions.
@@ -0,0 +1,62 @@

DeBERTa
----------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~

The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__
by Pengcheng He, Xiaodong Liu, Jianfeng Gao and Weizhu Chen.
It is based on Google's BERT model released in 2018 and Facebook's RoBERTa model released in 2019.

It builds on RoBERTa with disentangled attention and an enhanced mask decoder, and it is trained with half of the data used for RoBERTa.

The abstract from the paper is the following:

*Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks.
In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa
models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode
its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and
relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining.
We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks. Compared to
RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements
on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained
models will be made publicly available at https://github.com/microsoft/DeBERTa.*
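
The disentangled attention score described in the abstract can be sketched in a few lines. The snippet below is a simplified, single-head illustration of the formula from the paper, *not* the implementation used inside the library; the function name, the weight tensors and the maximum relative distance ``k`` are made up for the example.

.. code-block:: python

    import torch

    def disentangled_attention_scores(hidden, rel_pos_emb, w_qc, w_kc, w_qr, w_kr, k=4):
        """Single-head sketch of the disentangled attention score.

        hidden:      (seq_len, d)  content vectors H
        rel_pos_emb: (2*k, d)      relative position embeddings P
        Each score sums a content-to-content, a content-to-position and a
        position-to-content term, scaled by 1/sqrt(3*d).
        """
        seq_len, d = hidden.shape
        q_c, k_c = hidden @ w_qc, hidden @ w_kc            # content projections
        q_r, k_r = rel_pos_emb @ w_qr, rel_pos_emb @ w_kr  # position projections

        # Relative distance bucket delta(i, j), clipped into [0, 2k)
        idx = torch.arange(seq_len)
        delta = (idx[:, None] - idx[None, :]).clamp(-k, k - 1) + k

        c2c = q_c @ k_c.T                            # content-to-content
        c2p = torch.gather(q_c @ k_r.T, 1, delta)    # content-to-position
        p2c = torch.gather(k_c @ q_r.T, 1, delta).T  # position-to-content

        return (c2c + c2p + p2c) / (3 * d) ** 0.5

The library implementation additionally handles multiple attention heads, attention masking and dropout; the sketch only shows how the three score terms are combined.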

The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
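
A minimal usage sketch is shown below; the ``microsoft/deberta-base`` checkpoint identifier is used purely for illustration.

.. code-block:: python

    import torch
    from transformers import DebertaModel, DebertaTokenizer

    # Checkpoint name assumed for illustration
    tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
    model = DebertaModel.from_pretrained("microsoft/deberta-base")

    inputs = tokenizer("DeBERTa uses disentangled attention.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    sequence_output = outputs[0]  # (batch_size, sequence_length, hidden_size)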

DebertaConfig
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaConfig
    :members:


DebertaTokenizer
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
              create_token_type_ids_from_sequences, save_vocabulary


DebertaModel
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaModel
    :members:


DebertaPreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaPreTrainedModel
    :members:


DebertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaForSequenceClassification
    :members:
@@ -0,0 +1,36 @@

---
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
license: mit
---

## DeBERTa: Decoding-enhanced BERT with Disentangled Attention

[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.

Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.

#### Fine-tuning on NLU tasks

We present the dev results on the SQuAD 1.1/2.0 and MNLI tasks.

| Model            | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI-m |
|------------------|-------------------|-------------------|--------|
| RoBERTa-base     | 91.5/84.6         | 83.7/80.5         | 87.6   |
| XLNet-Large      | -/-               | -/80.2            | 86.8   |
| **DeBERTa-base** | 93.1/87.2         | 86.2/83.1         | 88.8   |
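
As a rough illustration of how such a fine-tuning run can be set up with the classes added in this commit, here is a minimal sketch. The `microsoft/deberta-base` checkpoint name, the example sentence pair and the three-way label mapping are assumptions of the sketch, not part of the reported setup; the numbers in the table come from full fine-tuning runs.

```python
import torch
from transformers import DebertaForSequenceClassification, DebertaTokenizer

# Checkpoint name and the 3-way MNLI label space are assumptions of this sketch
tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
model = DebertaForSequenceClassification.from_pretrained("microsoft/deberta-base", num_labels=3)
model.train()

# A single premise/hypothesis pair encoded as one sequence pair
batch = tokenizer(
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
    return_tensors="pt",
)
labels = torch.tensor([0])  # 0 = entailment in this sketch's label mapping

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
loss = outputs[0]  # the loss is the first element of the returned outputs
loss.backward()
optimizer.step()
```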

### Citation

If you find DeBERTa useful for your work, please cite the following paper:

```latex
@misc{he2020deberta,
    title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
    year={2020},
    eprint={2006.03654},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
@@ -0,0 +1,37 @@

---
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
license: mit
---

## DeBERTa: Decoding-enhanced BERT with Disentangled Attention

[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.

Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.

#### Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

| Model             | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI-m | SST-2 | QNLI | CoLA | RTE  | MRPC | QQP  | STS-B |
|-------------------|-------------------|-------------------|--------|-------|------|------|------|------|------|-------|
| BERT-Large        | 90.9/84.1         | 81.8/79.0         | 86.6   | 93.2  | 92.3 | 60.6 | 70.4 | 88.0 | 91.3 | 90.0  |
| RoBERTa-Large     | 94.6/88.9         | 89.4/86.5         | 90.2   | 96.4  | 93.9 | 68.0 | 86.6 | 90.9 | 92.2 | 92.4  |
| XLNet-Large       | 95.1/89.7         | 90.6/87.9         | 90.8   | 97.0  | 94.9 | 69.0 | 85.9 | 90.8 | 92.3 | 92.5  |
| **DeBERTa-Large** | 95.5/90.1         | 90.7/88.0         | 91.1   | 96.5  | 95.3 | 69.5 | 88.1 | 92.5 | 92.3 | 92.5  |
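
A minimal inference sketch with the sequence-classification head is shown below. The `microsoft/deberta-large` checkpoint name and the `num_labels=3` setting are illustrative assumptions; a real run would load a task-specific fine-tuned head rather than a freshly initialized one.

```python
import torch
from transformers import DebertaForSequenceClassification, DebertaTokenizer

# Checkpoint name assumed; in practice load a fine-tuned classification head
tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-large")
model = DebertaForSequenceClassification.from_pretrained("microsoft/deberta-large", num_labels=3)
model.eval()

batch = tokenizer(
    "The new movie is a complete waste of time.",
    "The film is not worth watching.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch)[0]          # (1, num_labels)
probs = torch.softmax(logits, dim=-1)   # untrained head, so these are not meaningful yet
```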

### Citation

If you find DeBERTa useful for your work, please cite the following paper:

```latex
@misc{he2020deberta,
    title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
    year={2020},
    eprint={2006.03654},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```