
Commit a646fd5

Updated CamemBERT model card to new standardized format (#39227)
* Updated CamemBERT model card to new standardized format
* Applied review suggestions for CamemBERT: restored API refs, added examples, badges, and attribution
* Updated CamemBERT usage examples, quantization, badges, and format
* Updated CamemBERT badges
* Fixed CLI section
1 parent af74ec6 commit a646fd5

File tree

1 file changed: +88 -33 lines changed


docs/source/en/model_doc/camembert.md

Lines changed: 88 additions & 33 deletions
@@ -14,49 +14,105 @@ rendered properly in your Markdown viewer.
 
 -->
 
+<div style="float: right;">
+<div class="flex flex-wrap space-x-1">
+<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
+<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+</div>
+</div>
+
 # CamemBERT
 
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+[CamemBERT](https://huggingface.co/papers/1911.03894) is a language model based on [RoBERTa](./roberta), but trained specifically on French text from the OSCAR dataset, making it more effective for French language tasks.
+
+What sets CamemBERT apart is that it was trained on a large, high-quality corpus of French text rather than on a mix of many languages, giving it a stronger command of French than most multilingual models.
+
+Common applications of CamemBERT include masked language modeling (fill-mask prediction), text classification (sentiment analysis), token classification (named entity recognition), and sentence pair classification (entailment tasks).
+
+You can find all the original CamemBERT checkpoints under the [ALMAnaCH](https://huggingface.co/almanach/models?search=camembert) organization.
+
+> [!TIP]
+> This model was contributed by the [ALMAnaCH (Inria)](https://huggingface.co/almanach) team.
+>
+> Click on the CamemBERT models in the right sidebar for more examples of how to apply CamemBERT to different NLP tasks.
+
+The examples below demonstrate how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
+
+<hfoptions id="usage">
+
+<hfoption id="Pipeline">
+
+```python
+import torch
+from transformers import pipeline
 
-## Overview
+pipeline = pipeline("fill-mask", model="camembert-base", torch_dtype=torch.float16, device=0)
+pipeline("Le camembert est un délicieux fromage <mask>.")
+```
+</hfoption>
 
-The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://huggingface.co/papers/1911.03894) by
-[Louis Martin](https://huggingface.co/louismartin), [Benjamin Muller](https://huggingface.co/benjamin-mlr), [Pedro Javier Ortiz Suárez](https://huggingface.co/pjox), Yoann Dupont, Laurent Romary, Éric Villemonte de la
-Clergerie, [Djamé Seddah](https://huggingface.co/Djame), and [Benoît Sagot](https://huggingface.co/sagot). It is based on Facebook's RoBERTa model released in 2019. It is a model
-trained on 138GB of French text.
+<hfoption id="AutoModel">
 
-The abstract from the paper is the following:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
 
-*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
-models have either been trained on English data or on the concatenation of data in multiple languages. This makes
-practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
-we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
-performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
-dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
-for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
-downstream applications for French NLP.*
+tokenizer = AutoTokenizer.from_pretrained("camembert-base")
+model = AutoModelForMaskedLM.from_pretrained("camembert-base", torch_dtype="auto", device_map="auto", attn_implementation="sdpa")
+inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to("cuda")
 
-This model was contributed by [the ALMAnaCH team (Inria)](https://huggingface.co/almanach). The original code can be found [here](https://camembert-model.fr/).
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
 
-<Tip>
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
 
-This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well
-as the information relative to the inputs and outputs.
+print(f"The predicted token is: {predicted_token}")
+```
+</hfoption>
 
-</Tip>
+<hfoption id="transformers CLI">
 
-## Resources
+```bash
+echo -e "Le camembert est un délicieux fromage <mask>." | transformers run --task fill-mask --model camembert-base --device 0
+```
 
-- [Text classification task guide](../tasks/sequence_classification)
-- [Token classification task guide](../tasks/token_classification)
-- [Question answering task guide](../tasks/question_answering)
-- [Causal language modeling task guide](../tasks/language_modeling)
-- [Masked language modeling task guide](../tasks/masked_language_modeling)
-- [Multiple choice task guide](../tasks/multiple_choice)
+</hfoption>
+
+</hfoptions>
+
+
+Quantization reduces the memory burden of large models by representing weights in lower precision. Refer to the [Quantization](../quantization/overview) overview for available options.
+
+The example below uses [bitsandbytes](../quantization/bitsandbytes) quantization to quantize the weights to 8-bits.
+
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM, BitsAndBytesConfig
+import torch
+
+quant_config = BitsAndBytesConfig(load_in_8bit=True)
+model = AutoModelForMaskedLM.from_pretrained(
+    "almanach/camembert-large",
+    quantization_config=quant_config,
+    device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-large")
+
+inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+
+print(f"The predicted token is: {predicted_token}")
+```
 
 ## CamembertConfig
 
@@ -137,5 +193,4 @@ as the information relative to the inputs and outputs.
 [[autodoc]] TFCamembertForQuestionAnswering
 
 </tf>
-</frameworkcontent>
-
+</frameworkcontent>
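
The updated card lists token classification among CamemBERT's common applications but only works through the fill-mask task. As a companion to the card's examples, here is a minimal sketch of that use case with [`Pipeline`]; it assumes a French NER checkpoint fine-tuned from CamemBERT, and the checkpoint name below is illustrative rather than part of the commit.

```python
from transformers import pipeline

# Assumed CamemBERT-based French NER checkpoint; substitute any
# token-classification model fine-tuned from camembert-base.
ner = pipeline(
    "token-classification",
    model="Jean-Baptiste/camembert-ner",
    aggregation_strategy="simple",  # merge word-piece tokens into whole entities
)
print(ner("CamemBERT est un modèle développé par Inria à Paris."))
```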
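The card's [`AutoModel`] example decodes only the single best candidate for the masked position via `argmax`. A minimal CPU-only sketch of the same flow that instead inspects the top five candidates with `torch.topk`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForMaskedLM.from_pretrained("camembert-base", torch_dtype="auto")

inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the <mask> position, as in the card's example.
masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
# Top-5 candidate token ids for the masked position instead of the single argmax.
top5 = torch.topk(logits[0, masked_index], k=5, dim=-1).indices[0]
print([tokenizer.decode(token_id) for token_id in top5])
```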
