
Commit ef41998

souvikchand and stevhliu committed
Update docs/source/en/model_doc/albert.md
removed extra notes

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 880b962 commit ef41998

File tree

1 file changed: +0 -5 lines changed

docs/source/en/model_doc/albert.md

Lines changed: 0 additions & 5 deletions
@@ -108,11 +108,6 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
 
 - Inputs should be padded on the right because BERT uses absolute position embeddings.
 - The embedding size `E` is different from the hidden size `H` because the embeddings are context independent (one embedding vector represents one token) and the hidden states are context dependent (one hidden state represents a sequence of tokens). The embedding matrix is also larger because `V x E` where `V` is the vocabulary size. As a result, it's more logical if `H >> E`. If `E < H`, the model has less parameters.
-- ALBERT supports a maximum sequence length of 512 tokens.
-- Cannot be used for autoregressive generation (unlike GPT)
-- ALBERT requires absolute positional embeddings, and it expects right-padding (i.e., pad tokens should be added at the end, not the beginning).
-- ALBERT uses token_type_ids, just like BERT. So you should indicate which token belongs to which segment (e.g., sentence A vs. sentence B) when doing tasks like question answering or sentence-pair classification.
-- ALBERT uses a different pretraining objective called Sentence Order Prediction (SOP) instead of Next Sentence Prediction (NSP), so fine-tuned models might behave slightly differently from BERT when modeling inter-sentence relationships.
 
 
 ## Resources
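
The removed notes on right-padding, `token_type_ids`, and the 512-token limit describe behavior that is easy to verify directly. Below is a minimal sketch, assuming the public `albert-base-v2` checkpoint (an assumption for illustration, not something named in this commit):

```python
# Minimal sketch of the behavior described in the removed notes: ALBERT pads on
# the right by default, marks segments with token_type_ids, and is limited to
# 512 tokens. Uses the public albert-base-v2 checkpoint as an assumed example.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModel.from_pretrained("albert-base-v2")

# Encoding a sentence pair fills token_type_ids (0 = sentence A, 1 = sentence B)
# and adds pad tokens at the end of the sequence, matching right-padding.
inputs = tokenizer(
    "Plants create energy through photosynthesis.",
    "They absorb light with chlorophyll.",
    padding="max_length",
    max_length=32,        # any value up to the 512-token maximum
    truncation=True,
    return_tensors="pt",
)

print(tokenizer.padding_side)         # "right"
print(inputs["token_type_ids"][0])    # 0s for sentence A, 1s for sentence B, 0s for padding
print(model(**inputs).last_hidden_state.shape)  # torch.Size([1, 32, 768])
```

Padding on the left instead would assign the absolute position embeddings to pad tokens, which is what the padding note warns about.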
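The note contrasting the embedding size `E` with the hidden size `H` is ultimately a parameter-count argument. A quick back-of-the-envelope check makes it concrete; the sizes below mirror albert-base-v2 and are assumptions for illustration, not values taken from this commit:

```python
# Rough parameter count for the input embeddings, illustrating why ALBERT
# factorizes them: V, E, H below mirror albert-base-v2 and are assumed values.
V, E, H = 30_000, 128, 768

bert_style = V * H              # single V x H embedding matrix
albert_style = V * E + E * H    # factorized: V x E lookup, then E x H projection

print(f"V x H         : {bert_style:,} parameters")            # 23,040,000
print(f"V x E + E x H : {albert_style:,} parameters")           # 3,938,304
print(f"reduction     : {bert_style / albert_style:.1f}x")      # ~5.9x
```

With `E < H`, only the small `E x H` projection grows with the hidden size while the vocabulary-sized matrix stays at `V x E`, which is why the note argues it is more logical to have `H >> E`.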
