Language Models

Repository of pre-trained Language Models.

WARNING: a Bidirectional LM model using the MultiFiT configuration is a good model to perform text classification but with only 46 millions of parameters, it is far from being a LM that can compete with GPT-2 or BERT in NLP tasks like text generation. This my next step ;-)

Note: The training times shown in the tables on this page are the sum of the creation time of Fastai Databunch (forward and backward) and the training duration of the bidirectional model over 10 periods. The download time of the Wikipedia corpus and its preparation time are not counted.

Portuguese

I trained 1 Portuguese Bidirectional Language Model (PBLM) with the MultiFit configuration with 1 NVIDIA GPU v100 on GCP.

MultiFiT configuration (architecture 4 QRNN with 1550 hidden parameters by layer / tokenizer SentencePiece (15 000 tokens))

notebook lm3-portuguese.ipynb (nbviewer of the notebook): code used to train a Portuguese Bidirectional LM on a 100 millions corpus extrated from Wikipedia by using the MultiFiT configuration.
link to download pre-trained parameters and vocabulary in models

PBLM	accuracy	perplexity	training time
forward	39.68%	21.76	8h
backward	43.67%	22.16	8h

Applications:
- notebook lm3-portuguese-classifier-TCU-jurisprudencia.ipynb (nbviewer of the notebook): code used to fine-tune a Portuguese Bidirectional LM and a Text Classifier on "(reduzido) TCU jurisprudência" dataset.
- notebook lm3-portuguese-classifier-olist.ipynb (nbviewer of the notebook): code used to fine-tune a Portuguese Bidirectional LM and a Sentiment Classifier on "Brazilian E-Commerce Public Dataset by Olist" dataset.

[ WARNING ] The code of this notebook lm3-portuguese-classifier-olist.ipynb must be updated in order to use the SentencePiece model and vocab already trained for the Portuguese Language Model in the notebook lm3-portuguese.ipynb as it was done in the notebook lm3-portuguese-classifier-TCU-jurisprudencia.ipynb (see explanations at the top of this notebook).

Here's an example of using the classifier to predict the category of a TCU legal text:

French

I trained 3 French Bidirectional Language Models (FBLM) with 1 NVIDIA GPU v100 on GCP but the best is the one trained with the MultiFit configuration.

French Bidirectional Language Models (FBLM)		accuracy	perplexity	training time
MultiFiT with 4 QRNN + SentencePiece (15 000 tokens)	forward	43.77%	16.09	8h40
	backward	49.29%	16.58	8h10
ULMFiT with 3 QRNN + SentencePiece (15 000 tokens)	forward	40.99%	19.96	5h30
	backward	47.19%	19.47	5h30
ULMFiT with 3 AWD-LSTM + spaCy (60 000 tokens)	forward	36.44%	25.62	11h
	backward	42.65%	27.09	11h

1. MultiFiT configuration (architecture 4 QRNN with 1550 hidden parameters by layer / tokenizer SentencePiece (15 000 tokens))

notebook lm3-french.ipynb (nbviewer of the notebook): code used to train a French Bidirectional LM on a 100 millions corpus extrated from Wikipedia by using the MultiFiT configuration.
link to download pre-trained parameters and vocabulary in models

FBLM	accuracy	perplexity	training time
forward	43.77%	16.09	8h40
backward	49.29%	16.58	8h10

Application: notebook lm3-french-classifier-amazon.ipynb (nbviewer of the notebook): code used to fine-tune a French Bidirectional LM and a Sentiment Classifier on "French Amazon Customer Reviews" dataset.

Here's an example of using the classifier to predict the feeling of comments on an amazon product:

2. Architecture QRNN / tokenizer SentencePiece

notebook lm2-french.ipynb (nbviewer of the notebook): code used to train a French Bidirectional LM on a 100 millions corpus extrated from Wikipedia
link to download pre-trained parameters and vocabulary in models

FBLM	accuracy	perplexity	training time
forward	40.99%	19.96	5h30
backward	47.19%	19.47	5h30

Application: notebook lm2-french-classifier-amazon.ipynb (nbviewer of the notebook): code used to fine-tune a French Bidirectional LM and a Sentiment Classifier on "French Amazon Customer Reviews" dataset.

3. Architecture AWD-LSTM / tokenizer spaCy

notebook lm-french.ipynb (nbviewer of the notebook): code used to train a French Bidirectional LM on a 100 millions corpus extrated from Wikipedia
link to download pre-trained parameters and vocabulary in models

FBLM	accuracy	perplexity	training time
forward	36.44%	25.62	11h
backward	42.65%	27.09	11h

Application: notebook lm-french-classifier-amazon.ipynb (nbviewer of the notebook): code used to fine-tune a French Bidirectional LM and a Sentiment Classifier on "French Amazon Customer Reviews" dataset.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Language Models

Portuguese

MultiFiT configuration (architecture 4 QRNN with 1550 hidden parameters by layer / tokenizer SentencePiece (15 000 tokens))

French

1. MultiFiT configuration (architecture 4 QRNN with 1550 hidden parameters by layer / tokenizer SentencePiece (15 000 tokens))

2. Architecture QRNN / tokenizer SentencePiece

3. Architecture AWD-LSTM / tokenizer spaCy

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
models		models
README.md		README.md
converter.py		converter.py
lm-french-classifier-amazon.ipynb		lm-french-classifier-amazon.ipynb
lm-french-generator.ipynb		lm-french-generator.ipynb
lm-french.ipynb		lm-french.ipynb
lm2-french-classifier-amazon.ipynb		lm2-french-classifier-amazon.ipynb
lm2-french.ipynb		lm2-french.ipynb
lm3-french-classifier-amazon.ipynb		lm3-french-classifier-amazon.ipynb
lm3-french.ipynb		lm3-french.ipynb
lm3-portuguese-classifier-TCU-jurisprudencia.ipynb		lm3-portuguese-classifier-TCU-jurisprudencia.ipynb
lm3-portuguese-classifier-olist.ipynb		lm3-portuguese-classifier-olist.ipynb
lm3-portuguese.ipynb		lm3-portuguese.ipynb
nlputils2.py		nlputils2.py

Ganeshpadmanaban/language-models

Folders and files

Latest commit

History

Repository files navigation

Language Models

Portuguese

MultiFiT configuration (architecture 4 QRNN with 1550 hidden parameters by layer / tokenizer SentencePiece (15 000 tokens))

French

1. MultiFiT configuration (architecture 4 QRNN with 1550 hidden parameters by layer / tokenizer SentencePiece (15 000 tokens))

2. Architecture QRNN / tokenizer SentencePiece

3. Architecture AWD-LSTM / tokenizer spaCy

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages