Merge branch 'master' into opt_gpt2

makcedward authored Nov 19, 2019
2 parents eedbf13 + 9863867 commit 0762f78
Showing 62 changed files with 669 additions and 430 deletions.
29 changes: 17 additions & 12 deletions README.md

This Python library helps you with augmenting NLP for your machine learning projects.
## Augmenter
| Augmenter | Target | Augmenter | Action | Description |
|:---:|:---:|:---:|:---:|:---:|
|Textual| Character | KeyboardAug | substitute | Simulate keyboard distance error |
|Textual| | OcrAug | substitute | Simulate OCR engine error |
|Textual| | [RandomAug](https://medium.com/hackernoon/does-your-nlp-model-able-to-prevent-adversarial-attack-45b5ab75129c) | insert, substitute, swap, delete | Apply augmentation randomly |
|Textual| Word | AntonymAug | substitute | Substitute opposite meaning word according to WordNet antonym |
|Textual| | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a [BERT](https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb), DistilBERT, [RoBERTa](https://medium.com/towards-artificial-intelligence/a-robustly-optimized-bert-pretraining-approach-f6b6e537e6a6) or [XLNet](https://medium.com/dataseries/why-does-xlnet-outperform-bert-da98a8503d5b) language model to find the most suitable word for augmentation |
|Textual| | RandomWordAug | swap, delete | Apply augmentation randomly |
|Textual| | SpellingAug | substitute | Substitute word according to spelling mistake dictionary |
|Textual| | SplitAug | split | Split one word into two words randomly |
|Textual| | SynonymAug | substitute | Substitute similar word according to WordNet/PPDB synonym |
|Textual| | [TfIdfAug](https://medium.com/towards-artificial-intelligence/unsupervised-data-augmentation-6760456db143) | insert, substitute | Use TF-IDF to find out how a word should be augmented |
|Textual| | WordEmbsAug | insert, substitute | Leverage [word2vec](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a), [GloVe](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a) or [fastText](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a) embeddings to apply augmentation |
|Textual| Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to [XLNet](https://medium.com/dataseries/why-does-xlnet-outperform-bert-da98a8503d5b), [GPT2](https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655) or DistilGPT2 prediction |
|Signal| Audio | CropAug | delete | Delete audio's segment |
|Signal| | LoudnessAug | substitute | Adjust audio's volume |
|Signal| | MaskAug | substitute | Mask audio's segment |
|Signal| | NoiseAug | substitute | Inject noise |
|Signal| | PitchAug | substitute | Adjust audio's pitch |
|Signal| | ShiftAug | substitute | Shift time dimension forward/backward |
|Signal| | SpeedAug | substitute | Adjust audio's speed |
|Signal| | VtlpAug | substitute | Change vocal tract |
|Signal| Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
|Signal| | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
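To make the character-level entries above concrete, here is a minimal sketch of what KeyboardAug-style substitution does. This is an illustration only, not nlpaug's implementation: the neighbour map is a small hypothetical subset of a QWERTY layout, and `keyboard_typo`/`aug_char_p` are names invented for this sketch.

```python
import random

# Hypothetical subset of a QWERTY neighbour map; a real keyboard-distance
# augmenter ships a much fuller table derived from physical key positions.
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx",
    "e": "wsdr", "o": "iklp", "n": "bhjm",
}

def keyboard_typo(text, aug_char_p=0.3, seed=None):
    """Substitute each mapped character with a keyboard neighbour
    with probability aug_char_p, leaving other characters intact."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = KEYBOARD_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < aug_char_p:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)

print(keyboard_typo("one sentence", aug_char_p=0.5, seed=7))
```

Because substitutions are one-for-one, the augmented string always keeps the original length, which is the property that distinguishes `substitute` from the `insert`/`delete` actions in the table.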

pip install librosa>=0.7.1
* Support injecting noise into only a portion of the audio in NoiseAug
* Introduce `zone` and `coverage` to all audio augmenters, so that only a portion of the audio input is augmented
* Add VTLP augmentation methods (Audio's augmenter)
* Adopt latest transformer's interface [#59](https://github.com/makcedward/nlpaug/pull/59)
* Support RoBERTa (including DistilRoBERTa) and DistilBERT (ContextualWordEmbsAug)
* Support DistilGPT2 (ContextualWordEmbsForSentenceAug)
* Fix librosa hard dependency [#62](https://github.com/makcedward/nlpaug/issues/62)
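The `zone`/`coverage` idea in the changelog above can be sketched as follows. This is a hypothetical illustration of the semantics, not nlpaug's code: `zone` bounds the fraction of the signal eligible for augmentation, `coverage` the fraction of that zone actually modified, and `augment_zone` is a name invented here.

```python
import numpy as np

def augment_zone(audio, zone=(0.2, 0.8), coverage=0.5, noise_level=0.01, seed=0):
    """Add noise only inside a randomly placed window within `zone`.

    zone:     (start, end) fractions of the signal eligible for augmentation.
    coverage: fraction of the zone that the augmented window spans.
    """
    rng = np.random.default_rng(seed)
    n = len(audio)
    z_start, z_end = int(n * zone[0]), int(n * zone[1])
    win = int((z_end - z_start) * coverage)
    start = int(rng.integers(z_start, z_end - win + 1))
    out = audio.copy()
    out[start:start + win] += rng.normal(0, noise_level, win)
    return out, (start, start + win)

audio = np.zeros(1000, dtype=np.float32)
augmented, (s, e) = augment_zone(audio)
```

With the defaults, at most 60% of the signal is eligible and exactly half of that window is perturbed; everything outside `[s, e)` passes through untouched.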

**0.0.10 Nov 4, 2019**
* Add aug_max to control maximum number of augmented item
9 changes: 6 additions & 3 deletions SOURCE.md

Pre-trained Model File
----------------------
* [word2vec](https://code.google.com/archive/p/word2vec/) (Google): Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean released [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
* [GloVe](https://nlp.stanford.edu/projects/glove/) (Stanford): Jeffrey Pennington, Richard Socher, and Christopher D. Manning released [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)
* [fastText](https://fasttext.cc/docs/en/english-vectors.html) (Facebook): Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch and Armand Joulin released [Advances in Pre-Training Distributed Word Representations](https://arxiv.org/pdf/1712.09405.pdf)
* [BERT](https://github.com/google-research/bert) (Google): Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova released [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [RoBERTa](https://github.com/pytorch/fairseq) (UW/Facebook): Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov released [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [DistilBERT](https://github.com/huggingface/transformers) (Hugging Face): Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf released [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [GPT2](https://github.com/openai/gpt-2) (OpenAI): Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever released [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [DistilGPT2](https://github.com/huggingface/transformers) (Hugging Face): Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [XLNet](https://github.com/zihangdai/xlnet) (Google/CMU): Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le released [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).

Raw Data Source
---------------
Expand Down
17 changes: 9 additions & 8 deletions docs/augmenter/audio/audio.rst

Audio Augmenter
===============

.. toctree::
   :maxdepth: 6

   ./corp
   ./loudness
   ./mask
   ./noise
   ./pitch
   ./shift
   ./speed
   ./vtlp
7 changes: 7 additions & 0 deletions docs/augmenter/audio/vtlp.rst
nlpaug.augmenter.audio\.vtlp
============================

.. automodule:: nlpaug.augmenter.audio.vtlp
   :members:
   :inherited-members:
   :show-inheritance:
12 changes: 6 additions & 6 deletions docs/augmenter/augmenter.rst

Augmenter
=========

.. toctree::
   :maxdepth: 6

   ./audio/audio
   ./char/char
   ./sentence/sentence
   ./spectrogram/spectrogram
   ./word/word
8 changes: 4 additions & 4 deletions docs/augmenter/char/char.rst

Character Augmenter
===================

.. toctree::
   :maxdepth: 6

   ./keyboard
   ./ocr
   ./random
4 changes: 2 additions & 2 deletions docs/augmenter/sentence/sentence.rst

Sentence Augmenter
==================

.. toctree::
   :maxdepth: 6

   ./context_word_embs_sentence
6 changes: 3 additions & 3 deletions docs/augmenter/spectrogram/spectrogram.rst

Spectrogram Augmenter
=====================

.. toctree::
   :maxdepth: 6

   ./frequency_masking
   ./time_masking
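The two spectrogram augmenters documented above zero out a band of a spectrogram along one axis. A minimal numpy sketch of frequency masking follows; this is an illustration of the idea, not nlpaug's implementation, and `frequency_mask`/`F` are names chosen for this sketch.

```python
import numpy as np

def frequency_mask(spec, F=10, seed=0):
    """Zero out f consecutive frequency rows of a (freq, time)
    spectrogram, with f drawn uniformly from [0, F)."""
    rng = np.random.default_rng(seed)
    num_freq = spec.shape[0]
    f = int(rng.integers(0, F))            # width of the masked band
    f0 = int(rng.integers(0, num_freq - f + 1))  # band start row
    out = spec.copy()
    out[f0:f0 + f, :] = 0.0
    return out, (f0, f0 + f)

spec = np.ones((80, 100))                  # 80 mel bins, 100 frames
masked, (f0, f1) = frequency_mask(spec)
```

Time masking is the same operation applied along the second axis instead of the first.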
19 changes: 9 additions & 10 deletions docs/augmenter/word/word.rst

Word Augmenter
==============

.. toctree::
   :maxdepth: 6

   ./antonym
   ./context_word_embs
   ./random
   ./spelling
   ./split
   ./synonym
   ./tfidf
   ./word_embs
9 changes: 0 additions & 9 deletions docs/augmenter/word/wordnet.rst

This file was deleted.

4 changes: 2 additions & 2 deletions docs/conf.py
# built documents.
#
# The short X.Y version.
version = '0.0.11beta'
# The full version, including alpha/beta/rc tags.
release = '0.0.11beta'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
47 changes: 33 additions & 14 deletions example/audio_augmenter.ipynb

Large diffs are not rendered by default.