Merge branch 'master' into opt_gpt2

makcedward authored Nov 19, 2019
2 parents eedbf13 + 9863867 commit 0762f78
Showing 62 changed files with 669 additions and 430 deletions.
29 changes: 17 additions & 12 deletions README.md

This Python library helps you with augmenting NLP for your machine learning projects.
## Augmenter
| Augmenter | Target | Augmenter | Action | Description |
|:---:|:---:|:---:|:---:|:---:|
|Textual| Character | KeyboardAug | substitute | Simulate keyboard distance error |
|Textual| | OcrAug | substitute | Simulate OCR engine error |
|Textual| | [RandomAug](https://medium.com/hackernoon/does-your-nlp-model-able-to-prevent-adversarial-attack-45b5ab75129c) | insert, substitute, swap, delete | Apply augmentation randomly |
|Textual| Word | AntonymAug | substitute | Substitute opposite meaning word according to WordNet antonym |
|Textual| | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a [BERT](https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb), DistilBERT, [RoBERTa](https://medium.com/towards-artificial-intelligence/a-robustly-optimized-bert-pretraining-approach-f6b6e537e6a6) or [XLNet](https://medium.com/dataseries/why-does-xlnet-outperform-bert-da98a8503d5b) language model to find the most suitable word for augmentation |
|Textual| | RandomWordAug | swap, delete | Apply augmentation randomly |
|Textual| | SpellingAug | substitute | Substitute word according to spelling mistake dictionary |
|Textual| | SplitAug | split | Split one word into two words randomly |
|Textual| | SynonymAug | substitute | Substitute similar word according to WordNet/PPDB synonym |
|Textual| | [TfIdfAug](https://medium.com/towards-artificial-intelligence/unsupervised-data-augmentation-6760456db143) | insert, substitute | Use TF-IDF to find out how a word should be augmented |
|Textual| | WordEmbsAug | insert, substitute | Leverage [word2vec](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a), [GloVe](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a) or [fastText](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a) embeddings to apply augmentation |
|Textual| Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to [XLNet](https://medium.com/dataseries/why-does-xlnet-outperform-bert-da98a8503d5b), [GPT2](https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655) or DistilGPT2 prediction |
|Signal| Audio | CropAug | delete | Delete audio's segment |
|Signal| | LoudnessAug | substitute | Adjust audio's volume |
|Signal| | MaskAug | substitute | Mask audio's segment |
|Signal| | NoiseAug | substitute | Inject noise |
|Signal| | PitchAug | substitute | Adjust audio's pitch |
|Signal| | ShiftAug | substitute | Shift time dimension forward/backward |
|Signal| | SpeedAug | substitute | Adjust audio's speed |
|Signal| | VtlpAug | substitute | Change vocal tract |
|Signal| Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
|Signal| | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
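To make the character-level entries above concrete, here is a minimal sketch of what KeyboardAug-style substitution does. This is an illustration only, not nlpaug's implementation: the neighbour map is a small hypothetical subset of a QWERTY layout, and `keyboard_typo`/`aug_char_p` are names invented for this sketch.

```python
import random

# Hypothetical subset of a QWERTY neighbour map; a real keyboard-distance
# augmenter ships a much fuller table derived from physical key positions.
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx",
    "e": "wsdr", "o": "iklp", "n": "bhjm",
}

def keyboard_typo(text, aug_char_p=0.3, seed=None):
    """Substitute each mapped character with a keyboard neighbour
    with probability aug_char_p, leaving other characters intact."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = KEYBOARD_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < aug_char_p:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)

print(keyboard_typo("one sentence", aug_char_p=0.5, seed=7))
```

Because substitutions are one-for-one, the augmented string always keeps the original length, which is the property that distinguishes `substitute` from the `insert`/`delete` actions in the table.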

pip install librosa>=0.7.1
* Support injecting noise into only a portion of the audio in NoiseAug
* Introduce `zone` and `coverage` to all audio augmenters, so that only a portion of the audio input is augmented
* Add VTLP augmentation methods (Audio's augmenter)
* Adopt latest transformer's interface [#59](https://github.com/makcedward/nlpaug/pull/59)
* Support RoBERTa (including DistilRoBERTa) and DistilBERT (ContextualWordEmbsAug)
* Support DistilGPT2 (ContextualWordEmbsForSentenceAug)
* Fix librosa hard dependency [#62](https://github.com/makcedward/nlpaug/issues/62)
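The `zone`/`coverage` idea in the changelog above can be sketched as follows. This is a hypothetical illustration of the semantics, not nlpaug's code: `zone` bounds the fraction of the signal eligible for augmentation, `coverage` the fraction of that zone actually modified, and `augment_zone` is a name invented here.

```python
import numpy as np

def augment_zone(audio, zone=(0.2, 0.8), coverage=0.5, noise_level=0.01, seed=0):
    """Add noise only inside a randomly placed window within `zone`.

    zone:     (start, end) fractions of the signal eligible for augmentation.
    coverage: fraction of the zone that the augmented window spans.
    """
    rng = np.random.default_rng(seed)
    n = len(audio)
    z_start, z_end = int(n * zone[0]), int(n * zone[1])
    win = int((z_end - z_start) * coverage)
    start = int(rng.integers(z_start, z_end - win + 1))
    out = audio.copy()
    out[start:start + win] += rng.normal(0, noise_level, win)
    return out, (start, start + win)

audio = np.zeros(1000, dtype=np.float32)
augmented, (s, e) = augment_zone(audio)
```

With the defaults, at most 60% of the signal is eligible and exactly half of that window is perturbed; everything outside `[s, e)` passes through untouched.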

**0.0.10 Nov 4, 2019**
* Add aug_max to control maximum number of augmented item
9 changes: 6 additions & 3 deletions SOURCE.md

Pre-trained Model File
----------------------
* [word2vec](https://code.google.com/archive/p/word2vec/) (Google): Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean released [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
* [GloVe](https://nlp.stanford.edu/projects/glove/) (Stanford): Jeffrey Pennington, Richard Socher, and Christopher D. Manning released [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)
* [fastText](https://fasttext.cc/docs/en/english-vectors.html) (Facebook): Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch and Armand Joulin released [Advances in Pre-Training Distributed Word Representations](https://arxiv.org/pdf/1712.09405.pdf)
* [BERT](https://github.com/google-research/bert) (Google): Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova released [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [RoBERTa](https://github.com/pytorch/fairseq) (UW/Facebook): Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov released [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [DistilBERT](https://github.com/huggingface/transformers) (Hugging Face): Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf released [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [GPT2](https://github.com/openai/gpt-2) (OpenAI): Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever released [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [DistilGPT2](https://github.com/huggingface/transformers) (Hugging Face): Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).
* [XLNet](https://github.com/zihangdai/xlnet) (Google/CMU): Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le released [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237). Used [Hugging Face](https://huggingface.co/) [PyTorch version](https://github.com/huggingface/transformers).

Raw Data Source
---------------
Expand Down
17 changes: 9 additions & 8 deletions docs/augmenter/audio/audio.rst

Audio Augmenter
===============

.. toctree::
   :maxdepth: 6

   ./corp
   ./loudness
   ./mask
   ./noise
   ./pitch
   ./shift
   ./speed
   ./vtlp
7 changes: 7 additions & 0 deletions docs/augmenter/audio/vtlp.rst
nlpaug.augmenter.audio\.vtlp
============================

.. automodule:: nlpaug.augmenter.audio.vtlp
   :members:
   :inherited-members:
   :show-inheritance:
12 changes: 6 additions & 6 deletions docs/augmenter/augmenter.rst

Augmenter
=========

.. toctree::
   :maxdepth: 6

   ./audio/audio
   ./char/char
   ./sentence/sentence
   ./spectrogram/spectrogram
   ./word/word
8 changes: 4 additions & 4 deletions docs/augmenter/char/char.rst

Character Augmenter
===================

.. toctree::
   :maxdepth: 6

   ./keyboard
   ./ocr
   ./random
4 changes: 2 additions & 2 deletions docs/augmenter/sentence/sentence.rst

Sentence Augmenter
==================

.. toctree::
   :maxdepth: 6

   ./context_word_embs_sentence
6 changes: 3 additions & 3 deletions docs/augmenter/spectrogram/spectrogram.rst

Spectrogram Augmenter
=====================

.. toctree::
   :maxdepth: 6

   ./frequency_masking
   ./time_masking
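The two spectrogram augmenters documented above zero out a band of a spectrogram along one axis. A minimal numpy sketch of frequency masking follows; this is an illustration of the idea, not nlpaug's implementation, and `frequency_mask`/`F` are names chosen for this sketch.

```python
import numpy as np

def frequency_mask(spec, F=10, seed=0):
    """Zero out f consecutive frequency rows of a (freq, time)
    spectrogram, with f drawn uniformly from [0, F)."""
    rng = np.random.default_rng(seed)
    num_freq = spec.shape[0]
    f = int(rng.integers(0, F))            # width of the masked band
    f0 = int(rng.integers(0, num_freq - f + 1))  # band start row
    out = spec.copy()
    out[f0:f0 + f, :] = 0.0
    return out, (f0, f0 + f)

spec = np.ones((80, 100))                  # 80 mel bins, 100 frames
masked, (f0, f1) = frequency_mask(spec)
```

Time masking is the same operation applied along the second axis instead of the first.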
19 changes: 9 additions & 10 deletions docs/augmenter/word/word.rst

Word Augmenter
==============

.. toctree::
   :maxdepth: 6

   ./antonym
   ./context_word_embs
   ./random
   ./spelling
   ./split
   ./synonym
   ./tfidf
   ./word_embs
9 changes: 0 additions & 9 deletions docs/augmenter/word/wordnet.rst

This file was deleted.

4 changes: 2 additions & 2 deletions docs/conf.py
# built documents.
#
# The short X.Y version.
version = '0.0.11beta'
# The full version, including alpha/beta/rc tags.
release = '0.0.11beta'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
47 changes: 33 additions & 14 deletions example/audio_augmenter.ipynb

Large diffs are not rendered by default.