This repository contains the code for our paper GPT3Mix and for running classification experiments with GPT-3 prompt-based data augmentation.
The main dependencies can be installed via pip install -r requirements.txt.
The main code is run through main.py. Check out --help for the full list of options.
python main.py --help
The code will automatically use the first GPU device, if detected.
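The device selection described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes a PyTorch backend, and the function name pick_device is hypothetical.

```python
def pick_device() -> str:
    """Return the first GPU if one is detected, otherwise fall back to CPU.

    Assumption: the classifiers run on PyTorch. If PyTorch is not
    installed, this sketch simply reports "cpu".
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    # "cuda:0" is the first visible GPU device.
    return "cuda:0" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

To choose which physical GPU counts as the first device, the CUDA_VISIBLE_DEVICES environment variable can be set before launching the script.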
The following command trains BERT-base on a 1% subsample of the SST-2 dataset and reports the average over all runs. The example sets --num-trials 1; pass --num-trials 10 to average over 10 runs.
python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter none \
    --save-dir out
The script will create a directory named out in the current working directory and save the script log as out/run.log. If any augmentation is enabled, it will also save the augmentations created during the experiments.
To test GPT3Mix, prepare an OpenAI API key as described at the bottom of this README file, then use the following command:
python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter gpt3-mix \
    --save-dir out
In the command above, the script will automatically generate seeds for sampling data and optimizing models.
The seed used to generate the individual seeds is called the "master seed" and can be set using the --master-data-seed and --master-exp-seed options. As the option names suggest, they control data sampling and the optimization of freshly initialized models, respectively.
Sometimes, we need to manually set the seeds and not rely on automatically generated seeds from the master seeds.
Manual seeding can be achieved via the --data-seeds option. If this option is given, the master data seed will be ignored. We only support manual data seeding for now.
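The relationship between master seeds and per-trial seeds can be sketched as follows. This is only an illustration of the idea under stated assumptions: the function names derive_seeds and resolve_data_seeds are hypothetical, and the repository may derive its seeds differently.

```python
import random
from typing import List, Optional, Sequence


def derive_seeds(master_seed: int, num_trials: int) -> List[int]:
    """Deterministically derive one seed per trial from a master seed.

    Assumption: a dedicated random generator seeded with the master
    seed produces the individual seeds.
    """
    rng = random.Random(master_seed)
    return [rng.randrange(2**32) for _ in range(num_trials)]


def resolve_data_seeds(
    master_data_seed: int,
    num_trials: int,
    data_seeds: Optional[Sequence[int]] = None,
) -> List[int]:
    """Use manually given data seeds if present, else derive them.

    Mirrors the described behavior: when manual seeds are supplied,
    the master data seed is ignored.
    """
    if data_seeds is not None:
        if len(data_seeds) != num_trials:
            raise ValueError("expected one data seed per trial")
        return list(data_seeds)
    return derive_seeds(master_data_seed, num_trials)


if __name__ == "__main__":
    print(resolve_data_seeds(master_data_seed=42, num_trials=3))
    print(resolve_data_seeds(42, 3, data_seeds=[11, 22, 33]))
```

Deriving every seed from one master seed keeps a multi-trial experiment reproducible from a single number, while manual seeds allow a specific data sample to be rerun.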
Store your OpenAI API key in a file named openai-key in the current working directory. The main script detects the key automatically when it runs.
Alternatively, the key can be passed to the script with the --api-key option (not recommended).
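The key lookup described above can be sketched as follows. This is a hedged illustration rather than the script's actual logic: the function name resolve_api_key, its parameters, and the precedence of the command-line value over the file are all assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_api_key(
    cli_key: Optional[str] = None,
    search_dir: Optional[Path] = None,
    key_file: str = "openai-key",
) -> str:
    """Return the API key from the CLI value or from the key file.

    Assumption: an explicitly passed key takes precedence over the
    file. search_dir defaults to the current working directory.
    """
    if cli_key:
        return cli_key.strip()
    path = (search_dir or Path.cwd()) / key_file
    if path.is_file():
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key
    raise FileNotFoundError(f"no API key given and no usable key file at {path}")


if __name__ == "__main__":
    try:
        resolve_api_key()
        print("API key found")
    except FileNotFoundError as err:
        print(err)
```

Keeping the key in a file avoids leaving it in shell history or process listings, which is one likely reason the command-line option is discouraged. The key file should also be excluded from version control.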
At the moment we only support data augmentation leveraging OpenAI GPT-3 (GPT3Mix), but we will release an update that supports HyperCLOVA as soon as it becomes available to the public (HyperMix).
To cite our code or work, please use the following bibtex:
@inproceedings{yoo2021gpt3mix,
title = "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation",
author = "Yoo, Kang Min and
Park, Dongju and
Kang, Jaewook and
Lee, Sang-Woo and
Park, Woomyoung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.192",
pages = "2225--2239",
}
HyperMix
Copyright 2021-present NAVER Corp.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.