Update 10/26/23: See https://github.com/facebookresearch/doc-storygen-v2 for a version of the code with prompts that support newer chat models (e.g., LLaMA-2, ChatGPT). It follows the same high-level structure, but its behavior isn't exactly the same in all places (e.g., some pieces are removed for simplicity, and some heuristic checks are no longer necessary); our main goal with the rewrite was to make the code easier to work with and modify.
This repo contains code for DOC: Improving Long Story Coherence With Detailed Outline Control (https://arxiv.org/abs/2212.10077, ACL 2023) by Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong Tian. In this codebase we provide instructions for automatically generating longer stories (avg 3500+ words in our paper experiments). DOC's stories are judged by human annotators as substantially more coherent, relevant, and interesting compared to those written by our previous system, Re3 (https://github.com/yangkevin2/emnlp22-re3-story-generation).
(1) Install Python 3.8.15 and PyTorch 1.13.1 (slightly older/newer versions are probably also fine for both).
(2) Install the remaining requirements via pip install -r requirements.txt. You may also need to run pip install -U sentence-transformers if you get crashes related to huggingface_hub.snapshot_download later. If you get some issue with numpy versions, try version 1.22.4.
(3) Install this repo with:
pip install -e .
Also run export OPENAI_API_KEY=$YOUR_API_KEY in your terminal so that the code can call the GPT3 API with your key.
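A missing key may otherwise only show up partway through a run, so it can help to check the variable up front. The following is a minimal sketch; the helper name require_openai_key is hypothetical and is not part of this repo.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; run `export OPENAI_API_KEY=...` first."
        )
    return key
```

Calling require_openai_key() at the top of your own wrapper script fails fast instead of after model checkpoints have already been loaded.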
Meanwhile, run wget https://doc-story-generation-data.s3.amazonaws.com/doc_data.zip and unzip it at the top level of this repo. The resulting folder contains pretrained controller/reranker ckpts and the data used to train the controllers/rerankers.
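To confirm the download landed in the right place, you could check for the checkpoint directories that the commands in this README pass to --controller-load-dir. This is a sketch under the assumption that the archive unpacks to exactly those doc_data/ckpt/... paths; the helper missing_ckpt_dirs is hypothetical.

```python
from pathlib import Path

# Checkpoint directories referenced by the plan and story commands in this README.
EXPECTED_CKPT_DIRS = [
    "doc_data/ckpt/outline_order_reranker",
    "doc_data/ckpt/relevance_reranker",
    "doc_data/ckpt/coherence_reranker",
    "doc_data/ckpt/detailed_controller",
]


def missing_ckpt_dirs(repo_root="."):
    """Return the expected checkpoint directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_CKPT_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_ckpt_dirs()
    if missing:
        print("Missing checkpoint directories:", ", ".join(missing))
    else:
        print("All expected checkpoint directories found.")
```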
To get the final generated stories and Surge AI annotation results from our main experiments, run wget https://doc-story-generation-data.s3.amazonaws.com/doc_outputs.zip (note: some generated stories may contain sensitive/NSFW content, since we didn't attempt to filter these).
We first generate the plan/outline before moving on to the main story.
Example plan generation command matching the settings used for our main paper experiments:
mkdir output
CUDA_VISIBLE_DEVICES=0 python -u scripts/main.py --controller none none none longformer_classifier --loader none none none order --controller-load-dir none none none doc_data/ckpt/outline_order_reranker --controller-model-string none none none roberta-large --no-editor --setup-only --outline-levels 3 --save-outline-file output/plan.pkl --log-file output/plan.log
The code assumes the outline order reranker is in the 4th position of each of the list-valued arguments (--controller, --loader, --controller-load-dir, --controller-model-string), so don't change those parts of the command. Don't worry if you see some errors being printed, as long as the program doesn't terminate early; some parts might need multiple tries.
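The position requirement is easier to see when the four list-valued arguments are lined up slot by slot. The sketch below uses the values from the plan command above; the pairing helper pair_slots is illustrative only and is not the repo's actual argument handling.

```python
# Values copied from the plan generation command, in order.
controller = ["none", "none", "none", "longformer_classifier"]
loader = ["none", "none", "none", "order"]
load_dir = ["none", "none", "none", "doc_data/ckpt/outline_order_reranker"]
model_string = ["none", "none", "none", "roberta-large"]


def pair_slots(*arg_lists):
    """Zip parallel argument lists into per-slot tuples, insisting on equal lengths."""
    lengths = {len(a) for a in arg_lists}
    if len(lengths) != 1:
        raise ValueError(f"argument lists have mismatched lengths: {sorted(lengths)}")
    return list(zip(*arg_lists))


slots = pair_slots(controller, loader, load_dir, model_string)
# Slot index 3 (the 4th position) holds the outline order reranker settings.
order_reranker = slots[3]
print(order_reranker)
```

Reordering or dropping an entry in one list but not the others would pair the reranker type with the wrong loader or checkpoint, which is why the command should be kept as written.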
This command uses our existing reranker ckpts included in the download. If you want to use your own ckpts, see the instructions further down for training, and change the paths in this command to point to the correct ckpts.
Generating a plan with these settings costs a couple of dollars on GPT3.
Plan generation arguments are compiled in scripts/main.py; follow the links there to see a complete list. Some particular arguments of interest:
- Specify the --premise argument to specify your own story premise instead of having one autogenerated by GPT3.
- Change --outline-levels to change the maximum depth of the outline.
- Set --outline-char-model-string to a different InstructGPT3 model (e.g., text-curie-001) to save a sizable chunk (if not most) of the GPT3 cost in exchange for slightly worse performance when detecting characters for the outline.
- Use --outline-restart-pkl to continue generation from a previously-generated lower-depth pkl file. (We use this functionality for our human-interactive experiments.)
- Set --log-level to be something between 21 and 25 to vary the verbosity of logging (higher = less verbose; defaults to 25).
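The "higher = less verbose" behavior of --log-level matches how numeric thresholds work in Python's standard logging module, where the values 21 to 25 sit between INFO (20) and WARNING (30). The sketch below assumes the repo maps --log-level onto such a threshold (an assumption; check scripts/main.py for the actual handling), and the helper names are hypothetical.

```python
import io
import logging


def make_logger(threshold, stream):
    """Build a logger that emits only records at or above `threshold`."""
    logger = logging.getLogger(f"doc_sketch_{threshold}")
    logger.setLevel(threshold)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelno)s %(message)s"))
    logger.addHandler(handler)
    return logger


def emitted_levels(threshold):
    """Log one message at each level 21..25 and return the levels that got through."""
    stream = io.StringIO()
    logger = make_logger(threshold, stream)
    for level in range(21, 26):
        logger.log(level, "message at level %d", level)
    return [int(line.split()[0]) for line in stream.getvalue().splitlines()]


print(emitted_levels(21))  # all five levels pass the threshold
print(emitted_levels(25))  # only level 25 passes the threshold
```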
After generating a plan according to the previous instructions, we can generate the story.
Our main story generation uses OPT-175B served using Alpa (https://alpa.ai/), since it allows token-level logit modification to run controlled generation approaches such as DOC's detailed controller as described in the paper. You have a few options here.
You can ask the Alpa folks for a key to call their free public API at https://opt.alpa.ai/ (slack link at the bottom). They're really nice.
This option may be slower (in runtime) depending on your physical location, since their servers are in the Middle East.
(We need to access the logprobs endpoint, not the default completions one.)
Once you have a key, you can specify --alpa-url https://opt.alpa.ai --alpa-key YOUR_KEY in the main story command below.
If you have the compute, you can request the weights from Meta (https://forms.gle/BDB2i44QwCr2mCJN6) and serve it yourself using Alpa. This is the best (high-quality and reasonable speed) option if you can do it.
Follow the installation and serving instructions at https://alpa.ai/install.html and https://alpa.ai/tutorials/opt_serving.html respectively. The newest version of Alpa should work, but we also froze the version we used at https://github.com/yangkevin2/doc-alpa in case it's useful.
Once you have it set up, specify --alpa-url YOUR_SERVER_URL in the main story command below (e.g., in the format http://0.0.0.0:8001).
Alternatively you can use a smaller OPT model, though this will result in noticeably worse quality.
Just use GPT3-175B instead, which means turning off our detailed controller. You will on average get noticeably worse faithfulness to the plan/outline, but it'll be quite a bit faster.
To do this, set --extension-method gpt3 in the main story command below. This will use the base davinci model (i.e., not one of the instruction-tuned GPT3.5/GPT4 models, which use a different prompting interface and aren't currently supported; these instruction-tuned models also often write in a somewhat different style).
It's not too expensive as far as the GPT3 API is concerned; you'll probably spend less than a dollar over the course of the story.
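Token-level logit modification, which the OPT-based options above allow and the GPT3 option gives up, can be illustrated with a toy example. This is a simplified sketch in the spirit of FUDGE-style reweighting (adding a scaled controller log-probability to each candidate token's logit); it is not the repo's implementation, and the function names and all numbers are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def controlled_distribution(base_logits, controller_logprobs, strength):
    """Reweight next-token logits by a controller's log-probabilities.

    base_logits: the generator's logits for each candidate token.
    controller_logprobs: log P(desired attribute | prefix + token) per candidate.
    strength: multiplier on the controller term; 0 recovers the base model.
    """
    if len(base_logits) != len(controller_logprobs):
        raise ValueError("one controller score is needed per candidate token")
    combined = [b + strength * c for b, c in zip(base_logits, controller_logprobs)]
    return softmax(combined)


# Toy example with three candidate tokens (all numbers are made up).
base = [2.0, 1.0, 0.5]
ctrl = [math.log(0.1), math.log(0.8), math.log(0.5)]
print(controlled_distribution(base, ctrl, strength=0.0))
print(controlled_distribution(base, ctrl, strength=1.0))
```

With strength 0 the base model's favorite token (index 0) stays on top; with strength 1 the controller's preference shifts the most likely token to index 1. Doing this requires per-token scores from the generator, which is why an API that only returns finished completions is not enough.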
After setting up your OPT-175B (or other) server, run the following to draft the story using the same settings as in our main paper experiments, making sure to append the extra Alpa-related (or other) arguments described above.
CUDA_VISIBLE_DEVICES=0 python -u scripts/main.py {{{ALPA_ARGS}}} --controller longformer_classifier longformer_classifier fudge_controller --loader alignment coherence fine_coherence --controller-load-dir doc_data/ckpt/relevance_reranker doc_data/ckpt/coherence_reranker doc_data/ckpt/detailed_controller --controller-model-string allenai/longformer-base-4096 allenai/longformer-base-4096 facebook/opt-350m --load-outline-file output/plan.pkl --no-editor --include-future-context --control-strength 1 1 0 --control-strength-substep-increment 3 --save-complete-file output/story.pkl --log-file output/story.log
The command assumes all 3 rerankers/controllers are present in the specified order, so don't change those arguments.
This command uses our existing reranker ckpts included in the download. If you want to use your own ckpts, see the instructions further down for training, and change the paths in this command to point to the correct ckpts.
Although this command still uses GPT3 to write some summaries for prompting, the costs are on the order of a few cents.
Main story generation arguments are also compiled in scripts/main.py; follow the links there to see a complete list. Some particular arguments of interest:
- Change --max-continuation-substeps (defaults to 8) and --max-tokens (defaults to 64) to change the maximum amount of story text written for each numbered item of the outline. With the default settings, it will write up to eight 64-token passages for each.
- Change --early-stop-threshold and --skip-threshold to mess with the early stopping heuristics for moving drafting to the next outline item. Smaller (more negative) values of --early-stop-threshold will result in more aggressive early stopping. Larger (less negative) values of --skip-threshold will result in more frequently skipping directly to the next outline item when all generated passage candidates aren't very good.
- --control-strength has three numbers corresponding to the relevance reranker, coherence reranker, and detailed controller respectively. The detailed controller's control strength increases over time according to --control-strength-substep-increment up to --max-control-strength while drafting for a given outline item, resetting when we move to the next outline item. We think the current settings are a reasonable balance of control vs. letting the model be creative, but feel free to tweak. To turn off the detailed controller just use --control-strength-substep-increment 0.
- The frequency and prompt repetition penalties during generation are set to 1 (with 0.98 exponential decay per token). You can change these via --summarizer-frequency-penalty, --summarizer-prompt-penalty, and --summarizer-frequency-penalty-decay respectively. Other arguments related to the base generator are in story-generation/common/summarizer/summarizer_util.py.
- If you have a high-depth outline and you want to generate using lower depth (e.g., convert a depth 3 outline to a depth 2 outline), specify --generation-outline-levels.
- Increase --max-beam-size (defaults to 1) to turn on a passage-level variable-size beam search procedure based on the rerankers. This is off for the paper experiments (it makes the system several times slower).
- If you run out of GPU memory you can try decreasing --fudge-batch-size to e.g. 32 (or less), or retrain smaller rerankers/controllers according to the instructions at the bottom of the README.
- Remove --no-editor to turn on the Edit module inherited from Re3 (not heavily tested; DOC doesn't use it in our main experiments).
- Set --log-level to be something between 21 and 25 to vary the verbosity of logging (higher = less verbose; defaults to 24).
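Two of the settings in the argument list above reduce to simple arithmetic: the per-item token budget and the ramp of the detailed controller's strength. The sketch below assumes a linear ramp that starts at the initial strength on the first substep and is capped at the maximum; the exact schedule in the code may differ. The starting strength 0 and increment 3 come from the story command above, while the cap of 10.0 is a hypothetical placeholder, not the repo's default for --max-control-strength.

```python
def max_tokens_per_outline_item(max_continuation_substeps=8, max_tokens=64):
    """Upper bound on tokens drafted for one outline item (substeps x tokens each)."""
    return max_continuation_substeps * max_tokens


def detailed_control_strength(substep, initial=0.0, increment=3.0, max_strength=10.0):
    """Assumed linear ramp, capped at max_strength.

    substep counts from 0 and resets whenever drafting moves to a new outline item.
    max_strength=10.0 is a hypothetical value chosen only for this illustration.
    """
    return min(initial + increment * substep, max_strength)


print(max_tokens_per_outline_item())  # 8 * 64 = 512
schedule = [detailed_control_strength(s) for s in range(8)]
print(schedule)
```

Setting the increment to 0 with an initial strength of 0 keeps every entry of the schedule at 0, which is consistent with the note above that --control-strength-substep-increment 0 turns the detailed controller off.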
Using very small OPT models could lead to crashes since we didn't extensively test the edge cases where all the generated continuations get rejected by our filters (you can set --skip-threshold -10000 to avoid this happening). This may sometimes happen with GPT3 as well when the detailed controller is off. This crash never happened in our main experiments using OPT-175B.
Baselines assume you already have a plan generated by our code according to the command described earlier.
Take the plan generated by our code (output/plan.pkl in the commands below) and save just the setting/characters and top-level outline for use in Re3:
python scripts/data/save_re3_plan.py -i output/plan.pkl -o output/re3_plan.pkl
Then follow the instructions in https://github.com/yangkevin2/emnlp22-re3-story-generation. Run with OPT-175B for fair comparison, using --extension-method opt; the Alpa arguments are the same as in this repo. Specify the already-generated plan file using --load-outline-file. You'll also want to set --max-candidates 8 --summarizer-frequency-penalty 1 --summarizer-prompt-penalty 1 as well as --max-continuation-substeps 5 to roughly match the length of stories we generate in our main experiments.
python -u scripts/rolling_baselines.py {{{ALPA_ARGS}}} --load-outline-file output/plan.pkl --extension-method opt --save-complete-file output/rolling_opt_story.pkl > output/rolling_opt_story.log
python -u scripts/rolling_baselines.py --load-outline-file output/plan.pkl --extension-method gpt3 --save-complete-file output/rolling_gpt3_story.pkl > output/rolling_gpt3_story.log
Set the checkpoint save directory and run the command below. The training data (derived from InstructGPT-13B summaries of passages from WritingPrompts (Fan et al 2018)) is provided in the data download.
CUDA_VISIBLE_DEVICES=0 python scripts/training/train_controller.py --controller-save-dir {{{SAVE_DIRECTORY}}} --controller fudge_controller --controller-model-string facebook/opt-350m --data-dir doc_data/training_data/detailed_controller_training_data.csv --dataset alignment --loader fine_coherence --batch-size 2 --lower-length-limit 1000 --controller-epochs 20 --num-workers 8 --controller-num-negatives 3 --controller-lr 1e-6 --coherence-negative-categories other shuffle repeat --limit 100000
Set the checkpoint save directory and run the command below. The training data (some very brief, outline-like stories generated from InstructGPT3-175B) is provided in the data download.
CUDA_VISIBLE_DEVICES=0 python scripts/training/train_controller.py --controller-save-dir {{{SAVE_DIRECTORY}}} --controller longformer_classifier --controller-model-string roberta-large --data-dir doc_data/training_data/order_training_data.csv --dataset csv --csv-column story --loader order --batch-size 64 --controller-epochs 20 --controller-lr 1e-5 --limit 100000 --num-workers 8
If you want to retrain the relevance and coherence rerankers yourself, follow the instructions in https://github.com/yangkevin2/emnlp22-re3-story-generation, since ours are unchanged from theirs.