A natural language generation system for generating technical documentation.
This project uses OpenAI GPT-2 together with the fine-tuning code by nshepperd.
Things you can or should do before training

Available models are "117M" and "354M" (not tested); adjust output_dir in the script!

python 1Preprocessing/download_model.py 117M

Note: we added a language identifier to the hyperparameters in hparams.json. Please add "n_lang": "en".
Use subword-nmt by Rico Sennrich to create a new Byte Pair Encoding for your language:

- Place a .txt file you want to extract embeddings from in data/embedding
- Start the process with:

subword-nmt learn-joint-bpe-and-vocab --input data/embedding/yourfile.txt --output data/embedding/vocab.txt --write-vocabulary data/embedding/encoder.txt --separator Ġ --symbols 50257 -v

- Reformat the output so that it fits GPT-2:

python 1Preprocessing/format_embeddings.py

- Move encoder.json and vocab.bpe to your base language model in the models directory
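The actual reformatting is done by format_embeddings.py. As an illustration of the idea only (this is a hypothetical sketch, not the repo's script): GPT-2 expects an encoder.json mapping each BPE token to an integer id, while subword-nmt's --write-vocabulary output lists "token frequency" pairs, one per line, ordered by frequency.

```python
import json

def vocab_to_encoder(vocab_lines):
    """Hypothetical sketch: turn subword-nmt vocabulary lines
    ("token frequency") into a GPT-2-style token -> id mapping."""
    encoder = {}
    for i, line in enumerate(vocab_lines):
        token = line.rsplit(" ", 1)[0]  # strip the trailing frequency
        encoder[token] = i
    return encoder

lines = ["the 1000", "Ġand 800", "Ġof 600"]
print(json.dumps(vocab_to_encoder(lines), ensure_ascii=False))
```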
- Place PDFs in training/PDF
- Clean the PDFs with your own rules (regex, str.replace) in pdf_to_txt.py
- Use pdf_to_txt.py to parse the PDFs into .txt files (with Apache Tika):

python 1Preprocessing/pdf_to_txt.py
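The cleaning rules in pdf_to_txt.py are repo-specific. A minimal sketch of the kind of regex / str.replace cleanup meant here (the raw text itself would come from Apache Tika, e.g. tika.parser.from_file(pdf)["content"]; the rules below are illustrative examples, not the repo's actual rules):

```python
import re

def clean_text(raw):
    """Illustrative PDF-text cleanup rules (not the repo's actual rules)."""
    text = raw.replace("\u00ad", "")         # remove soft hyphens
    text = re.sub(r"-\n(\w)", r"\1", text)   # rejoin words hyphenated across lines
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    text = re.sub(r"[ \t]+", " ", text)      # squeeze spaces and tabs
    return text.strip()
```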
If you don't want to encode your training data on every run, you can save it encoded with numpy savez and load it from that file:

python 1Preprocessing/pre_encode.py ./data/training/PDF ./data/training/trainingsdaten.npz --model_name ISW_Model
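The pre-encoding step boils down to tokenizing once and storing the id arrays with numpy. A minimal sketch of the save/load round trip, assuming one array of BPE token ids per .txt file (dummy ids here, the real script uses the model's encoder):

```python
import numpy as np

# Sketch of the npz round trip: each chunk would be the BPE-encoded
# token ids of one training .txt file; here we use dummy ids.
chunks = [np.array([15496, 995], dtype=np.int32),
          np.array([464, 2746, 318, 3492], dtype=np.int32)]

np.savez("trainingsdaten.npz", *chunks)        # encode and save once...

with np.load("trainingsdaten.npz") as data:    # ...then load on every run
    loaded = [data[name] for name in sorted(data.files)]

assert all((a == b).all() for a, b in zip(chunks, loaded))
```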
Steps to train your own model

- We recommend parsing your files into single .txt files (see Preprocessing)
- Pre-encode to .npz (recommended, see Preprocessing)
- Download the model to retrain and rename it
- Create embeddings (encoder.json and vocab.bpe) for your language (optional, not recommended because of problems)
- Replace the encoder and vocab files
- Start retraining with:

python 2Training/train.py --dataset ./data/training/trainingsdaten.npz --model_name ISW_Model --sample_every 100 --sample_length 200 --run_name iswTrain1

- Wait
- If you are satisfied with the samples (data/training/samples) and the loss: stop (Ctrl+C)
- Get the newest checkpoint from data/training/checkpoint/runX
- Replace the following files in your model with the new ones:
  - checkpoint
  - model.ckpt.data-00000-of-00001
  - model.ckpt.index
  - model.ckpt.meta
- Adjust hparams.n_lang to your language
- Your model is ready to use. If you want to see some stats on TensorBoard, use:

tensorboard --logdir=data/training/iswTrain1/checkpoint
What the backend does

Uses the Hugging Face pytorch-transformers library.
Communicates with the frontend via a REST API on a Flask server.
Receives the supporting words that the generated text should contain, plus settings, from user input in the frontend.
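A hedged sketch of what such a Flask endpoint might look like. The route name, JSON field names, and the generate() stub are assumptions for illustration, not the repo's actual API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(words, settings):
    # Stand-in for the real GPT-2 generation: the backend would build a
    # text containing the supporting words using one of the strategies.
    return "Generated text about " + ", ".join(words)

# Hypothetical endpoint; the real route and JSON schema may differ.
@app.route("/generate", methods=["POST"])
def generate_endpoint():
    payload = request.get_json()
    words = payload.get("words", [])        # supporting words from the frontend
    settings = payload.get("settings", {})  # user settings (length, strategy, ...)
    return jsonify({"text": generate(words, settings)})
```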
Implements 4 different strategies to build text from the given supporting words:

1a. Beam search
1b. Beam search with scope
2. Search until fit
3. Cut-off and insert
4. BERT-GPT2 hybrid

For details, see the paper.
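The strategies themselves are described in the paper. As background, plain beam search over next-token scores can be sketched like this (toy scoring function, not the repo's implementation):

```python
import math

def beam_search(score_fn, vocab, start, beam_width=3, steps=4):
    """Keep the beam_width highest-scoring partial sequences; at each
    step extend every sequence with every token and re-prune."""
    beams = [(0.0, [start])]  # (cumulative log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for tok in vocab:
                candidates.append((logp + math.log(score_fn(seq, tok)),
                                   seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]

# Toy example: the "model" prefers tokens that alternate with the last one.
vocab = ["a", "b"]
def score_fn(seq, tok):
    return 0.9 if tok != seq[-1] else 0.1

print(beam_search(score_fn, vocab, "a"))  # -> ['a', 'b', 'a', 'b', 'a']
```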
How to use the Word Add-In

Generated with the Yeoman generator for Office Add-ins.
For changes, edit the React app in 4Frontend/src/taskpane/components.
To sideload your Add-In in Word, use the following commands inside the 4Frontend directory:

npm start

and

npm stop

Open Start > Show TDTG > enter your inputs and settings > click Generate.
If changes you made to the manifest, e.g. file names of icons for ribbon buttons, do not seem to take effect, clear the Office cache on your machine by deleting the contents of the folder %LOCALAPPDATA%\Microsoft\Office\16.0\Wef\.
Be careful with spaces in your path name: they cause errors when webpack loads the CA certificate and block sideloading your Add-In.