This repository contains the current snapshot of the OpenChatKit bot. You can find all training data in data, the hyperparameters used for training in training.yaml, the training log in training_log, and the pointer to the model in model.yaml.
Different branches contain different specialized versions of this bot.
You can make it better by contributing data!
How should we think about the training data for OpenChatKit bots? A training set is a set of slices,
where each slice contains a set of (input, output) pairs. Each slice corresponds to one file in the data folder.
For example, if the data folder contains
data
|- pile.yaml
|- soda.yaml
then, during training, the training set will contain the union of pile and soda.
Note that different slices can be weighted differently, which will be specified in the file training.yaml (see "Model Training" for details).
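To make the "set of slices" idea concrete, here is a minimal sketch in Python. It assumes each slice has already been parsed into a list of (input, output) pairs; the function name build_training_set and the in-memory example data are hypothetical and are not part of this repository.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # one (input, output) example


def build_training_set(slices: Dict[str, List[Pair]]) -> List[Tuple[str, Pair]]:
    """Return the union of all slices, tagging each pair with its slice name."""
    training_set = []
    for name in sorted(slices):  # sorted for a deterministic order
        for pair in slices[name]:
            training_set.append((name, pair))
    return training_set


# Hypothetical stand-ins for the contents of data/pile.yaml and data/soda.yaml.
slices = {
    "pile": [("What is 2 + 2?", "4")],
    "soda": [("Hi!", "Hello, how can I help?"), ("Bye", "Goodbye!")],
}
training_set = build_training_set(slices)
```

Keeping the slice name next to each pair is a design choice of this sketch: it lets a later step apply per-slice weights without re-reading the files.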
You can provide data in various formats.
- You can provide a collection of input/output pairs
IOPairs:
- input: INPUT TEXT STRING
  output: OUTPUT TEXT STRING
- input: INPUT TEXT STRING
  output: OUTPUT TEXT STRING
...
or pure text
Text:
- text: TEXT STRING
- text: TEXT STRING
...
- You can provide us the link to your dataset on HuggingFace
HuggingFace:
- link: LINK TO YOUR DATASET
- You can prepare your dataset in the OpenAI JSONL format (https://platform.openai.com/docs/guides/fine-tuning) and host it at a link that we can fetch with wget or curl
OpenAIJsonl:
- link: LINK TO YOUR DATASET
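As an illustration of how these formats relate to the (input, output) pairs described earlier, the following Python sketch converts already-parsed slices into pairs. Several things here are assumptions rather than documented behavior: the function names are hypothetical, pure text is mapped to a pair with an empty input, and the JSONL records are assumed to use "prompt" and "completion" keys. Check the linked OpenAI guide for the exact schema you are targeting.

```python
import json
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # one (input, output) example


def pairs_from_slice(doc: Dict[str, list]) -> List[Pair]:
    """Convert an already-parsed IOPairs or Text slice into (input, output) pairs."""
    if "IOPairs" in doc:
        return [(item["input"], item["output"]) for item in doc["IOPairs"]]
    if "Text" in doc:
        # Assumption: pure text has no input side, so the input is left empty.
        return [("", item["text"]) for item in doc["Text"]]
    raise ValueError(f"unsupported slice format: {sorted(doc)}")


def pairs_from_jsonl(text: str) -> List[Pair]:
    """Convert JSONL text into pairs, assuming "prompt"/"completion" keys."""
    pairs = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            record = json.loads(line)
            pairs.append((record["prompt"], record["completion"]))
    return pairs
```

For example, pairs_from_slice({"Text": [{"text": "Some document"}]}) returns a single pair whose input is the empty string.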
Each merged pull request will trigger (currently manually) the training of a model.
Hyperparameters, including the specific mixture of data, will be specified in training.yaml:
Training:
- lr: 0.0001
- momentum: 0.99
Mixture:
- pile: 0.5
- soda: 0.5
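One plausible reading of the Mixture weights is that they control how often each slice is drawn from during training. The Python sketch below illustrates that reading only; it is not the actual training code. The helper names are hypothetical, and the mixture is passed as a plain dictionary rather than parsed from training.yaml.

```python
import random
from typing import Dict, List


def normalize_mixture(mixture: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so they sum to 1."""
    total = sum(mixture.values())
    if total <= 0:
        raise ValueError("mixture weights must sum to a positive number")
    return {name: weight / total for name, weight in mixture.items()}


def sample_slice_names(mixture: Dict[str, float], k: int, seed: int = 0) -> List[str]:
    """Draw k slice names, each chosen with probability equal to its weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    probs = normalize_mixture(mixture)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=k)


# Flattened stand-in for the Mixture section shown above.
mixture = {"pile": 0.5, "soda": 0.5}
batch_sources = sample_slice_names(mixture, k=8)
```

Normalizing first means the weights do not have to sum to 1 in the configuration; for instance, weights of 1 and 3 would behave like 0.25 and 0.75.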
After training, a file training_log will be committed to the repository, and a file model.yaml will be made available in the repository, specifying where to find the model and (optionally) a Together API endpoint for querying it.
You can help us to make OpenChatKit better in three ways.
If you notice that the bot is not performing well, please open an issue, specifying your input, the bot's output, and a description of what is wrong with it (ideally including the right answer).
If you have data that you believe could be useful to fix some of the issues, please add your data into the data folder and make a pull request associated with the issue that you think this will fix.
We will review these pull requests, train a model, and merge them.
You don't always have to merge into the main branch. If you have specific things to try out (e.g., a text2sql bot), feel free to open a new branch and work there!
Let's work together to make the best open-source bot!