This project provides a set of utilities to assist users in fine-tuning OpenAI's GPT-3.5 Turbo model. The utilities are wrapped in a single `TrainGPT` class that lets users manage the entire fine-tuning lifecycle: uploading data files, starting training jobs, monitoring their progress, and managing the trained models.
I was using a collection of curl commands to interact with the OpenAI API and it got out of control, so I started to group things together. I work a lot from the interactive Python console to test and play with things, so having everything grouped helps. I also plan to release the other collections for dealing with inference on custom models and managing assets (files, embeddings, etc.).
- File Upload: Easily upload your fine-tuning data files.
- File List: See all your files (Uploaded and results of previous trainings).
- File Details: Get file details.
- Count tokens: Count tokens with tiktoken library.
- Start Training: Begin a new training job using your uploaded data.
- List Jobs: View all your current and past training jobs.
- Job Details: Retrieve detailed information about a specific training job.
- Cancel: Cancel a training job.
- Delete: Delete a training job.
- List Models: View all your current and past fine-tuned models, filtered into your own models and the standard models.
- List Model Summaries: View all your models, grouped by owner.
- Model Details: Retrieve detailed information about a specific model.
- Delete Model: Delete a fine-tuned model.
The code contains a `get_token_count()` method that counts the tokens in the training file using the tiktoken library. It runs all three available encoders ("cl100k_base", "p50k_base", "r50k_base") and shows the results for each one.
YOU WILL BE CHARGED ABOUT 10 TIMES THAT NUMBER OF TOKENS. So, if `get_token_count()` returns 100k tokens, you will be charged for 1M tokens.
Update: I was wrong here. There is an overhead, but it is not always 10x. For small files (100, 500, 1000, or 2000 tokens), trained tokens are 15k+; it seems you can't go below ~15k tokens, no matter how small your training file is.
For bigger files the overhead is still there, but lower. For a file with 3,920,281 tokens, trained tokens were 4,245,281, so the overhead is around 8%. For a file with 40,378,413 counted tokens, trained tokens were 43,720,882.
In short: the overhead is around 10x on very small files, but drops below 10% on larger files.
Here is a quick table with the overhead at different token levels:
Number of tokens in the training file | Number of charged tokens | Overhead
---|---|---
1,426 | 15,560 | 991%
3,920,281 | 4,245,281 | 8.29%
40,378,413 | 43,720,882 | 8.28%
92,516,393 | rejected: "File exceeds maximum size of 50000000 tokens for fine-tuning" | n/a
46,860,839 | 48,688,812 | 3.90% (some rows were removed by moderation)
25,870,859 | 26,903,007 | 3.99%
41,552,537 | 43,404,802 | 4.46%
It seems that there is a limit of 50,000,000 tokens per training file.
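A rough cost estimator based on the observations above. The ~15k floor and the ~10% worst-case overhead are my own guesses inferred from the handful of jobs in the table, not documented OpenAI behavior:

```python
def estimate_charged_tokens(counted: int) -> int:
    """Rough upper-bound estimate of charged tokens from a counted total.

    Assumes a ~15k minimum charge and up to ~10% overhead on larger
    files -- both inferred from observed jobs, not official numbers.
    """
    return max(15_000, int(counted * 1.10))


print(estimate_charged_tokens(1_426))      # small file: the floor dominates
print(estimate_charged_tokens(3_920_281))  # large file: ~10% overhead
```

Treat this only as a sanity check before submitting a job; actual charges depend on epochs and whatever preprocessing OpenAI applies.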
- API Key: Ensure you have set up your OpenAI API key. You can set it as an environment variable named `OPENAI_API_KEY`:

```shell
export OPENAI_API_KEY="your_api_key"
```
- Clone the Repository:

```shell
git clone https://github.com/your_username/chatgpt-fine-tuning-utilities.git
cd chatgpt-fine-tuning-utilities
```
- Install Dependencies:

```shell
pip install -r requirements.txt
```
Data needs to be in JSONL format, i.e. one JSON object per line (not a JSON array):

```json
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
```
Save it as `data.jsonl` in the root directory of the project.
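Before uploading, a quick stdlib-only sanity check can catch malformed lines. This helper is my own sketch (not part of the project); it only verifies that each line is a JSON object with a `messages` list:

```python
import json


def validate_jsonl(path: str) -> int:
    """Return the number of valid training examples in a JSONL file.

    Raises ValueError if any non-empty line is not a JSON object
    containing a "messages" list.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)  # raises on invalid JSON
            if not isinstance(record.get("messages"), list):
                raise ValueError(f"line {lineno}: missing 'messages' list")
            count += 1
    return count
```

Run it as `validate_jsonl("data.jsonl")` before uploading; OpenAI applies stricter checks of its own (roles, content fields), so this is only a first-pass filter.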
After setting up, you can utilize the `TrainGPT` class in your Python scripts as follows:
- Initialization:
Start by importing and initializing the `TrainGPT` class.

```python
from train_gpt_utilities import TrainGPT

trainer = TrainGPT()
```
- Upload Training Data:
Upload your training data file to start the fine-tuning process.

```python
trainer.create_file("path/to/your/training_data.jsonl")
```
- Start a Training Job:
Begin the training process using the uploaded file.

```python
trainer.start_training()
```
- Listing All Jobs:
You can list all your current and past training jobs.

```python
jobs = trainer.list_jobs()
```

You will get something like this:

```
There are 1 jobs in total.
1 jobs of fine_tuning.job.
1 jobs succeeded.
List of jobs (ordered by creation date):
- Job Type: fine_tuning.job
  ID: ftjob-Sq3nFz3Haqt6fZwqts321iSH
  Model: gpt-3.5-turbo-0613
  Created At: 2023-08-24 04:19:56
  Finished At: 2023-08-24 04:29:55
  Fine Tuned Model: ft:gpt-3.5-turbo-0613:iongpt::7qwGfk6d
  Status: succeeded
  Training File: file-n3kU9Emvvoa8wRrewaafhUv
```
When the status is "succeeded", your model is ready to use. You can jump to step 7 to find the fine-tuned model.
If you have multiple jobs in the list, you can use the ID to fetch the details of a specific job.
- Fetching Job Details:
You can get detailed statistics for a specific training job.

```python
job_details = trainer.get_job_details("specific_job_id")
```
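Training can take a while, so a polling loop is handy. A minimal sketch: the fetch function is passed in as a callable so the loop does not depend on this project's API. How you extract the status string from `get_job_details` (e.g. a `status` key or attribute) is an assumption you should adapt to the actual return value:

```python
import time


def wait_for_job(fetch_status, poll_seconds=30, timeout_seconds=3600):
    """Poll `fetch_status()` until it returns a terminal status string.

    `fetch_status` is any zero-argument callable returning the job
    status, e.g. something like
    `lambda: trainer.get_job_details(job_id).status` (hypothetical
    shape -- check what your client actually returns).
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("succeeded", "failed", "cancelled"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish in time")
```

A 30-second poll interval is plenty; jobs typically run for minutes to hours, and hammering the API gains nothing.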
- Cancel a Job:
If something goes wrong, you can cancel a training job while it is still running.

```python
trainer.cancel_job("specific_job_id")
```
- Find the Fine-Tuned Model:
For this we will use the `list_models_summaries` method.

```python
models = trainer.list_models_summaries()
```

You will get something like this:

```
You have access to 61 models. Those models are owned by:
openai: 20 models
openai-dev: 32 models
openai-internal: 4 models
system: 2 models
iongpt: 3 models
```

Then, you can use the owner to fetch the details of the models from a specific owner. The fine-tuned model will be in that list.
- List Models by Owner:

```python
trainer.list_models_by_owner("iongpt")
```

You will get something like this:

```
Name: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
Created: 2023-04-12 17:05:19
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:2020-05-03
-----------------------------
Name: ada:ft-iongpt:url-mapping-2023-04-12-18-07-26
Created: 2023-04-12 18:07:26
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
-----------------------------
Name: davinci:ft-iongpt:url-mapping-2023-04-12-15-54-23
Created: 2023-04-12 15:54:23
Owner: iongpt
Root model: davinci:2020-05-03
Parent model: davinci:2020-05-03
-----------------------------
Name: ft:gpt-3.5-turbo-0613:iongpt::7qy7qwVC
Created: 2023-08-24 06:28:54
Owner: iongpt
Root model: sahara:2023-04-20
Parent model: sahara:2023-04-20
-----------------------------
```
The CLI has not been tested yet. Please use the Python class for now, ideally from a Python interactive shell.
- Uploading a File:

```shell
python train_gpt_cli.py --create-file /path/to/your/file.jsonl
```
- Starting a Training Job:

```shell
python train_gpt_cli.py --start-training
```
- Listing All Jobs:

```shell
python train_gpt_cli.py --list-jobs
```
For any command that requires a specific job or file ID, you can provide it as an argument. For example:

```shell
python train_gpt_cli.py --get-job-details your_job_id
```
- Add support for inference on custom fine-tuned models
- Add support for embeddings
We welcome contributions to this project. If you find a bug or want to add a feature, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See LICENSE for more details.