DatasetForge ⚒️

DatasetForge is a Python project that extracts data from Google Sheets and converts it into a JSONL dataset suitable for fine-tuning OpenAI models such as davinci-002. It also uses the tiktoken library to estimate the cost of fine-tuning.

Requirements ⭐

You must have Google Sheets data that follows the legacy prompt-completion structure. Refer to sheets_sample.ods for details.
You must create a Google Service Account in Google Cloud Platform.
You must enable the Google Sheets API for that Google Service Account.
You must have the credentials for that Google Service Account.
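
As a rough illustration of the prompt-completion structure, the sketch below turns sheet rows into JSONL lines. The column names `prompt` and `completion` and the helper `rows_to_jsonl` are assumptions made for this example, not taken from DatasetForge's code; check sheets_sample.ods for the actual layout.

```python
import json


def rows_to_jsonl(rows):
    """Convert rows (header row first) into JSONL text, one record per line.

    Hypothetical helper: assumes the header contains "prompt" and "completion".
    """
    header, *data = rows
    prompt_idx = header.index("prompt")
    completion_idx = header.index("completion")

    lines = []
    for row in data:
        # Skip rows that are too short to hold both columns.
        if len(row) <= max(prompt_idx, completion_idx):
            continue
        prompt = row[prompt_idx].strip()
        completion = row[completion_idx].strip()
        # Skip rows where either side of the pair is empty.
        if not prompt or not completion:
            continue
        record = {"prompt": prompt, "completion": completion}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


sample_rows = [
    ["prompt", "completion"],
    ["What is the capital of France?", "Paris"],
    ["", "orphaned completion"],  # dropped: empty prompt
    ["2 + 2 =", "4"],
]
print(rows_to_jsonl(sample_rows))
```

Each output line is an independent JSON object, which is what makes the file JSONL rather than a single JSON array.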

How to Run the Project 🏃🏽‍♂️

Step 1: Clone the repo

Open Git Bash and run:

  git clone https://github.com/farithadnan/DatasetForge.git

Step 2: Installation

Install the required Python packages by running the command below in your terminal:

  pip install -r requirements.txt

Step 3: Set Up Google Sheets Config

Ensure that the configuration file (e.g., config.yaml) contains essential settings such as:

Path to Google Sheets credentials file (private keys).
URL of the Google Sheet to extract data from.
Index of the specific sheet within the Google Sheet.
Name for the output JSONL file.

Refer to config.yaml.sample for more information.
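
The settings listed above could be checked before a run with something like the following sketch. The key names (`credentials_path`, `sheet_url`, `sheet_index`, `output_file`), the example values, and the `validate_config` helper are all hypothetical; the real key names are the ones in config.yaml.sample.

```python
# Hypothetical key names and expected types; the actual names live in
# config.yaml.sample and may differ.
REQUIRED_KEYS = {
    "credentials_path": str,  # path to the service account credentials file
    "sheet_url": str,         # URL of the Google Sheet to extract from
    "sheet_index": int,       # index of the sheet within the Google Sheet
    "output_file": str,       # name of the output JSONL file
}


def validate_config(config):
    """Return a list of problems found in a config mapping (empty if none)."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    return problems


# Placeholder values for illustration only.
example_config = {
    "credentials_path": "config/credentials.json",
    "sheet_url": "https://docs.google.com/spreadsheets/d/<your-sheet-id>",
    "sheet_index": 0,
    "output_file": "dataset.jsonl",
}
print(validate_config(example_config))
```

Returning a list of problems instead of raising on the first one lets you report every missing or mistyped setting in a single pass.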

Step 4: Set up model for Encoding

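To estimate the cost of fine-tuning on your dataset, configure the encoding in config.yaml. By default it is set to the r50k_base encoding, which tiktoken uses for the original GPT-3 models such as davinci. Note that tiktoken maps davinci-002 to cl100k_base, so change the setting if you want token counts that match that model.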

For more details, refer to the OpenAI Cookbook guide "How to count tokens with tiktoken".
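
A cost estimate of this kind might look like the sketch below. It is not DatasetForge's own implementation: `count_tokens` and `estimate_cost` are hypothetical helpers, the price is a placeholder parameter you must fill in from current OpenAI pricing, and it assumes billing scales with total tokens times the number of training epochs.

```python
def count_tokens(texts, encoding_name="r50k_base"):
    """Count tokens in each text using a tiktoken encoding."""
    import tiktoken  # third-party package: pip install tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return [len(encoding.encode(text)) for text in texts]


def estimate_cost(token_counts, price_per_1k_tokens, epochs=1):
    """Estimate cost as total tokens x epochs x price per 1,000 tokens."""
    total_tokens = sum(token_counts) * epochs
    return total_tokens / 1000 * price_per_1k_tokens


# Made-up token counts and a placeholder price, purely to show the arithmetic.
cost = estimate_cost([120, 80, 300], price_per_1k_tokens=0.01, epochs=2)
print(f"Estimated cost: ${cost:.4f}")
```

In practice you would pass the output of `count_tokens` (run over each prompt and completion in the dataset) into `estimate_cost`; the two are kept separate so the arithmetic can be checked without downloading an encoding.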

Step 5: Run the Project

Activate your virtual environment, then run the main Python script:

  python app.py

This authenticates with Google Sheets, extracts the specified data, and converts it into JSONL format, producing a dataset ready for fine-tuning.
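
Before uploading the result, it can be worth checking that every line of the output parses as a prompt-completion record. The sketch below is one way to do that; `check_jsonl` is a hypothetical helper that is not part of DatasetForge, and the field names `prompt` and `completion` are assumed from the legacy format.

```python
import json


def check_jsonl(text):
    """Return (line_number, message) pairs for lines that are not valid records."""
    errors = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            errors.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "record is not a JSON object"))
            continue
        # Both fields must be present and must be strings.
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                errors.append((number, f"missing or non-string field: {key}"))
    return errors


sample = '{"prompt": "2 + 2 =", "completion": "4"}\n{"prompt": "no completion"}'
for line_number, message in check_jsonl(sample):
    print(f"line {line_number}: {message}")
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `check_jsonl`; an empty result means every line has both fields.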
