Scripts to prepare catalogue data.
Clone this repo.
Install git-lfs (https://github.com/git-lfs/git-lfs/wiki/Installation):
```shell
sudo apt-get install git-lfs
git lfs install
```
Install system dependencies:

```shell
sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar
```
Create a virtual environment, activate it, and install the Python dependencies:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create a User Access Token (with write access) on the Hugging Face Hub (https://huggingface.co/settings/token) and set the following environment variables in a `.env` file at the root directory:

```
HF_USERNAME=<Replace with your Hugging Face username>
HF_USER_ACCESS_TOKEN=<Replace with your Hugging Face API token>
GIT_USER=<Replace with your Git user>
GIT_EMAIL=<Replace with your Git email>
```
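These variables can be loaded in a few ways; a minimal stdlib-only sketch is below (the helper name `load_env` and the simple `KEY=value` parsing are assumptions — in practice a library such as python-dotenv is commonly used):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader sketch: assumes plain KEY=value lines,
    no quoting or multi-line values. Existing variables win."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```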
To create dataset metadata (in the file `dataset_infos.json`), run:

```shell
python create_metadata.py --repo <repo_id>
```

where you should replace `<repo_id>` with the ID of the dataset repository, e.g. `bigscience-catalogue-lm-data/lm_ca_viquiquad`.
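As a quick sanity check, the generated file can be inspected with the standard `json` module. The sample payload below (config name → splits → `num_examples`) only mirrors the general shape used by the `datasets` library and is illustrative, not real repo output:

```python
import json

# Illustrative stand-in for a generated dataset_infos.json payload;
# the config name and counts here are made up.
sample = json.dumps({
    "default": {"splits": {"train": {"num_examples": 1000}}}
})

infos = json.loads(sample)
for config, info in infos.items():
    for split, meta in info["splits"].items():
        print(config, split, meta["num_examples"])
```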
To create an aggregated dataset from multiple datasets and save it as sharded JSON Lines GZIP files, run:

```shell
python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>
```

where you should replace:
- `<path_to_file_with_dataset_ratios>`: path to a JSON file containing a dict with dataset names (keys) and their ratios (values) between 0 and 1
- `<dir_path_to_save_aggregated_dataset>`: directory path where the aggregated dataset will be saved
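A minimal sketch of writing such a ratios file (the second dataset name and the file name `dataset_ratios.json` are illustrative assumptions):

```python
import json

# Hypothetical ratios file: dataset name -> sampling ratio in [0, 1].
# "lm_en_some_corpus" is a made-up name for illustration.
ratios = {
    "lm_ca_viquiquad": 0.5,
    "lm_en_some_corpus": 0.1,
}
with open("dataset_ratios.json", "w") as f:
    json.dump(ratios, f, indent=2)
```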
Download the Stanza models for the supported languages:

```python
import stanza

for lang in {"ar", "ca", "eu", "id", "vi", "zh-hans", "zh-hant"}:
    stanza.download(lang, logging_level="WARNING")
```
Clone the Indic NLP resources and point `INDIC_RESOURCES_PATH` to the cloned repository:

```shell
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
export INDIC_RESOURCES_PATH=<PATH_TO_REPO>
```
Download the NLTK Punkt tokenizer models:

```python
import nltk

nltk.download("punkt")
```