Description
I used the tokenize_data.py script in the together-python repo to generate a parquet file with tokenized data, following the steps at https://docs.together.ai/docs/fine-tuning-data-preparation#tokenized-data:

```sh
python tokenize_data.py --tokenizer openai-community/gpt2 --out-filename dataset.parquet
```
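For what it's worth, you can inspect the generated file's format version with pyarrow. This is a quick diagnostic sketch (it assumes pyarrow is installed and that the script wrote the file via a recent pyarrow, whose writers default to format version 2.x):

```python
import pyarrow.parquet as pq

# Read only the footer metadata; no row data is loaded.
metadata = pq.ParquetFile("dataset.parquet").metadata

# Recent pyarrow writers default to format version 2.x,
# while parquetjs only understands version 1.
print(metadata.format_version)                   # e.g. "2.6"
print(metadata.num_rows, metadata.num_columns)   # sanity check
```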
I then tried to upload this dataset to Together:

```ts
import { upload } from 'together-ai/lib/upload'

await upload('dataset.parquet')
```

This failed with:

```
failed to read parquet file dataset.parquet
```
With a bit of added logging in together-typescript, I found the more specific underlying error:

```
invalid parquet version
```

This is thrown inside the parquetjs library, which hasn't been maintained in over 5 years and does not support the majority of modern parquet files; it appears to hard-code support for format version 1 and reject anything newer. #102 also ran into a parquet parsing issue. It might be worth switching to a different parquet library.
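In the meantime, a possible workaround is to rewrite the file with the legacy format version before uploading. This is only a sketch (I haven't verified it against the upload endpoint, and it assumes the tokenized columns survive a plain read/write round trip):

```python
import pyarrow.parquet as pq

# Load the tokenized dataset produced by tokenize_data.py.
table = pq.read_table("dataset.parquet")

# Rewrite it so the footer metadata reports format version 1
# (the only version parquetjs accepts), with v1 data pages.
pq.write_table(
    table,
    "dataset_v1.parquet",
    version="1.0",
    data_page_version="1.0",
)
```

Uploading dataset_v1.parquet instead should then at least get past the footer version check.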