Parquet upload fails

I generated a Parquet file of tokenized data with the `tokenize_data.py` script from the `together-python` repo, following the steps at https://docs.together.ai/docs/fine-tuning-data-preparation#tokenized-data

```
python tokenize_data.py --tokenizer openai-community/gpt2 --out-filename dataset.parquet
```

I then tried to upload this dataset to Together:

```javascript
import { upload } from 'together-ai/lib/upload'
await upload('dataset.parquet')
// => "failed to read parquet file dataset.parquet"
```

With a bit of added logging in `together-typescript`, I found a more specific error message:

```
invalid parquet version
```

This error comes from inside the `parquetjs` library, which has gone woefully unmaintained for 5+ years and does not support the majority of modern Parquet files. #102 also ran into a Parquet parsing issue. It might be worth switching to a different Parquet library.




Parquet upload fails #104
