Question regarding error metrics/dataset creation #15

I have a few questions about the HDF5 dataset linked in the notebook:
1. I ran the notebook to train from scratch using the existing HDF5 and obtained a CER of ~0.09 with a single model (not an ensemble).
2. When I create the HDF5 from scratch and run the same training procedure, my CER is similar to that of the best/second-best models (~0.16-0.18).
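
For reference, here is how I understand CER: total character-level edit distance divided by the total number of reference characters. This is only a minimal sketch under that assumption; the notebook's metric may normalize differently (for example, per sample), and the function names are my own.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: summed edits over summed reference length (assumed)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars
```

If the notebook averages per-sample CER instead, the two numbers are not directly comparable, which could account for part of the gap.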

As far as I can see, the main difference must be in the dataset generation/preprocessing steps or in the tokenizer:
a. In the notebook there's a comment that the pretrained models used a vocab size of 100 rather than 99 (95 characters + SOS/EOS/PAD/UNK tokens). Is there an additional token used here?
b. Was the generation procedure for the HDF5 linked on the Google Drive slightly different?
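
To make the vocab-size arithmetic in (a) concrete, this is roughly how I arrive at 99. It is a sketch only: I am assuming the 95 characters are the printable ASCII range (space through `~`), and the special-token strings are placeholder names, not necessarily the ones used in the repo.

```python
# Hypothetical special-token names; the repo may spell these differently.
SPECIAL_TOKENS = ["<SOS>", "<EOS>", "<PAD>", "<UNK>"]

# Assumption: the 95 characters are printable ASCII, code points 32..126.
CHARS = [chr(c) for c in range(32, 127)]

vocab = SPECIAL_TOKENS + CHARS
token_to_id = {tok: idx for idx, tok in enumerate(vocab)}

print(len(CHARS), len(vocab))  # 95 99
```

With this construction there is no room for a 100th entry, hence the question about an additional token.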

Thank you!
