Question regarding error metrics/dataset creation #15

@nsrishankar

I have a few questions/clarifications regarding the hdf5 dataset that is linked in the notebook:

  1. I ran the training notebook from scratch using the existing hdf5 and obtained a CER of ~0.09 with just a single model (not an ensemble).
  2. When I create the hdf5 from scratch and run the same training procedure, my CER is similar to that of the best/second-best models (~0.16-0.18).
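
For reference, the CER figures above are assumed to mean total character-level edit distance divided by total reference length. The sketch below shows that definition only; it is not necessarily how the notebook computes it, and the function names `edit_distance` and `cer` are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: summed edit distance over summed reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


# One wrong character out of eleven gives a CER of 1/11, roughly 0.09.
print(round(cer(["hello world"], ["hallo world"]), 4))
```

If the notebook averages per-line CER instead of pooling edits over the whole set, the numbers can differ slightly, so that is worth checking when comparing runs.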

So, as far as I can see, the main difference would be in the dataset generation/preprocessing steps or in the tokenizer:
a. In the notebook there's a comment that the pretrained models used a vocab size of 100 as opposed to 99 (95 characters + SOS/EOS/PAD/UNK tokens). Is there an additional token used here?
b. Was the generation procedure for the hdf5 that was linked on Google Drive slightly different?
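
The vocab arithmetic in (a) can be sketched as follows. This assumes the 95 characters are the printable ASCII range (space through `~`), which may not match the repo's actual character set; the special-token strings and the helper `build_vocab` are hypothetical too.

```python
# Hypothetical special-token names; the repo's actual strings may differ.
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# Assumption: the 95 characters are printable ASCII, code points 0x20..0x7E.
CHARS = [chr(code) for code in range(0x20, 0x7F)]


def build_vocab(chars, special_tokens):
    """Map every token to an integer id, special tokens first."""
    vocab = {tok: idx for idx, tok in enumerate(special_tokens)}
    for ch in chars:
        vocab.setdefault(ch, len(vocab))  # skip anything already present
    return vocab


vocab = build_vocab(CHARS, SPECIAL_TOKENS)
print(len(CHARS), len(vocab))  # 95 99
print(100 - len(vocab))        # 1 id unaccounted for relative to the pretrained size
```

Under these assumptions the count is 99, so a size of 100 would imply one more id, whether an extra token or an unused slot.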

Thank you!
