dataset stores all the datasets used in my experiments:
- UniMER-1M and HME100K, which are merged into UniMER-1M_merged for training
- UniMER-Test for testing
- simple for debugging
- hf_datasets for converting the datasets to Hugging Face (hf) format and pushing them to the Hugging Face Hub
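
The merge of UniMER-1M and HME100K into UniMER-1M_merged can be sketched roughly as below. This is only a minimal illustration, not the actual merge script: it assumes each dataset has a label file with one `image_name<TAB>latex` pair per line, which may not match the real on-disk layout, and the names `read_labels` and `merge_label_files` are hypothetical.

```python
from pathlib import Path


def read_labels(label_file: Path, source: str) -> list[tuple[str, str]]:
    """Read 'image_name<TAB>latex' lines (assumed format) and prefix
    each image name with its source dataset so names cannot collide."""
    samples = []
    for line in label_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        image_name, latex = line.split("\t", maxsplit=1)
        samples.append((f"{source}/{image_name}", latex.strip()))
    return samples


def merge_label_files(sources: dict[str, Path], merged_file: Path) -> int:
    """Concatenate several label files into one merged label file.

    Returns the number of samples written.
    """
    merged = []
    for source, label_file in sources.items():
        merged.extend(read_labels(label_file, source))
    merged_file.parent.mkdir(parents=True, exist_ok=True)
    merged_file.write_text(
        "".join(f"{name}\t{latex}\n" for name, latex in merged),
        encoding="utf-8",
    )
    return len(merged)
```

Prefixing every image name with its source dataset keeps entries such as `0.png` from both datasets distinct after the merge.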
tokenizer stores the handcrafted tokenizer I made for LaTeX (more precisely, KaTeX) commands.
unimernet_tokenizer stores the original tokenizer used by UniMERNet and PPFormulaNet.
unimernet_tokenizer_distill stores the distilled version of that original tokenizer.