Closed
Labels
good first issue (Good for newcomers), outreachy (Issues targeted at Outreachy applicants)
Description
Why it matters
Training logs often contain accented characters or non-English text. Right now `Kaun.Dataset.from_text_file` ignores its `encoding` argument and just returns raw bytes. As soon as you point it at UTF-8 or Latin-1 files, you risk broken strings or exceptions, which makes the monitoring dashboard unusable on real datasets.
How to see the gap
Skim `kaun/lib/kaun/dataset/dataset.ml`, around `from_text_file`. The function binds its `encoding` argument to `_` and never decodes the file. If you create a small UTF-8 file with emoji and iterate over the dataset, the text comes back as mangled characters.
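For a concrete repro, the fixture can be built with nothing but the standard library; the snippet below is a sketch, and the dataset call it points at in the trailing comment is the one from this issue:

```ocaml
(* Repro sketch: write a small UTF-8 file containing an emoji (U+1F600). *)
let () =
  let path = Filename.temp_file "kaun_repro" ".txt" in
  let oc = open_out_bin path in
  output_string oc "caf\xc3\xa9 \xf0\x9f\x98\x80\n"; (* "café 😀" in UTF-8 *)
  close_out oc;
  print_endline path
  (* Point Kaun.Dataset.from_text_file at [path] and iterate over the
     resulting dataset: today the lines come back as mangled characters. *)
```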
Your task
- Honor the `encoding` parameter in `from_text_file` (and the helpers that call it) by decoding each chunk before splitting on newlines (a sketch follows this list).
- Add tests in `kaun/test/test_dataset.ml` that cover UTF-8 and Latin-1 snippets so we know the decoding works.
- Make sure the default behaviour stays the same when callers do not pass `~encoding`.
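One possible shape for the decoding step, as a minimal sketch: it assumes the `encoding` value is (or can be mapped to) a `Uutf.decoder_encoding` such as `` `UTF_8 `` or `` `ISO_8859_1 ``, which may not match Kaun's actual parameter type.

```ocaml
(* Sketch: decode one raw chunk into a UTF-8 OCaml string with Uutf. *)
let decode_chunk ~encoding (chunk : string) : string =
  let dec = Uutf.decoder ~encoding (`String chunk) in
  let buf = Buffer.create (String.length chunk) in
  let rec loop () =
    match Uutf.decode dec with
    | `Uchar u -> Uutf.Buffer.add_utf_8 buf u; loop ()
    | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; loop ()
    | `End -> ()
    | `Await -> assert false (* unreachable with a `String source *)
  in
  loop ();
  Buffer.contents buf
```

One caveat with strictly per-chunk decoding: a multibyte sequence can straddle a chunk boundary, so keeping a single decoder alive across chunks (Uutf's `` `Manual `` source, fed with `Uutf.Manual.src`) is safer than creating a fresh decoder for every chunk.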
Tips
- The `Uutf` library is already available through Raven; it can decode incrementally from a Bigarray-backed string.
- Keep the chunked reading logic intact; just convert the bytes to OCaml strings with the right encoding as they arrive.
- Use `Filename.temp_file` (already in the test helpers) to build short fixtures that contain characters outside plain ASCII (for example, describe an emoji with its code point); see the fixture sketch after this list.
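For the fixtures, a test along these lines could work; `first_line` is a hypothetical helper standing in for however the existing tests pull lines out of a dataset, and the form of the `~encoding` value is likewise an assumption:

```ocaml
(* Sketch of a Latin-1 fixture test; [first_line] is hypothetical. *)
let test_latin1_roundtrip () =
  let path = Filename.temp_file "kaun_dataset" ".txt" in
  let oc = open_out_bin path in
  output_string oc "caf\xe9\n"; (* 0xE9 is U+00E9 (é) in Latin-1 *)
  close_out oc;
  let ds = Kaun.Dataset.from_text_file ~encoding:`ISO_8859_1 path in
  (* Decoded to UTF-8, é should come back as the bytes 0xC3 0xA9. *)
  assert (first_line ds = "caf\xc3\xa9");
  Sys.remove path
```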
Done when
- Passing a non-default `~encoding` produces correctly decoded strings.
- The dataset tests cover at least one UTF-8 example and one Latin-1 example.
- `dune runtest kaun` passes after your changes.