
Teach Kaun's text reader about encodings #108

@tmattio

Why it matters

Training logs often contain accented characters or non-English text. Right now Kaun.Dataset.from_text_file ignores its encoding argument and just returns raw bytes. As soon as you point it at UTF-8 or Latin-1 files, you risk broken strings or exceptions, which makes the monitoring dashboard unusable on real datasets.

How to see the gap

Skim kaun/lib/kaun/dataset/dataset.ml, around from_text_file. The function binds its encoding argument to _ and never decodes the file. If you create a small UTF-8 file with an emoji and iterate over the dataset, the text comes back as mangled characters.

Your task

  • Honor the encoding parameter in from_text_file (and the helpers that call it) by decoding each chunk before splitting on newlines (see the sketch after this list).
  • Add tests in kaun/test/test_dataset.ml that cover UTF-8 and Latin-1 snippets so we know the decoding works.
  • Make sure the default behaviour stays the same when callers do not pass ~encoding.
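
For orientation, here is a minimal sketch of the per-chunk decoding, not the actual dataset code. It assumes the reader hands over raw Bytes.t chunks and that decoded text is re-encoded as UTF-8 before the existing newline split runs; the helper names (make_decoder, decode_chunk, decode_eof) are hypothetical.

```ocaml
(* One decoder per file, fed manually as chunks are read. *)
let make_decoder (encoding : Uutf.decoder_encoding) : Uutf.decoder =
  Uutf.decoder ~encoding `Manual

(* Pull decoded characters out of [d] and re-encode them as UTF-8 in [buf].
   Stops on `Await (decoder needs more input) or `End (input finished). *)
let drain d buf =
  let rec go () =
    match Uutf.decode d with
    | `Uchar u -> Uutf.Buffer.add_utf_8 buf u; go ()
    | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; go ()
    | `Await | `End -> Buffer.contents buf
  in
  go ()

(* Feed one raw chunk and return the text decoded so far. *)
let decode_chunk d (chunk : Bytes.t) : string =
  let buf = Buffer.create 1024 in
  Uutf.Manual.src d chunk 0 (Bytes.length chunk);
  drain d buf

(* After the last chunk, a zero-length range signals end of input;
   drain whatever the decoder still holds. *)
let decode_eof d : string =
  Uutf.Manual.src d Bytes.empty 0 0;
  drain d (Buffer.create 16)
```

Keeping a single decoder per file (rather than one per chunk) matters because a multi-byte character can straddle a chunk boundary; the `Await case is exactly what lets the decoder resume on the next chunk.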

Tips

  • The Uutf library is already available through Raven; it can decode incrementally from a Bigarray-backed string.
  • Keep the chunked reading logic intact—just convert the bytes to OCaml strings with the right encoding as they arrive.
  • Use Filename.temp_file (already in the test helpers) to build short fixtures that contain characters outside plain ASCII (for example, describe an emoji with its code point); a fixture sketch follows this list.
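
As an illustration, fixture helpers along these lines would cover both encodings. The names write_fixture, utf8_fixture, and latin1_fixture are made up here, and the byte escapes keep the test source plain ASCII.

```ocaml
(* Illustrative fixture builders for kaun/test/test_dataset.ml; the
   existing test helpers may already offer something equivalent. *)
let write_fixture contents =
  let path = Filename.temp_file "kaun_dataset" ".txt" in
  let oc = open_out_bin path in
  output_string oc contents;
  close_out oc;
  path

(* UTF-8: "café" plus U+1F600 (grinning face). *)
let utf8_fixture () = write_fixture "caf\xc3\xa9 \xf0\x9f\x98\x80\nsecond line\n"

(* Latin-1: the same "café", with é encoded as the single byte 0xE9. *)
let latin1_fixture () = write_fixture "caf\xe9\nsecond line\n"
```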

Done when

  • Passing a non-default ~encoding produces correctly decoded strings.
  • The dataset tests cover at least one UTF-8 example and one Latin-1 example.
  • dune runtest kaun passes after your changes.

Labels

good first issue (Good for newcomers), outreachy (Issues targeted at Outreachy applicants)