
feat: Add built-in CSV loader #19167


Open · wants to merge 2 commits into main

Conversation

@mastermakrela commented Apr 21, 2025

What does this PR do?

Bun already supports natively importing various file types, which makes quick scripting much easier. However, one commonly used and straightforward format was missing: CSV. It is a ubiquitous, very basic format¹, and having built-in support for it would be a helpful addition.

This pull request adds new loaders that allow importing CSV files as JavaScript arrays of records (objects) or arrays of arrays. This implementation is minimal and slightly constrained by the current limitations in passing import options². For example, it's not yet possible to do:

import table from "./data.csv" with { type: "csv", header: "false" };

I've based this implementation on the official CSV specification (RFC 4180), and extended it to also support TSV (tab-separated values) files.

Design Choices and Rationale

One design decision worth noting is the inclusion of four new loaders, rather than just a single csv loader. Originally, I intended to provide one generic loader. However, due to the current architecture—which doesn't allow accessing import options from within the loader itself (see this relevant section of the code)—it made more sense to cover the most common use cases explicitly.

These loaders handle two variables:

  • The delimiter: either a comma (, for CSV) or a tab (\t for TSV)
  • The presence of a header row: either true (default) or false

By covering these combinations, we support the most typical use cases out of the box.
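
For illustration, the four covered combinations would be consumed roughly like this (file names are hypothetical; the ?no_header convention is explained below):

```ts
import records from "./data.csv";           // comma-delimited, header row -> array of objects
import rows from "./data.csv?no_header";    // comma-delimited, no header  -> array of arrays
import tsvRecords from "./data.tsv";        // tab-delimited, header row   -> array of objects
import tsvRows from "./data.tsv?no_header"; // tab-delimited, no header    -> array of arrays
```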

To enable this, I've added multiple new module types in packages/bun-types/extensions.d.ts. The type of the default export depends on the presence of a header row:

  • If headers are present: the loader returns an array of objects
  • If headers are absent: it returns an array of arrays

To distinguish these cases, I’ve used the ?no_header query string (e.g., data.csv?no_header). This approach works because:

  • It’s currently the only TypeScript-compatible way to define distinct types for the same file extension
  • The query string is otherwise ignored during import, making it a potential candidate for future enhancements
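
A minimal sketch of what such declarations can look like (the exact types shipped in this PR may differ; analogous declarations would exist for *.tsv):

```ts
declare module "*.csv" {
  // header row present: one object per record, keyed by column name
  const records: Record<string, string>[];
  export default records;
}

declare module "*.csv?no_header" {
  // no header row: one string array per row
  const rows: string[][];
  export default rows;
}
```

TypeScript matches the import specifier against the whole wildcard pattern, so the ?no_header suffix can carry its own declaration even though both patterns point at the same file extension.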

Edge Case: Empty File Import

While writing tests, I encountered a bug related to importing empty files: Issue #19164. Currently, importing an empty file results in an empty object. However, for CSV and TSV, I believe the correct behavior should be to return an empty array, as the default export.
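
A sketch of the difference (empty.csv is a hypothetical zero-byte fixture):

```ts
import empty from "./empty.csv"; // zero-byte file

console.log(empty); // current behavior: {} (see #19164)
// proposed behavior for CSV/TSV: [] - an empty array of records
```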


Checklist

  • Code changes
  • Documentation or TypeScript types (not required for this PR)

How did you verify your code works?

I’ve written tests for CSV and TSV imports, following the pattern of the existing TOML import tests.
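
For illustration, a test in that style (the fixture path and values are made up; bun:test is Bun's built-in test runner):

```ts
import { expect, test } from "bun:test";

test("csv import with header row", async () => {
  // hypothetical fixture people.csv: "name,age\nAda,36\nGrace,45"
  const { default: records } = await import("./fixtures/people.csv");
  expect(records).toEqual([
    { name: "Ada", age: "36" },
    { name: "Grace", age: "45" },
  ]);
});
```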

If Zig files changed:

  • I verified memory lifetimes (allocation and deallocation) where applicable
  • I included tests for the new code, or existing tests cover the changes
  • I wrote TypeScript/JavaScript tests, and they pass locally using bun-debug test test-file-name.test

I'm still new to Zig, so I haven’t yet verified the memory handling manually. If someone can guide me on how to do that, I’d love to learn!


This is my first contribution to Bun, so feedback is very welcome. Please let me know if I’ve missed anything, done something incorrectly, or should add more context or documentation.

It should also address this issue: #6722


Footnotes

  1. The format is so old it doesn't really change anymore, so once the parser is working it should require no further work in the future, meaning it should be a net positive (more features; no new stuff to maintain). Of course, in the long run, one could think about iterators, streaming from disk, SIMD, and other optimizations, but for now, not having to install anything is more than enough :)

  2. I spent around 10 hours exploring how import options might be accessed within the loader—see this section of the source. From what I understand, parsing and transpiling are currently decoupled, and the loader is chosen based solely on the file extension. That makes it difficult to pass custom import options to the parser. This might be worth discussing or exploring in a future PR.

@kravetsone

If Bun someday works well in Jupyter notebooks, that would be an awesome feature.

@Jarred-Sumner (Collaborator)

Very exciting. Thank you for this.

Initial thoughts:

  • How do PapaParse and other CSV parsers handle leading/trailing quotes and whitespace, both between cells and within cells? Do they handle non-ASCII newlines? If yes, we should assume that we need to as well, which means using strings.CodepointIterator instead of iterating byte by byte
  • Can you add about 50 more tests for various cases involving headers, no headers, trailing whitespace, leading whitespace, inconsistent number of commas?
  • what is in the test suite of other CSV parsers that we should copy?

@mastermakrela (Author)

@Jarred-Sumner thank you for the quick feedback :D

  • Can you add about 50 more tests for various cases involving headers, no headers, trailing whitespace, leading whitespace, inconsistent number of commas?

Will do 🫡

  • How do PapaParse and other CSV parsers handle leading/trailing quotes and whitespace, both between cells and within cells? Do they handle non-ASCII newlines? If yes, we should assume that we need to as well, which means using strings.CodepointIterator instead of iterating byte by byte
  • what is in the test suite of other CSV parsers that we should copy?

I don't know the answers directly, but I'll try to find some time this week to do more research.

@mastermakrela (Author)

Parsing

Both my go-to CSV library and the creator of PapaParse agree with the RFC that leading/trailing whitespace is part of the field:

Unfortunately, the CSV spec specifically says: "Spaces are considered part of a field and should not be ignored." - if your CSV files are created with spaces after the commas, then the spaces are errors in the input and the generator needs to be fixed.
~ mholt/PapaParse#241 (comment)

There was a discussion about whether there should exist an option to trim the whitespace, but it was decided against it. Someday it could be put behind a flag.

AFAICT, all JS-based CSV parsers support Unicode, so there is no reason why we shouldn't - I'll update the code to use strings.CodepointIterator.
That also means we should support all known types of line breaks:

  1. ASCII line breaks:

    • \n (LF, Line Feed, U+000A)
    • \r (CR, Carriage Return, U+000D)
    • \r\n (CRLF, Windows-style line endings)
  2. Non-ASCII Unicode line breaks:

    • U+0085 (NEL, Next Line)
    • U+2028 (LS, Line Separator)
    • U+2029 (PS, Paragraph Separator)

I'll stay consistent with the RFC, just allow any of those symbols at the place where RFC uses CRLF.
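
To make those semantics concrete, a small sketch (the input and expected output are illustrative, not actual fixtures from this PR):

```ts
// Spaces are part of the field (RFC 4180), and with the extended
// line-break set, U+2028 terminates a record just like \r\n does.
const csv = "a, b \r\nc, d \u2028e, f ";
// expected rows (no header row):
//   [["a", " b "], ["c", " d "], ["e", " f "]]
```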

Another feature present in other parsers is dynamic typing (dynamicTyping in PapaParse; infer in csv-simple-parser), which automatically parses fields into JS types that "make sense"¹.
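
For reference, this is roughly what that looks like in PapaParse, using its real dynamicTyping option (see the next paragraph for why it's out of scope here):

```ts
import Papa from "papaparse";

// dynamicTyping converts numeric and boolean strings into JS
// numbers/booleans; without it, every field stays a string.
const { data } = Papa.parse("id,active\n1,true", {
  header: true,
  dynamicTyping: true,
});
console.log(data); // [{ id: 1, active: true }]
```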

I think we should skip all nice-to-haves at least until we can pass options to imports - otherwise the number of loaders will become unmanageable (delimiter x header x trimWhitespace x dynamicTyping x escapeCharacter x ??? = a lot).

Test Suite

I've found some sets of exhaustive tests we can use / get inspired by:

(will have to check licenses)

I'll implement them as soon as I find time 😅

Footnotes

  1. It might be controversial (it even led to a PapaParse fork: https://www.npmjs.com/package/@simwrapper/papaparse), so it should definitely be opt-in

@A-D-E-A (Contributor) commented Apr 24, 2025

That's awesome!
If we can find a way to use import attributes, it would be even better!
I don't know how, but I know there's one type of import that actually checks the attributes: the SQLite database import. When the app is built as a single-file executable, the embed attribute can be read (https://bun.sh/docs/bundler/executables#embed-sqlite-databases). No matter how hard I tried to understand how it is fetched from the source code, I couldn't find/understand it.

It would be great to have an attribute for using the header, but also attributes for the field and row delimiters. That way, all "csv-like" formats (TSV, Excel CSV with ';', and even "ASCII-delimited files") would work with a single implementation.

Thank you for your work!
