DataFrame.parse() performance issue for wide DF #849

Closed
@Jolanrensen

Description

We noticed this issue especially when parsing DataFrames with lots of String columns, such as a wide CSV file.

If you run DataFrame.parse(), each column is parsed one at a time.

If a column has type String, tryParse goes over each parser in Parsers; if any value in the column cannot be parsed, it moves on to the next parser.

These parsers are ordered like Int -> Long -> Instant -> LocalDateTime -> ... -> Json -> String. This means that for "normal" string columns that need no parsing, at least one cell in the column is put through 17 failed parse attempts. Many of these attempts run inside catchSilent {} blocks, which catch any thrown exception and return null. Those exceptions are expensive: https://www.baeldung.com/java-exceptions-performance.
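A minimal sketch of the pattern described above (an assumption of the shape of catchSilent, not a copy of the library's code) shows the contrast between exception-based and nullable-based parse attempts:

```kotlin
// Sketch of a catchSilent-style helper: swallow any exception, return null.
inline fun <T> catchSilent(body: () -> T): T? =
    try { body() } catch (e: Exception) { null }

fun main() {
    // Exception-based attempt: the parser throws, catchSilent returns null,
    // but the JVM still pays for fillInStackTrace when the exception is built.
    val viaException = catchSilent { java.time.LocalDateTime.parse("hello") }
    // Nullable-based attempt: no exception object is ever created.
    val viaNullable = "hello".toIntOrNull()
    println(viaException == null && viaNullable == null) // true
}
```

Both calls produce null for an unparseable value, but only the first one constructs (and discards) a full exception with a stack trace.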

This is easily measurable by creating two wide dataframes, one with columns that can be parsed to ints and another with cols that cannot be parsed and must remain strings:

[screenshot]

We can see that parsing the wide string DF takes considerably more time.
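The comparison can be reproduced with a rough stdlib-only sketch; the column width, row count, and the naive parser chain below are assumptions for illustration, not the library's actual code:

```kotlin
// "Columns" modeled as plain string lists; parseCell mimics a parser chain
// that falls through exception-throwing parsers before keeping the String.
fun parseCell(s: String): Any =
    s.toIntOrNull()
        ?: s.toLongOrNull()
        ?: runCatching { java.time.Instant.parse(s) }.getOrNull()
        ?: runCatching { java.time.LocalDateTime.parse(s) }.getOrNull()
        ?: s

fun timeMs(block: () -> Unit): Double {
    val start = System.nanoTime()
    block()
    return (System.nanoTime() - start) / 1e6
}

fun main() {
    val width = 100
    val rows = 1_000
    val intCols = List(width) { List(rows) { "42" } }    // parsing succeeds early
    val strCols = List(width) { List(rows) { "hello" } } // every parser fails

    val intTime = timeMs { intCols.forEach { col -> col.forEach { parseCell(it) } } }
    val strTime = timeMs { strCols.forEach { col -> col.forEach { parseCell(it) } } }
    println("int columns: $intTime ms, string columns: $strTime ms")
}
```

On a typical JVM the string columns take far longer, even though every cell ends up unchanged, because each cell pays for several constructed-and-discarded exceptions.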

[screenshot]

And this is mostly due to Instant.parse, toLocalDateTimeOrNull, etc., and, most importantly, all the fillInStackTrace calls at the top of the graph, i.e. the exceptions thrown by the parsers. We might be able to improve this :)

[screenshot]

Looking at the parsers there are some interesting observations and possible solutions:

  • Kotlin's Instant.parse() is a lot slower than Java's. We should use Java's and convert the result with toKotlinInstant().
  • Many parsers are duplicates, like Java's LocalDateTime and Kotlin's. If a String can be parsed as a date-time, Kotlin's parser wins every time; if it cannot, both the Kotlin and the Java parser fail, creating a useless extra exception. We should drop the Java duplicates.
  • Exceptions are heavy. toIntOrNull and toLongOrNull are so fast their time doesn't even show up in the graph. If a library offers a canParse()-style check, we should use it.
  • We should try to parallelize the parsing. Columns don't depend on each other. Parsing is built on convert to, which is built on replace with, so that's where the parallelization should occur. Relevant issue: Parallel computations #723
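For date/time formats, java.time already offers an exception-free path: DateTimeFormatter.parseUnresolved reports failure through a ParsePosition instead of throwing. A sketch of a canParse-style helper built on it (the helper name is hypothetical):

```kotlin
import java.text.ParsePosition
import java.time.format.DateTimeFormatter

// parseUnresolved returns null and sets errorIndex on failure instead of
// throwing, so no exception (and no stack trace) is ever constructed.
fun canParseAsIsoDateTime(s: String): Boolean {
    val pos = ParsePosition(0)
    val result = DateTimeFormatter.ISO_LOCAL_DATE_TIME.parseUnresolved(s, pos)
    // Valid only if parsing succeeded and consumed the whole string.
    return result != null && pos.errorIndex < 0 && pos.index == s.length
}

fun main() {
    println(canParseAsIsoDateTime("2024-01-15T10:30:00")) // true
    println(canParseAsIsoDateTime("hello"))               // false
}
```

Note that parseUnresolved skips field resolution (e.g. it won't reject February 30th), so a cheap check like this would gate, not replace, the full parse.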
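Since columns don't depend on each other, per-column work can be fanned out to a thread pool. A stdlib-only sketch of the idea (the real change would live in the convert to / replace with implementations, which are not reproduced here):

```kotlin
import java.util.concurrent.Executors

// Sketch: parse independent "columns" (string lists) in parallel while
// preserving column order in the result.
fun main() {
    val columns = List(8) { c -> List(1_000) { r -> "${c * r}" } }
    val pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())
    try {
        val parsed = columns
            .map { col -> pool.submit<List<Int?>> { col.map { it.toIntOrNull() } } }
            .map { it.get() } // join in submission order
        println(parsed.size)  // 8
        println(parsed[2][3]) // 6  (column 2, row 3: "6".toIntOrNull())
    } finally {
        pool.shutdown()
    }
}
```

With coroutines on the classpath, the same fan-out could be written with async/awaitAll; the ordering guarantee is the same either way.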

Metadata

Labels

performance: Something related to how fast the library can handle data
