Feat/parquet read options #150 #168

Open

Sao-Ali wants to merge 2 commits into DataHaskell:main from Sao-Ali:feat/parquet-read-options-150

Conversation


@Sao-Ali Sao-Ali commented Feb 26, 2026

Problem: Parquet reads were all-or-nothing. Users could not subset columns at read-time, control timestamp-to-day conversion, or subset rows while loading. This issue also required preserving current behavior for existing callers.

Solution:

  • Introduce ParquetReadOptions (selectedColumns, timestampPolicy, rowRange) plus defaultParquetReadOptions.
  • Add readParquetWithOpts/readParquetFilesWithOpts and keep readParquet/readParquetFiles as default-option wrappers.
  • Wire selectedColumns into decode-time filtering with a fail-fast ColumnNotFoundException for missing requested columns.
  • Wire timestampPolicy with PreserveTimestampPrecision and CoerceTimestampToDay behaviors, including fallback coercion for already-decoded UTCTime columns.
  • Wire rowRange through the reader and apply global rowRange semantics for readParquetFilesWithOpts after concatenation.
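As a rough sketch of the options record the bullets above describe (the field and constructor names come from this PR; the concrete field types here, including String in place of the library's Text, are simplified stand-ins, not the real API):

```haskell
-- Hypothetical sketch of the ParquetReadOptions record from this PR.
-- String stands in for Data.Text; the real row-range type may differ.

data ParquetTimestampPolicy
  = PreserveTimestampPrecision -- keep the file's timestamp precision
  | CoerceTimestampToDay       -- truncate timestamps to a calendar day
  deriving (Eq, Show)

data ParquetReadOptions = ParquetReadOptions
  { selectedColumns :: Maybe [String]    -- Nothing = decode every column
  , timestampPolicy :: ParquetTimestampPolicy
  , rowRange        :: Maybe (Int, Int)  -- (start inclusive, end exclusive)
  } deriving (Eq, Show)

-- Defaults reproduce the pre-existing readParquet behavior, which is
-- how the legacy wrappers stay unchanged for existing callers.
defaultParquetReadOptions :: ParquetReadOptions
defaultParquetReadOptions = ParquetReadOptions
  { selectedColumns = Nothing
  , timestampPolicy = PreserveTimestampPrecision
  , rowRange        = Nothing
  }

main :: IO ()
main = print (rowRange defaultParquetReadOptions)
```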

Tradeoffs and rationale:

  • Chose an options record instead of multiple specialized APIs to keep extension points coherent and avoid API sprawl.
  • Kept legacy conversion wrappers/helpers (applyLogicalType and UTC helpers) to reduce compatibility risk for existing/internal call paths.
  • Read-time projection improves performance by skipping decode of unselected chunks; rowRange currently uses post-read slicing semantics (start inclusive, end exclusive) for correctness and consistency with existing range behavior.
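The post-read slicing semantics (start inclusive, end exclusive) can be sketched on a plain list; the hypothetical sliceRows below stands in for the dataframe's real range function and is not the library's API:

```haskell
-- Stand-in for the post-read row slicing this PR applies:
-- start inclusive, end exclusive, naturally clamped by take/drop.
sliceRows :: (Int, Int) -> [a] -> [a]
sliceRows (start, end) = take (end - start) . drop start

main :: IO ()
main = print (sliceRows (2, 5) [0 .. 9 :: Int]) -- prints [2,3,4]
```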

Verification: added focused Parquet tests for selectedColumns, rowRange, timestampPolicy coercion, and missing-selected-column errors; the full suite passes via cabal test.

…nd row range

Apply formatter-driven layout updates in Parquet read-options code and related tests.

No behavior change; this commit is formatting-only after lint/format checks.

Sao-Ali commented Feb 26, 2026

I tried to implement read options with backward compatibility as a core constraint, so existing functions and the default behavior should remain unchanged while the new feature is added. Let me know if the test cases make sense or if I need more.
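The backward-compatibility constraint boils down to the wrapper pattern: the legacy entry point delegates to the options-taking one with defaults. A self-contained sketch (DataFrame, the reader body, and the signatures here are dummies for illustration, not the library's real definitions):

```haskell
-- Self-contained sketch of the default-option wrapper pattern.
-- Everything here is a stand-in for the real dataframe library.
data ParquetReadOptions = ParquetReadOptions
  { rowRangeOpt :: Maybe (Int, Int) }

defaultParquetReadOptions :: ParquetReadOptions
defaultParquetReadOptions = ParquetReadOptions { rowRangeOpt = Nothing }

newtype DataFrame = DataFrame [Int] deriving (Eq, Show)

-- Dummy reader: the real one decodes the file using the options.
readParquetWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame
readParquetWithOpts _opts _path = pure (DataFrame [])

-- Legacy entry point: a thin wrapper, so existing callers are unchanged.
readParquet :: FilePath -> IO DataFrame
readParquet = readParquetWithOpts defaultParquetReadOptions

main :: IO ()
main = readParquet "example.parquet" >>= print
```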


data ParquetTimestampPolicy
  = PreserveTimestampPrecision
  | CoerceTimestampToDay

Member: This doesn't seem as fundamental. Let's hold off on it.

  deriving (Eq, Show)

data ParquetReadOptions = ParquetReadOptions
  { selectedColumns :: Maybe [T.Text]
Member: Parquet is useful with a predicate as a read option, so that as you're reading a file (or series of files) you can do some filtering:

let x = F.col @(Maybe Text) "x"
let opts = defaultParquetReadOpts
      { predicate = x ./= Nothing .&& (x .<= F.lit (Just 100)) }
D.readParquetWithOpts

This will be extremely useful for reading globs.

    Nothing -> True
    Just selected ->
      let fullPath = T.intercalate "." (map T.pack colPath)
       in colName `S.member` selected || fullPath `S.member` selected
Member: Let's not worry about nested fields for now. The reader doesn't even have a good way to support them.

pure (applyRowRange opts (mconcat dfs))

applyRowRange :: ParquetReadOptions -> DataFrame -> DataFrame
applyRowRange opts df = case rowRange opts of
Member: nit:

fmap (DS.range df) (rowRange opts)

Or use <$>.
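The reviewer's suggested refactor can be sketched with list stand-ins; range and applyRowRange below are simplified assumptions, not the library's real API, and fromMaybe supplies the unsliced frame when no range is set:

```haskell
import Data.Maybe (fromMaybe)

-- List stand-in for the dataframe's range function:
-- start inclusive, end exclusive.
range :: (Int, Int) -> [a] -> [a]
range (start, end) = take (end - start) . drop start

-- The case-of from the PR rewritten in the reviewer's suggested style:
-- fmap the slice over the Maybe range, then fall back to the whole frame.
applyRowRange :: Maybe (Int, Int) -> [a] -> [a]
applyRowRange mRange df = fromMaybe df (fmap (`range` df) mRange)

main :: IO ()
main = print (applyRowRange (Just (1, 3)) [10, 20, 30, 40 :: Int])
```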

TestCase
  ( assertEqual
      "rowRangeWithOpts"
      (D.range (2, 5) allTypes)
Member: This is circular, since if the range function is broken or produces the wrong result this will still pass. It should just test against the expected dimensions.

)
)

missingSelectedColumnWithOpts :: Test
Member: Thanks. This was a good implementation.
