Feat/parquet read options #150 #168

Open

Sao-Ali wants to merge 2 commits into DataHaskell:main from Sao-Ali:feat/parquet-read-options-150

Conversation


@Sao-Ali Sao-Ali commented Feb 26, 2026

Problem: Parquet reads were all-or-nothing. Users could not subset columns at read-time, control timestamp-to-day conversion, or subset rows while loading. This issue also required preserving current behavior for existing callers.

Solution:

  • Introduce ParquetReadOptions (selectedColumns, timestampPolicy, rowRange) plus defaultParquetReadOptions.
  • Add readParquetWithOpts/readParquetFilesWithOpts and keep readParquet/readParquetFiles as default-option wrappers.
  • Wire selectedColumns into decode-time filtering with a fail-fast ColumnNotFoundException for missing requested columns.
  • Wire timestampPolicy with PreserveTimestampPrecision and CoerceTimestampToDay behaviors, including fallback coercion for already-decoded UTCTime columns.
  • Wire rowRange through the reader and apply global rowRange semantics for readParquetFilesWithOpts after concatenation.
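As a rough sketch of the options record the bullets above describe (the field and constructor names come from this PR; the concrete field types here, including String in place of the library's Text, are simplified stand-ins, not the real API):

```haskell
-- Hypothetical sketch of the ParquetReadOptions record from this PR.
-- String stands in for Data.Text; the real row-range type may differ.

data ParquetTimestampPolicy
  = PreserveTimestampPrecision -- keep the file's timestamp precision
  | CoerceTimestampToDay       -- truncate timestamps to a calendar day
  deriving (Eq, Show)

data ParquetReadOptions = ParquetReadOptions
  { selectedColumns :: Maybe [String]    -- Nothing = decode every column
  , timestampPolicy :: ParquetTimestampPolicy
  , rowRange        :: Maybe (Int, Int)  -- (start inclusive, end exclusive)
  } deriving (Eq, Show)

-- Defaults reproduce the pre-existing readParquet behavior, which is
-- how the legacy wrappers stay unchanged for existing callers.
defaultParquetReadOptions :: ParquetReadOptions
defaultParquetReadOptions = ParquetReadOptions
  { selectedColumns = Nothing
  , timestampPolicy = PreserveTimestampPrecision
  , rowRange        = Nothing
  }

main :: IO ()
main = print (rowRange defaultParquetReadOptions)
```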

Tradeoffs and rationale:

  • Chose an options record instead of multiple specialized APIs to keep extension points coherent and avoid API sprawl.
  • Kept legacy conversion wrappers/helpers (applyLogicalType and UTC helpers) to reduce compatibility risk for existing/internal call paths.
  • Read-time projection improves performance by skipping decode of unselected chunks; rowRange currently uses post-read slicing semantics (start inclusive, end exclusive) for correctness and consistency with existing range behavior.
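The post-read slicing semantics (start inclusive, end exclusive) can be sketched on a plain list; the hypothetical sliceRows below stands in for the dataframe's real range function and is not the library's API:

```haskell
-- Stand-in for the post-read row slicing this PR applies:
-- start inclusive, end exclusive, naturally clamped by take/drop.
sliceRows :: (Int, Int) -> [a] -> [a]
sliceRows (start, end) = take (end - start) . drop start

main :: IO ()
main = print (sliceRows (2, 5) [0 .. 9 :: Int]) -- prints [2,3,4]
```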

Verification: added focused Parquet tests for selectedColumns, rowRange, timestampPolicy coercion, and missing-selected-column errors; the full suite passes via cabal test.

…nd row range

Apply formatter-driven layout updates in Parquet read-options code and related tests.

No behavior change; this commit is formatting-only after lint/format checks.

Sao-Ali commented Feb 26, 2026

I tried to implement read options with backward compatibility as a core constraint, so existing functions and the default behavior should remain unchanged while the new feature is added. Let me know if the test cases make sense or if I need more.
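The backward-compatibility constraint boils down to the wrapper pattern: the legacy entry point delegates to the options-taking one with defaults. A self-contained sketch (DataFrame, the reader body, and the signatures here are dummies for illustration, not the library's real definitions):

```haskell
-- Self-contained sketch of the default-option wrapper pattern.
-- Everything here is a stand-in for the real dataframe library.
data ParquetReadOptions = ParquetReadOptions
  { rowRangeOpt :: Maybe (Int, Int) }

defaultParquetReadOptions :: ParquetReadOptions
defaultParquetReadOptions = ParquetReadOptions { rowRangeOpt = Nothing }

newtype DataFrame = DataFrame [Int] deriving (Eq, Show)

-- Dummy reader: the real one decodes the file using the options.
readParquetWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame
readParquetWithOpts _opts _path = pure (DataFrame [])

-- Legacy entry point: a thin wrapper, so existing callers are unchanged.
readParquet :: FilePath -> IO DataFrame
readParquet = readParquetWithOpts defaultParquetReadOptions

main :: IO ()
main = readParquet "example.parquet" >>= print
```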


data ParquetTimestampPolicy
  = PreserveTimestampPrecision
  | CoerceTimestampToDay

Member: This doesn't seem as fundamental. Let's hold off on it.

  deriving (Eq, Show)

data ParquetReadOptions = ParquetReadOptions
  { selectedColumns :: Maybe [T.Text]
Member: Parquet is useful with a predicate as a read option, so that as you're reading a file (or series of files) you can do some filtering:

let x = F.col @(Maybe Text) "x"
let opts = defaultParquetReadOpts
      { predicate = x ./= Nothing .&& (x .<= F.lit (Just 100)) }
D.readParquetWithOpts

This will be extremely useful for reading globs.

    Nothing -> True
    Just selected ->
      let fullPath = T.intercalate "." (map T.pack colPath)
       in colName `S.member` selected || fullPath `S.member` selected
Member: Let's not worry about nested fields for now. The reader doesn't even have a good way to support them.

pure (applyRowRange opts (mconcat dfs))

applyRowRange :: ParquetReadOptions -> DataFrame -> DataFrame
applyRowRange opts df = case rowRange opts of
Member: nit:

fmap (DS.range df) (rowRange opts)

Or use <$>.
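The reviewer's suggested refactor can be sketched with list stand-ins; range and applyRowRange below are simplified assumptions, not the library's real API, and fromMaybe supplies the unsliced frame when no range is set:

```haskell
import Data.Maybe (fromMaybe)

-- List stand-in for the dataframe's range function:
-- start inclusive, end exclusive.
range :: (Int, Int) -> [a] -> [a]
range (start, end) = take (end - start) . drop start

-- The case-of from the PR rewritten in the reviewer's suggested style:
-- fmap the slice over the Maybe range, then fall back to the whole frame.
applyRowRange :: Maybe (Int, Int) -> [a] -> [a]
applyRowRange mRange df = fromMaybe df (fmap (`range` df) mRange)

main :: IO ()
main = print (applyRowRange (Just (1, 3)) [10, 20, 30, 40 :: Int])
```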

TestCase
  ( assertEqual
      "rowRangeWithOpts"
      (D.range (2, 5) allTypes)
Member: This is circular, since if the range function is broken or produces the wrong result this will still pass. It should just test against the expected dimensions.

)
)

missingSelectedColumnWithOpts :: Test
Member: Thanks. This was a good implementation.
