feat: IOFactory and explicit Dataclass usage #1395

NJManganelli · 2025-08-14T20:41:19Z

This PR is a precursor to both re-enabling parquet reading (most importantly for pre-processing data) and column-joining. It exposes and expands the explicit dataclasses and provides an IOFactory that converts between the currently used dictionaries and their explicit dataclass equivalents. preprocess is updated to handle dataclasses explicitly. DatasetJoinableSpec needs to be compatible with preprocess (and with preprocess_parquet, a feature in an upcoming PR), so it is upstreamed from the column-join repo.
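
As a rough illustration of the dictionary-to-dataclass round trip described above, here is a minimal sketch using only the standard library. `ExampleFileSpec`, `dict_to_spec`, and `spec_to_dict` are hypothetical names chosen for this example, and the field names are assumptions; none of this is the IOFactory API added in this PR.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ExampleFileSpec:
    """Hypothetical stand-in for a per-file spec (not a coffea class)."""

    object_path: str
    steps: Optional[List[List[int]]] = None
    num_entries: Optional[int] = None


def dict_to_spec(d: dict) -> ExampleFileSpec:
    # Unknown keys raise TypeError, which surfaces malformed input early.
    return ExampleFileSpec(**d)


def spec_to_dict(spec: ExampleFileSpec) -> dict:
    # asdict returns a plain dict built from the dataclass fields.
    return asdict(spec)


legacy = {
    "object_path": "Events",
    "steps": [[0, 100], [100, 200]],
    "num_entries": 200,
}
spec = dict_to_spec(legacy)
round_tripped = spec_to_dict(spec)
```

The dictionary form stays usable by existing code, while the dataclass form gives attribute access and a fixed set of fields.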

These dataclasses are especially crucial for column-joining. Creating a joinable dataset spec becomes complicated when one must combine two disparate datasets (with different forms, if one wishes to understand their structure and determine the necessary columns automatically) AND a non-trivial join specification (which tells column-joining how to handle the inputs to deliver a joined dataset to the user). That spec needs to be preprocessed (separately for the root and parquet sub-datasets to be joined) and then recomposed into an explicit, joinable version of the joinable spec.

We might be able to use these in the future to simplify some code (at the expense of this IOFactory code and the dual paths in the preprocess function) by always coercing user inputs to dataclasses (and converting back where appropriate).
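
The "always coerce, then convert back" idea could look roughly like the sketch below: a function accepts either form, works on the dataclass internally, and returns the same form it was given. `ExampleDatasetSpec` and `process_example` are hypothetical names for illustration only, and the `"checked"` metadata key is an invented placeholder for whatever real work the function would do.

```python
from dataclasses import dataclass, asdict
from typing import Union


@dataclass
class ExampleDatasetSpec:
    """Hypothetical dataset spec (not a coffea class)."""

    files: dict  # assumed layout: filename -> object path
    metadata: dict


def process_example(dataset: Union[dict, ExampleDatasetSpec]):
    # Remember the input form so the output can mirror it.
    was_dict = isinstance(dataset, dict)
    spec = ExampleDatasetSpec(**dataset) if was_dict else dataset

    # Single internal code path that only ever sees the dataclass.
    updated = ExampleDatasetSpec(
        files=dict(spec.files),
        metadata={**spec.metadata, "checked": True},
    )

    return asdict(updated) if was_dict else updated


out_dict = process_example({"files": {"a.root": "Events"}, "metadata": {}})
out_spec = process_example(
    ExampleDatasetSpec(files={"a.root": "Events"}, metadata={})
)
```

With this shape, only the entry and exit of the function know about dictionaries, which is the simplification the paragraph above anticipates.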

This could still use some tests and examples in the docs; hopefully Copilot can provide these.

Serialization of these is not yet considered. It may require enabling some custom methods (which might suggest pydantic instead?), but as-is they can always be converted to JSON-friendly dictionaries before serialization and back to dataclasses after deserialization.
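
A minimal sketch of that dictionary-based route, assuming standard-library dataclasses with nested specs. The class names and the `from_dict` helper are hypothetical and only show the pattern, not the code in this PR.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExampleFileSpec:
    """Hypothetical per-file spec (not a coffea class)."""

    object_path: str
    num_entries: int


@dataclass
class ExampleDatasetSpec:
    """Hypothetical dataset spec holding nested file specs."""

    files: dict  # filename -> ExampleFileSpec

    @classmethod
    def from_dict(cls, d: dict) -> "ExampleDatasetSpec":
        # Nested dataclasses must be rebuilt by hand after json.loads.
        files = {name: ExampleFileSpec(**f) for name, f in d["files"].items()}
        return cls(files=files)


spec = ExampleDatasetSpec(files={"a.root": ExampleFileSpec("Events", 200)})

# asdict recurses into the nested dataclasses, giving JSON-friendly dicts.
text = json.dumps(asdict(spec))
restored = ExampleDatasetSpec.from_dict(json.loads(text))
```

The manual rebuild in `from_dict` is the kind of custom method the note above refers to; a validation library such as pydantic would typically handle that step itself.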

Copilot

Pull Request Overview

This PR introduces IOFactory and explicit dataclass structures to handle dataset specifications, transitioning from dictionary-based representations to strongly-typed dataclasses. This is a preparatory change for enabling parquet reading and column-joining functionality.

Adds new dataclass structures for different file formats (ParquetFileSpec, CoffeaParquetFileSpec variants)
Introduces IOFactory class to convert between dictionaries and dataclass representations
Updates preprocess function to handle both dictionary and dataclass inputs

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
src/coffea/dataset_tools/preprocess.py	Adds new dataclass definitions, IOFactory class, and updates preprocess function to handle dataclass inputs
src/coffea/dataset_tools/__init__.py	Exports new dataclass types and IOFactory for public API

src/coffea/dataset_tools/preprocess.py

src/coffea/dataset_tools/__init__.py

Nick Manganelli added 5 commits August 14, 2025 12:26

Expand explicit dataclasses as precursor for re-enabling parquet and column-joining

7800a7e

Expose dataclasses

bc9c723

Add IOFactory for dataclass to dict conversions, DatasetJoinableSpec as precursor to column-joining, with a hard-requirement that a form is stored and can be decoded with checking

86b60fc

Expose IOFactory and DatasetJoinableSpec

6589f0a

Thread DatasetSpecs through preprocess function for handling, so users can construct and utilize them directly

28b2eae

NJManganelli requested a review from Copilot August 14, 2025 20:41

Copilot AI reviewed Aug 14, 2025

NJManganelli and others added 4 commits August 14, 2025 15:48

Update src/coffea/dataset_tools/__init__.py

14ee837

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/coffea/dataset_tools/__init__.py

cdee886

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/coffea/dataset_tools/preprocess.py

5534230

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Error when filespec_to_dict receives an incompatible UprootFileSpec or ParquetFileSpec

02783ef

NJManganelli changed the title ~~[FEAT] IOFactory and explicit Dataclass usage~~ feat: IOFactory and explicit Dataclass usage Aug 15, 2025
