feat: IOFactory and explicit Dataclass usage #1395

Draft: wants to merge 9 commits into base: master

Collaborator

@NJManganelli NJManganelli commented Aug 14, 2025

This PR is a precursor to both re-enabling parquet reading (most importantly for pre-processing data) and column-joining. It exposes and expands the explicit dataclasses and provides an IOFactory that converts between the currently used dictionaries and their explicit dataclass equivalents. preprocess is updated to handle dataclasses explicitly. DatasetJoinableSpec needs to be compatible with preprocess (and with preprocess_parquet, a feature in an upcoming PR), so it is upstreamed from the column-join repo.
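To make the dictionary-to-dataclass conversion concrete, here is a minimal sketch of the idea using only the standard library. Every name in it (FileSpecSketch, DatasetSpecSketch, IOFactorySketch and its two methods) is hypothetical and chosen for illustration; it is not the IOFactory API added by this PR, and the dictionary layout is only loosely modeled on a fileset entry.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class FileSpecSketch:
    # Hypothetical stand-in for one of the file-spec dataclasses.
    object_path: str
    steps: Optional[list] = None


@dataclass
class DatasetSpecSketch:
    # Hypothetical stand-in for a dataset-level spec.
    files: dict
    metadata: dict = field(default_factory=dict)


class IOFactorySketch:
    """Illustrative converter only; not the IOFactory from this PR."""

    @staticmethod
    def dict_to_datasetspec(raw: dict) -> DatasetSpecSketch:
        # Build one typed file spec per entry of the plain "files" mapping.
        files = {name: FileSpecSketch(**spec) for name, spec in raw["files"].items()}
        return DatasetSpecSketch(files=files, metadata=dict(raw.get("metadata", {})))

    @staticmethod
    def datasetspec_to_dict(spec: DatasetSpecSketch) -> dict:
        # asdict recurses into nested dataclasses, lists, and dict values.
        return asdict(spec)


raw = {
    "files": {"a.root": {"object_path": "Events", "steps": [[0, 100], [100, 200]]}},
    "metadata": {"year": 2022},
}
spec = IOFactorySketch.dict_to_datasetspec(raw)
roundtrip = IOFactorySketch.datasetspec_to_dict(spec)
```

In this sketch the dictionary form survives a round trip unchanged, which is the property that would let existing dictionary-based user code keep working alongside the dataclasses.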

These dataclasses are especially crucial for column-joining, where creating a joinable dataset spec becomes complicated. One must combine two disparate datasets (with different forms, if one wishes to understand their structure and determine the necessary columns automatically) and a non-trivial join specification, which tells column-joining how to handle the inputs to deliver a joined dataset to the user. That spec then needs to be preprocessed (separately for the root and parquet sub-datasets to be joined) and recomposed into a joinable, explicit version of the joinable spec.
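The "preprocess each sub-dataset by format, then recompose" flow could look roughly like the sketch below. It is an assumption-laden illustration: SubDatasetSketch, JoinableSketch, preprocess_joinable, and mark_done are all hypothetical names, the join specification is kept as an opaque dictionary, and none of this reflects the actual DatasetJoinableSpec fields.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass(frozen=True)
class SubDatasetSketch:
    # Hypothetical sub-dataset: a format tag plus its file list.
    fmt: str  # "root" or "parquet"
    files: tuple
    preprocessed: bool = False


@dataclass(frozen=True)
class JoinableSketch:
    # Hypothetical joinable spec: named sub-datasets plus an opaque join spec.
    datasets: dict
    join_spec: dict


def preprocess_joinable(spec: JoinableSketch, handlers: dict) -> JoinableSketch:
    """Send each sub-dataset to the handler for its format, then recompose."""
    done = {name: handlers[sub.fmt](sub) for name, sub in spec.datasets.items()}
    # replace() returns a new spec and carries the join spec over untouched.
    return replace(spec, datasets=done)


def mark_done(sub: SubDatasetSketch) -> SubDatasetSketch:
    # Placeholder step; real preprocessing would inspect the files.
    return replace(sub, preprocessed=True)


spec = JoinableSketch(
    datasets={
        "primary": SubDatasetSketch("root", ("a.root",)),
        "friend": SubDatasetSketch("parquet", ("a.parquet",)),
    },
    join_spec={"how": "placeholder"},
)
joined = preprocess_joinable(spec, {"root": mark_done, "parquet": mark_done})
```

Keeping the sub-datasets as separate typed objects is what makes the per-format dispatch a one-liner here; with nested dictionaries the same step would need to guess the format from the dictionary's shape.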

We might be able to use these in the future to simplify some code (at the expense of this IOFactory code and the dual paths in the preprocess function) by always coercing user inputs to dataclasses (and converting back where appropriate).
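A rough sketch of that "coerce on entry, convert back on exit" pattern follows. SpecSketch and process_sketch are hypothetical names, and the made-up num_entries value only stands in for real work; this is not how preprocess is implemented in this PR.

```python
from dataclasses import asdict, dataclass, replace
from typing import Any, Union


@dataclass(frozen=True)
class SpecSketch:
    # Hypothetical spec; the field names are assumptions for illustration.
    object_path: str
    num_entries: int = 0


def process_sketch(spec: Union[SpecSketch, dict]) -> Union[SpecSketch, dict]:
    """Single internal code path that works on the dataclass form only."""
    was_dict = isinstance(spec, dict)
    typed = SpecSketch(**spec) if was_dict else spec
    # Placeholder for real work: fill in a made-up entry count.
    result = replace(typed, num_entries=1000)
    # Hand back the same kind of object the caller passed in.
    return asdict(result) if was_dict else result


from_dict = process_sketch({"object_path": "Events"})
from_spec = process_sketch(SpecSketch("Events"))
```

The trade-off is the one described above: the body of the function has a single path, while the type checks and conversions move to the boundary.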

This still needs some tests and examples in the docs; hopefully Copilot can provide these.

Serialization of these is not yet considered. It may require some custom methods (which might suggest using pydantic instead?), but as-is they can always be converted to JSON-friendly dictionaries for serialization and converted back after deserialization.
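As a small illustration of the dictionary route, the sketch below round-trips a dataclass through JSON with only the standard library. StepSpecSketch is a hypothetical class, not one of the dataclasses from this PR.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepSpecSketch:
    # Hypothetical spec; uses only JSON-friendly field types (str, list, int).
    object_path: str
    steps: list = field(default_factory=list)


spec = StepSpecSketch("Events", [[0, 50], [50, 100]])

# Serialize: dataclass -> plain dict -> JSON text.
text = json.dumps(asdict(spec), sort_keys=True)

# Deserialize: JSON text -> plain dict -> dataclass.
restored = StepSpecSketch(**json.loads(text))
```

This only works cleanly while every field is JSON-friendly. Tuples, for example, come back as lists, and nested dataclasses come back as plain dictionaries that need explicit reconstruction, which is where custom methods or a library such as pydantic could come in.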

Nick Manganelli added 5 commits August 14, 2025 12:26
@NJManganelli NJManganelli requested a review from Copilot August 14, 2025 20:41
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces IOFactory and explicit dataclass structures to handle dataset specifications, transitioning from dictionary-based representations to strongly-typed dataclasses. This is a preparatory change for enabling parquet reading and column-joining functionality.

  • Adds new dataclass structures for different file formats (ParquetFileSpec, CoffeaParquetFileSpec variants)
  • Introduces IOFactory class to convert between dictionaries and dataclass representations
  • Updates preprocess function to handle both dictionary and dataclass inputs

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `src/coffea/dataset_tools/preprocess.py` | Adds new dataclass definitions, IOFactory class, and updates preprocess function to handle dataclass inputs |
| `src/coffea/dataset_tools/__init__.py` | Exports new dataclass types and IOFactory for public API |

NJManganelli and others added 4 commits August 14, 2025 15:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@NJManganelli NJManganelli changed the title [FEAT] IOFactory and explicit Dataclass usage feat: IOFactory and explicit Dataclass usage Aug 15, 2025