feat: IOFactory and explicit Dataclass usage #1395
base: master
…as precursor to column-joining, with a hard-requirement that a form is stored and can be decoded with checking
…s can construct and utilize them directly
Pull Request Overview
This PR introduces IOFactory and explicit dataclass structures to handle dataset specifications, transitioning from dictionary-based representations to strongly-typed dataclasses. This is a preparatory change for enabling parquet reading and column-joining functionality.
- Adds new dataclass structures for different file formats (ParquetFileSpec, CoffeaParquetFileSpec variants)
- Introduces IOFactory class to convert between dictionaries and dataclass representations
- Updates preprocess function to handle both dictionary and dataclass inputs
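The dictionary-to-dataclass conversion described above can be illustrated with a minimal sketch. The names `FileSpecSketch` and `FactorySketch` and the fields `object_path` and `steps` are hypothetical stand-ins chosen for illustration; they are not the PR's actual class definitions or the real `IOFactory` method names.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class FileSpecSketch:
    """Hypothetical per-file spec; not the PR's actual dataclass."""

    object_path: str
    steps: Optional[list] = None


class FactorySketch:
    """Hypothetical converter between plain dicts and FileSpecSketch."""

    @staticmethod
    def dict_to_spec(d: dict[str, Any]) -> FileSpecSketch:
        # Unexpected keys raise TypeError, so malformed input fails early.
        return FileSpecSketch(**d)

    @staticmethod
    def spec_to_dict(spec: FileSpecSketch) -> dict[str, Any]:
        # asdict copies nested containers, so the result is independent of spec.
        return asdict(spec)


raw = {"object_path": "Events", "steps": [[0, 100], [100, 200]]}
spec = FactorySketch.dict_to_spec(raw)
round_tripped = FactorySketch.spec_to_dict(spec)
```

In this sketch the round trip returns a dictionary equal to the input, while the typed intermediate rejects unknown keys at construction time.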
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
src/coffea/dataset_tools/preprocess.py | Adds new dataclass definitions, IOFactory class, and updates preprocess function to handle dataclass inputs |
src/coffea/dataset_tools/__init__.py | Exports new dataclass types and IOFactory for public API |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…r ParquetFileSpec
This PR is a precursor to both re-enabling parquet reading (most importantly for pre-processing data) and column-joining. It exposes and expands the explicit dataclasses, and provides an `IOFactory` which can convert between the currently-used dictionaries and their explicit dataclass versions. `preprocess` is updated to handle dataclasses explicitly. `DatasetJoinableSpec` needs to be compatible with `preprocess` (and with `preprocess_parquet`, a feature of an upcoming PR), so it is upstreamed from the `column-join` repo.

These dataclasses are especially crucial for column-joining. Creating a joinable dataset spec becomes complicated when one must combine two disparate datasets (with different forms, if one wishes to understand their structure and determine the necessary columns automatically) and a non-trivial join specification (which tells column-joining how to handle the inputs to deliver a joined dataset to the user). That spec then needs to be preprocessed (separately for the root and parquet sub-datasets to be joined) and recomposed into an explicit version of the joinable spec.
We might be able to utilize these in the future to simplify some code (at the expense of this `IOFactory` code and the dual paths in the `preprocess` function) by always coercing user inputs to dataclasses (and converting back, where appropriate).

Still could use some tests and examples in the docs; hopefully Copilot can provide these.
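As a rough illustration of the "always coerce, then convert back" idea, the sketch below normalizes either input type to a dataclass at the function boundary and returns the same type the caller passed in. `FileSpecSketch`, `coerce_sketch`, and `preprocess_sketch` are hypothetical names, and filling in `num_entries` is only a placeholder for real preprocessing work; none of this is the PR's actual `preprocess` implementation.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, replace
from typing import Any, Optional, Union


@dataclass
class FileSpecSketch:
    """Hypothetical per-file spec; not the PR's actual dataclass."""

    object_path: str
    num_entries: Optional[int] = None


def coerce_sketch(spec: Union[dict, FileSpecSketch]) -> FileSpecSketch:
    """Normalize a dict or dataclass input to the dataclass form."""
    if isinstance(spec, FileSpecSketch):
        return spec
    return FileSpecSketch(**spec)


def preprocess_sketch(spec: Union[dict, FileSpecSketch], num_entries: int) -> Any:
    """Single typed code path; the output mirrors the input type."""
    was_dict = isinstance(spec, dict)
    typed = coerce_sketch(spec)
    # Placeholder for real work: record a caller-supplied entry count.
    result = replace(typed, num_entries=num_entries)
    return asdict(result) if was_dict else result


from_dict = preprocess_sketch({"object_path": "Events"}, num_entries=100)
from_spec = preprocess_sketch(FileSpecSketch("Events"), num_entries=100)
```

With this shape, the body of the function only ever sees the dataclass, so the dual paths collapse into one coercion step on entry and one conversion step on exit.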
Serialization of these is not yet considered and may require some custom methods to be enabled (which might suggest pydantic instead?), but as-is they can always be converted to JSON-friendly dictionaries, and converted back from those dictionaries after deserialization.
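One way to picture the dictionary-based JSON round trip is with the standard library alone. This is a sketch under assumptions: `FileSpecSketch`, `DatasetSpecSketch`, and the `from_dict` helper are hypothetical, and the PR's dataclasses may be converted differently (for example through `IOFactory`).

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class FileSpecSketch:
    """Hypothetical per-file spec; not the PR's actual dataclass."""

    object_path: str
    steps: list = field(default_factory=list)


@dataclass
class DatasetSpecSketch:
    """Hypothetical dataset spec mapping file names to FileSpecSketch."""

    files: dict

    @classmethod
    def from_dict(cls, d: dict) -> "DatasetSpecSketch":
        # Rebuild the nested dataclasses by hand; asdict flattened them to dicts.
        return cls(
            files={name: FileSpecSketch(**f) for name, f in d["files"].items()}
        )


dataset = DatasetSpecSketch(files={"a.root": FileSpecSketch("Events", [[0, 10]])})
text = json.dumps(asdict(dataset))  # dataclass -> dict -> JSON string
restored = DatasetSpecSketch.from_dict(json.loads(text))  # JSON -> dict -> dataclass
```

The manual `from_dict` step is the part that a library such as pydantic would otherwise automate, which is the trade-off the paragraph above alludes to.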