@NJManganelli
Collaborator

This PR is a precursor to both parquet reading being re-enabled (pre-processing parquet data, most importantly) and column-joining (where dataset specifications become very complicated and hard-to-construct/validate by hand). This exposes and expands the explicit Dataclasses using Pydantic models. preprocess is updated to handle these dataclasses explicitly. Column-joining imposes some requirements on the status of a dataset for automatic column-determination, so a joinable() method is introduced to check that status after preprocessing, though some followup work may be generated by propagating the pydantic models through all the followup PRs.

Sidenote: These dataclasses are especially crucial for column-joining. Creating a joinable dataset spec becomes complicated when one must combine two disparate datasets (with different forms, if one wishes to understand their structure and determine necessary columns automatically) AND a non-trivial join specification (which tells column-joining how to handle the inputs to deliver a joined dataset to the user). That spec needs to be preprocessed (separately for the root and parquet sub-datasets to be joined) and then recomposed into a joinable (i.e. non-Optional FileSpec) version of the spec.

For now, most of the simpler functions interacting with datasets have been updated with dual-path code to simultaneously support legacy pure-dictionary inputs and the pydantic models. This includes e.g. apply_to_(file|data)set, slice_(chunks|files), preprocess, etc. An example notebook including usage is introduced. In the future, once these are thoroughly tested in the wild, the legacy dictionaries can be removed from code paths, since the Pydantic models already handle conversions trivially.
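
As a rough illustration of the dual-path idea, a function can normalize its input up front: a legacy dictionary is validated into the model, while a model instance passes through unchanged. The `ToyFileSpec` and `ToyDatasetSpec` models and the `count_files` helper below are hypothetical stand-ins, not the classes or functions in this PR.

```python
from typing import Optional, Union

from pydantic import BaseModel


class ToyFileSpec(BaseModel):
    """Hypothetical stand-in for a per-file spec (not coffea's real class)."""

    object_path: str
    steps: Optional[list[list[int]]] = None


class ToyDatasetSpec(BaseModel):
    """Hypothetical stand-in for a dataset spec: file name -> file spec."""

    files: dict[str, ToyFileSpec]


def count_files(dataset: Union[dict, ToyDatasetSpec]) -> int:
    """Accept either a legacy dictionary or the model (dual-path input)."""
    if isinstance(dataset, dict):
        # Legacy path: validate and convert the plain dictionary once, up front.
        dataset = ToyDatasetSpec.model_validate(dataset)
    return len(dataset.files)


legacy = {"files": {"a.root": {"object_path": "Events"}}}
model = ToyDatasetSpec(files={"a.root": ToyFileSpec(object_path="Events")})
print(count_files(legacy), count_files(model))
```

Once the dictionary path is retired, only the `isinstance` branch would need to be deleted in a function shaped like this.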

A followup PR will introduce the preprocess_parquet function. Another will be needed to handle threading the classes through the runner (where there's an open question of whether to harmonize the runner and preprocess+apply_to_fileset interfaces first). Another is expected to finish the changes for processing parquet again.

Tests should cover the overwhelming majority of cases. They were originally generated by Copilot, with heavy editing, expansion, and corrections. The test code is not concise, unfortunately.

Pydantic automatically buys us serialization, as the models can be converted to plain dictionaries via each model's .model_dump() function (or directly to a JSON string via .model_dump_json()) and saved as json.
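
A minimal sketch of that round trip, assuming Pydantic v2. `ToyFileSpec` is a hypothetical stand-in with made-up fields, not one of the models introduced in filespec.py.

```python
import json
from typing import Optional

from pydantic import BaseModel


class ToyFileSpec(BaseModel):
    """Hypothetical stand-in, not coffea's real file spec."""

    object_path: str
    num_entries: Optional[int] = None


spec = ToyFileSpec(object_path="Events", num_entries=100)

as_dict = spec.model_dump()       # plain Python dictionary
as_json = spec.model_dump_json()  # JSON string, ready to write to disk

# Round trip: rebuild the model from the JSON string and compare.
restored = ToyFileSpec.model_validate_json(as_json)
assert restored == spec
assert json.loads(as_json) == as_dict
```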

This PR is an alternative to #1395

NOTE: considering the huge number of commits, squash and merge may be desired instead...

@NJManganelli
Collaborator Author

@lgray @nsmith- @ikrommyd This is ready for a fresh set of eyes, particularly for anything I may have missed/overlooked so far.

Keep in mind that support for runner and preprocess_parquet needs to come in followup PRs (the latter will need to be converted for pydantic; the former I have not considered yet. Do we want to (semi-)unify the runner and preprocess+apply_to_fileset interfaces first?)

@NJManganelli NJManganelli force-pushed the parquet-precursor-pydantic-datafactory branch from 581245f to 187113f Compare August 20, 2025 00:47
@NJManganelli NJManganelli requested a review from Copilot August 20, 2025 01:51
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces Pydantic-based dataclasses to replace dictionary-based dataset specifications, providing type safety, validation, and serialization capabilities for Coffea's dataset processing workflow. This is a foundational change to support upcoming parquet reading and column-joining features.

Key changes:

  • Introduces comprehensive Pydantic models for file specifications (UprootFileSpec, ParquetFileSpec, and their Coffea variants)
  • Adds DatasetSpec and FilesetSpec classes with validation and serialization
  • Updates preprocessing and manipulation functions to support both legacy dictionaries and new Pydantic models
  • Provides dual-path compatibility for gradual migration

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/coffea/dataset_tools/filespec.py | Core Pydantic models defining file and dataset specifications with validation |
| tests/test_dataset_tools_filespec.py | Comprehensive test suite for all new Pydantic models and validation logic |
| src/coffea/dataset_tools/preprocess.py | Updated preprocessing to handle both dict and Pydantic model inputs |
| src/coffea/dataset_tools/manipulations.py | Modified manipulation functions for dual-path compatibility |
| tests/test_dataset_tools.py | Extended existing tests to validate both dict and Pydantic model workflows |
| src/coffea/dataset_tools/apply_processor.py | Updated processor application to handle new model types |
| src/coffea/dataset_tools/__init__.py | Added exports for new Pydantic classes |
| pyproject.toml | Added Pydantic dependency |
| docs/source/examples.rst | Added reference to new filespec notebook |

@NJManganelli
Collaborator Author

Ping @lgray

@lgray
Collaborator

lgray commented Aug 27, 2025

@NJManganelli looking at the volume of code that's been produced I assume this is AI generated for the most part? Not that it matters, just understanding how to approach it.

@NJManganelli
Collaborator Author

The core code (non-tests/notebooks code, in particular filespec.py) is 100% human-written (w/ occasional copilot suggestions/fixes, but it was the pydantic conversion of non-assisted code I wrote before).

For the tests, it was a mix, let's call it 2/3rds Copilot-initiated and 1/3rd by-hand (edits + consolidating via parametrization for the copilot tests, and updating older tests to explicitly parametrize Pydantic variation).

Notebook is almost entirely Copilot, with some edits for style, presentation, and corrections.

Hopefully that helps in what/how to review them respectively

@lgray
Collaborator

lgray commented Aug 27, 2025

An initial comment, looking through the user-facing parts: in filespec.ipynb, the very last tutorial just stops without explaining everything it sets out to do. The rest of the tutorials seem to cover things well.

@lgray
Collaborator

lgray commented Aug 27, 2025

Aside from that one additional comment and some pondering, I think this looks reasonable.

I appreciate the automation of writing core functionality unit tests.

@ikrommyd anything that catches your attention?

@lgray lgray changed the title feat: Introduce pydantic dataclasses feat: Make preprocess rigorous with IOFactory and pydantic dataclasses Aug 27, 2025
@lgray
Collaborator

lgray commented Aug 27, 2025

As for unifying the actual processing and running interfaces:

My idea around apply_to_fileset in the old-style processor-based framework is starting to settle towards it producing a generator of all the futures, much like apply_to_fileset does in the dask case.

Dask vs. non-dask can be driven by a boolean flag passed to apply_to_fileset, which should be the same as the mode flag for nanoevents and start out unset.

Then we introduce one new function coffea.compute(computable, runner=None, **dask_kwargs), where computable is a dask collection or the future container that we produce from apply_to_fileset in the non-dask case. It then returns the result of dask.compute on the dask collection or it returns the accumulated result of running the given futures.
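
To make the proposed dispatch concrete, here is a loose sketch. Everything in it is an assumption layered on the description above: `compute` is written as a free function rather than an existing coffea API, dask collections are recognized by their `__dask_graph__` attribute, `runner` is accepted but unused, and results are accumulated with `+` as a stand-in for coffea's accumulation.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def compute(computable, runner=None, **dask_kwargs):
    """Hypothetical dispatcher: dask collection -> dask.compute, futures -> accumulate."""
    if hasattr(computable, "__dask_graph__"):
        # Dask path; imported lazily so the futures path works without dask.
        import dask

        # dask.compute returns a tuple with one entry per argument.
        (result,) = dask.compute(computable, **dask_kwargs)
        return result
    # Non-dask path: `computable` is an iterable of futures whose results are
    # merged with `+` (an assumption standing in for coffea's accumulation).
    # `runner` only mirrors the proposed signature and is unused in this sketch.
    return reduce(add, (future.result() for future in computable))


def futures_for(chunks, executor):
    """Yield one future per chunk, mimicking a generator of futures."""
    for chunk in chunks:
        yield executor.submit(sum, chunk)


with ThreadPoolExecutor(max_workers=2) as pool:
    total = compute(futures_for([[1, 2], [3, 4], [5]], pool))
print(total)  # 15
```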

@NJManganelli
Collaborator Author

An initial comment, looking through the user-facing parts: in filespec.ipynb, the very last tutorial just stops without explaining everything it sets out to do. The rest of the tutorials seem to cover things well.

Yeah, I deleted Copilot's suggestion for the IOFactory, partially because it was lying, and partially because, while that class was the user-facing gateway to using the pure dataclass implementation (and Copilot treated it as such), Pydantic automates so much. For example, I had a function which would recurse down and convert the dataclass to a pure dictionary when the right option was set, or would otherwise only 'unwrap' the top-most layer and give you a dictionary with possibly-dataclass values. With pydantic, these are respectively accomplished by just calling AModel.model_dump() and dict(AModel). I was starting to think the IOFactory might be removed completely, but it's also the base class for the column-join IOJoinFactory, and that has a fair bit more work to do, and I haven't yet propagated this all through to see if that can be eliminated/converted to standalone functions.
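
The difference between the two calls can be seen in a small sketch, assuming Pydantic v2. The `Inner` and `Outer` models are hypothetical and exist only to show nesting.

```python
from pydantic import BaseModel


class Inner(BaseModel):
    object_path: str


class Outer(BaseModel):
    name: str
    inner: Inner


outer = Outer(name="ds", inner=Inner(object_path="Events"))

deep = outer.model_dump()  # recursive: nested models become dictionaries
shallow = dict(outer)      # top layer only: nested values stay model instances

assert deep == {"name": "ds", "inner": {"object_path": "Events"}}
assert isinstance(shallow["inner"], Inner)
```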

In short, I was thinking of not advertising the IOFactory much anymore for users, in case it makes more sense to break into standalone functions.

But I'll write an example for the few use-cases it still has

Let me also add the .pq support and consolidate the joinable/check code.

For the magic-byte checking, lemme have a look into that in tandem with the preprocess_parquet PR. If I get that, I'll add it for both preprocess and preprocess_parquet. Maybe I can also upstream some of the functionality from join_preprocess that's in column-join to coffea, a third of which is just dispatching root/parquet preprocessing to the appropriate function variation and re-zipping the fileset together at the end. That would let us just unify into a single user-facing function too.

@lgray
Collaborator

lgray commented Aug 28, 2025

Ah - OK if it's not user facing then we shouldn't talk about it unless we have to. :-) It just sorta stuck out to me!

For the magic-byte checking I started thinking too much about the right way to do it. The end of that thinking on my side is:

  • I don't think we need to check the magic bytes, uproot and pyarrow.parquet do this for us
  • what we should do is protect the user against feeding incomprehensible data to the system, and also tell the user when a file is mislabeled.

i.e. add in some logic so that if a file doesn't open, we check whether a root file is actually parquet (and vice versa). If it's not either valid input format, then we mark the file as bad; otherwise we open the file as what it really is, rather than what the filename suggested, and make a note that it has the wrong postfix.

This is more a creature comfort than anything else, consider it low priority.
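
The fallback described above could be sketched roughly as follows. All names here (`FormatCheck`, `guess_format`, `check_file`, the `openers` mapping) are hypothetical. The openers are injected as plain callables so the sketch stays library-free; in practice they would wrap uproot and pyarrow.parquet, which validate file contents themselves. Treating ".pq" as a parquet postfix is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Expected file-name postfixes per format (".pq" is included as an assumption).
POSTFIXES = {"root": (".root",), "parquet": (".parquet", ".pq")}


@dataclass
class FormatCheck:
    path: str
    actual_format: Optional[str]  # None means the file is bad
    note: str = ""


def guess_format(path: str) -> Optional[str]:
    """Guess the format from the file-name postfix alone."""
    for fmt, postfixes in POSTFIXES.items():
        if path.endswith(postfixes):
            return fmt
    return None


def check_file(path: str, openers: dict[str, Callable[[str], object]]) -> FormatCheck:
    """Try the guessed format first, then the others; mark the file bad if all fail."""
    guess = guess_format(path)
    candidates = ([guess] if guess else []) + [f for f in openers if f != guess]
    for fmt in candidates:
        try:
            openers[fmt](path)  # each opener raises if it cannot read the file
        except Exception:
            continue
        note = "" if fmt == guess else f"opened as {fmt}, but postfix suggests {guess}"
        return FormatCheck(path, fmt, note)
    return FormatCheck(path, None, "not a valid root or parquet file")


def fake_root_opener(path):
    raise OSError("not a ROOT file")  # stand-in for a failing uproot open


def fake_parquet_opener(path):
    return object()  # stand-in for a successful parquet open


result = check_file(
    "mislabeled.root", {"root": fake_root_opener, "parquet": fake_parquet_opener}
)
print(result.actual_format, "|", result.note)
```

Whether a mislabeled file produces a note, as here, or a hard error is a policy choice that this sketch leaves to the caller.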

@NJManganelli
Collaborator Author

Sounds appropriate to me, do you think the format checking should be exclusively in e.g. preprocess, or both preprocess and apply_to_*set?

@lgray
Collaborator

lgray commented Aug 28, 2025

That's a good question.

My first instinct is that once it's in apply_to_*set we should error rather than politely inform, since the user should be providing that level of processing with sanitized inputs of known-to-be-good files when it's at scale. When it's just messing around (like process/threads executors) we should allow globs; the jobs are typically fast enough at that scale that time is not really wasted.

This is just the beginning of an idea though.

Nick Manganelli and others added 12 commits August 29, 2025 12:32
…as precursor to column-joining, with a hard-requirement that a form is stored and can be decoded with checking
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 25 comments.

@NJManganelli
Collaborator Author

@lgray @nsmith- I am done, barring new change requests.

@nsmith- nsmith- self-requested a review November 18, 2025 03:56
@NJManganelli
Collaborator Author

@nsmith- All resolved I think

Member

@nsmith- nsmith- left a comment

🚀

@ikrommyd
Collaborator

Let me just check the docs preview and it's good from my side if that's fine

@ikrommyd
Collaborator

Should be good now!

@NJManganelli NJManganelli merged commit 316ec82 into scikit-hep:master Nov 19, 2025
23 checks passed
4 participants