@MoAly98 (Collaborator) commented on Oct 23, 2025

Overview

This PR refactors the skimming and metadata extraction subsystems to improve code maintainability, discoverability, and user experience. The changes focus on clearer naming, better documentation, logical code organization, and robust S3 storage support for distributed processing.

Key Improvements

1. Naming & Documentation

  • Replaced vague function names with clear, action-oriented names
  • Enhanced docstrings with comprehensive parameter/return documentation
  • Added module-level constants to eliminate magic strings
  • Improved inline comments for complex logic

2. Code Organization

  • Reorganized utils/skimming.py (1094 lines) into 7 logical sections with clear headers
  • Moved main entry point to end of file for better discoverability
  • Grouped related helper functions by responsibility

3. S3 Storage Support

  • Added WorkerEval class for worker-side environment variable evaluation (sketched after this list)
  • Implemented _resolve_lazy_values() for recursive lazy evaluation
  • Updated S3 configuration to support distributed credentials
  • Added comprehensive S3 example configuration in example_cms/configs/skim.py
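
For orientation, here is a minimal sketch of how these two pieces could fit together; the actual WorkerEval in utils/schema.py and _resolve_lazy_values() in utils/skimming.py may differ in detail:

from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class WorkerEval:
    """Wraps a zero-argument callable whose evaluation is deferred to the worker."""
    fn: Callable[[], Any]

def _resolve_lazy_values(obj: Any) -> Any:
    """Recursively replace WorkerEval placeholders with their evaluated values."""
    if isinstance(obj, WorkerEval):
        return obj.fn()
    if isinstance(obj, dict):
        return {key: _resolve_lazy_values(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_resolve_lazy_values(value) for value in obj)
    return obj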

4. Interactive Demo

  • Added demo_workflow.ipynb with full workflow demonstration
  • Configurable DIST/S3 flags for flexible deployment (see the sketch after this list)
  • Clear documentation of configuration combinations
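
As a rough illustration (the notebook's actual cells may be organized differently), the two flags could gate the executor and storage choices like so:

# Hypothetical notebook cell; demo_workflow.ipynb may wire these flags differently.
DIST = True   # True: run on a distributed Dask cluster; False: run locally
S3 = True     # True: write skims to S3; False: write to local disk

executor = "dask" if DIST else "local"
protocol = "s3" if S3 else "file"
print(f"Running with executor={executor!r} and storage protocol={protocol!r}")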

Breaking Changes

Function Renamings

utils/metadata_extractor.py:

  • _parse_dataset() → parse_dataset_key()
  • summarise_nanoaods() → summarize_event_counts()

utils/skimming.py:

  • workitem_analysis() → process_workitem()
  • reduce_results() → merge_results()
  • _build_output_suffix() → _build_output_path()
  • process_workitems_with_skimming() → process_and_load_events()

New Constants

# utils/metadata_extractor.py
DATASET_DELIMITER = "__"
DEFAULT_VARIATION = "nominal"

# utils/skimming.py
COUNTER_DELIMITER = "::"
ENTRY_RANGE_DELIMITER = "_"
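
As a hedged illustration of how the dataset-key delimiters come into play, a key such as "ttbar__scale_up" could be split on DATASET_DELIMITER; the real parse_dataset_key() signature and key format may differ, so the helper below is hypothetical:

# Hypothetical usage; mirrors the constants above but not necessarily the real API.
DATASET_DELIMITER = "__"
DEFAULT_VARIATION = "nominal"

def split_dataset_key(key: str) -> tuple[str, str]:
    """Split 'process__variation' into its parts, defaulting the variation."""
    process, _, variation = key.partition(DATASET_DELIMITER)
    return process, variation or DEFAULT_VARIATION

assert split_dataset_key("ttbar__scale_up") == ("ttbar", "scale_up")
assert split_dataset_key("ttbar") == ("ttbar", DEFAULT_VARIATION)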

Files Changed

Core utilities:

  • utils/schema.py - Added WorkerEval class
  • utils/skimming.py - Complete reorganization + lazy evaluation support
  • utils/metadata_extractor.py - Improved naming and documentation
  • utils/datasets.py - Minor updates for consistency

Configuration:

  • example_cms/configs/skim.py - New file with S3 storage configuration
  • example_cms/configs/configuration.py - New consolidated config
  • example_opendata/configs/*.py - Updated configs for opendata example

Documentation:

  • demo_workflow.ipynb - New interactive demonstration notebook
  • README.md - Updated to reflect new structure

Entry points:

  • analysis.py - Updated to use new function names
  • dev/dev_test_skimming*.py - Updated to use new function names

Migration Guide

For Users

Update function calls in your code:

# Before
from utils.skimming import process_workitems_with_skimming
datasets = process_workitems_with_skimming(workitems, config, ...)

# After
from utils.skimming import process_and_load_events
datasets = process_and_load_events(workitems, config, ...)

For S3 Storage

Configure worker-side credentials using WorkerEval:

from utils.schema import WorkerEval
import os

skimming_config = {
    "output": {
        "format": "parquet",
        "protocol": "s3",
        "to_kwargs": {
            "storage_options": {
                "key": WorkerEval(lambda: os.environ['AWS_ACCESS_KEY_ID']),
                "secret": WorkerEval(lambda: os.environ['AWS_SECRET_ACCESS_KEY']),
                "client_kwargs": {
                    "endpoint_url": "https://your-s3-endpoint.com"
                }
            }
        }
    }
}
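
Since the WorkerEval lambdas are only evaluated once the configuration reaches a worker, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set in the worker environment rather than only on the submitting client, and the plaintext credentials are never embedded in the serialized configuration.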

MoAly98 and others added 30 commits October 7, 2025 10:48
- Add Dataset dataclass to encapsulate logical datasets across multiple directories
- Support multiple directories with corresponding cross-sections per dataset
- Always create separate fileset entries for multi-directory datasets
- Histograms naturally accumulate during analysis (no explicit aggregation needed)
- Update metadata extraction to handle directory/cross-section mapping
- Update skimming to populate Dataset.events with per-directory metadata
- Update analysis pipeline to process Dataset objects instead of dict
- Add all CMS datasets to skim_demo.py config with cross-section extraction helper

- Update skimming cells to use Dataset objects instead of fileset dict
- Update analysis cells to iterate over Dataset objects
- Update output display to show Dataset structure with splits

2018 has runs A,B,C,D while 2016/2017 have B,C,D,E,F
…f files written on coffea casa (no s3 integration)
@MoAly98 changed the title from "refactor: many fixes, refactoring and adding features to support running on coffea-casa" to "refactor: Improve skimming and metadata code organization, naming, and S3 support" on Oct 23, 2025
@alexander-held (Member) left a comment:


Let's make sure we squash when merging as there are some fairly big files in the history here.
