@MoAly98 (Collaborator) commented on Oct 23, 2025

Overview

This PR refactors the skimming and metadata extraction subsystems to improve code maintainability, discoverability, and user experience. The changes focus on clearer naming, better documentation, logical code organization, and robust S3 storage support for distributed processing.

Key Improvements

1. Naming & Documentation

  • Replaced vague function names with clear, action-oriented names
  • Enhanced docstrings with comprehensive parameter/return documentation
  • Added module-level constants to eliminate magic strings
  • Improved inline comments for complex logic

2. Code Organization

  • Reorganized utils/skimming.py (1094 lines) into 7 logical sections with clear headers
  • Moved main entry point to end of file for better discoverability
  • Grouped related helper functions by responsibility

3. S3 Storage Support

  • Added WorkerEval class for worker-side environment variable evaluation (sketched after this list)
  • Implemented _resolve_lazy_values() for recursive lazy evaluation
  • Updated S3 configuration to support distributed credentials
  • Added comprehensive S3 example configuration in example_cms/configs/skim.py
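
For orientation, here is a minimal sketch of how these two pieces could fit together; the actual WorkerEval in utils/schema.py and _resolve_lazy_values() in utils/skimming.py may differ in detail:

from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class WorkerEval:
    """Wraps a zero-argument callable whose evaluation is deferred to the worker."""
    fn: Callable[[], Any]

def _resolve_lazy_values(obj: Any) -> Any:
    """Recursively replace WorkerEval placeholders with their evaluated values."""
    if isinstance(obj, WorkerEval):
        return obj.fn()
    if isinstance(obj, dict):
        return {key: _resolve_lazy_values(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_resolve_lazy_values(value) for value in obj)
    return obj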

4. Interactive Demo

  • Added demo_workflow.ipynb with full workflow demonstration
  • Configurable DIST/S3 flags for flexible deployment (see the sketch after this list)
  • Clear documentation of configuration combinations
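
As a rough illustration (the notebook's actual cells may be organized differently), the two flags could gate the executor and storage choices like so:

# Hypothetical notebook cell; demo_workflow.ipynb may wire these flags differently.
DIST = True   # True: run on a distributed Dask cluster; False: run locally
S3 = True     # True: write skims to S3; False: write to local disk

executor = "dask" if DIST else "local"
protocol = "s3" if S3 else "file"
print(f"Running with executor={executor!r} and storage protocol={protocol!r}")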

Breaking Changes

Function Renamings

utils/metadata_extractor.py:

  • _parse_dataset() → parse_dataset_key()
  • summarise_nanoaods() → summarize_event_counts()

utils/skimming.py:

  • workitem_analysis() → process_workitem()
  • reduce_results() → merge_results()
  • _build_output_suffix() → _build_output_path()
  • process_workitems_with_skimming() → process_and_load_events()

New Constants

# utils/metadata_extractor.py
DATASET_DELIMITER = "__"
DEFAULT_VARIATION = "nominal"

# utils/skimming.py
COUNTER_DELIMITER = "::"
ENTRY_RANGE_DELIMITER = "_"
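
As a hedged illustration of how the dataset-key delimiters come into play, a key such as "ttbar__scale_up" could be split on DATASET_DELIMITER; the real parse_dataset_key() signature and key format may differ, so the helper below is hypothetical:

# Hypothetical usage; mirrors the constants above but not necessarily the real API.
DATASET_DELIMITER = "__"
DEFAULT_VARIATION = "nominal"

def split_dataset_key(key: str) -> tuple[str, str]:
    """Split 'process__variation' into its parts, defaulting the variation."""
    process, _, variation = key.partition(DATASET_DELIMITER)
    return process, variation or DEFAULT_VARIATION

assert split_dataset_key("ttbar__scale_up") == ("ttbar", "scale_up")
assert split_dataset_key("ttbar") == ("ttbar", DEFAULT_VARIATION)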

Files Changed

Core utilities:

  • utils/schema.py - Added WorkerEval class
  • utils/skimming.py - Complete reorganization + lazy evaluation support
  • utils/metadata_extractor.py - Improved naming and documentation
  • utils/datasets.py - Minor updates for consistency

Configuration:

  • example_cms/configs/skim.py - New file with S3 storage configuration
  • example_cms/configs/configuration.py - New consolidated config
  • example_opendata/configs/*.py - Updated configs for opendata example

Documentation:

  • demo_workflow.ipynb - New interactive demonstration notebook
  • README.md - Updated to reflect new structure

Entry points:

  • analysis.py - Updated to use new function names
  • dev/dev_test_skimming*.py - Updated to use new function names

Migration Guide

For Users

Update function calls in your code:

# Before
from utils.skimming import process_workitems_with_skimming
datasets = process_workitems_with_skimming(workitems, config, ...)

# After
from utils.skimming import process_and_load_events
datasets = process_and_load_events(workitems, config, ...)

For S3 Storage

Configure worker-side credentials using WorkerEval:

from utils.schema import WorkerEval
import os

skimming_config = {
    "output": {
        "format": "parquet",
        "protocol": "s3",
        "to_kwargs": {
            "storage_options": {
                "key": WorkerEval(lambda: os.environ['AWS_ACCESS_KEY_ID']),
                "secret": WorkerEval(lambda: os.environ['AWS_SECRET_ACCESS_KEY']),
                "client_kwargs": {
                    "endpoint_url": "https://your-s3-endpoint.com"
                }
            }
        }
    }
}
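
Since the WorkerEval lambdas are only evaluated once the configuration reaches a worker, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set in the worker environment rather than only on the submitting client, and the plaintext credentials are never embedded in the serialized configuration.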

MoAly98 and others added 30 commits October 7, 2025 10:48
- Add Dataset dataclass to encapsulate logical datasets across multiple directories
- Support multiple directories with corresponding cross-sections per dataset
- Always create separate fileset entries for multi-directory datasets
- Histograms naturally accumulate during analysis (no explicit aggregation needed)
- Update metadata extraction to handle directory/cross-section mapping
- Update skimming to populate Dataset.events with per-directory metadata
- Update analysis pipeline to process Dataset objects instead of dict
- Add all CMS datasets to skim_demo.py config with cross-section extraction helper

- Update skimming cells to use Dataset objects instead of fileset dict
- Update analysis cells to iterate over Dataset objects
- Update output display to show Dataset structure with splits

2018 has runs A,B,C,D while 2016/2017 have B,C,D,E,F
…f files written on coffea casa (no s3 integration)
@MoAly98 changed the title from "refactor: many fixes, refactoring and adding features to support running on coffea-casa" to "refactor: Improve skimming and metadata code organization, naming, and S3 support" on Oct 23, 2025
@alexander-held (Member) left a comment:


Let's make sure we squash when merging as there are some fairly big files in the history here.
