seanmcilroy29 commented Jun 10, 2025

Update estimate_current_region_metadata.py

Major Statistical Improvements

Weighted Trend Calculation: Replaced simple averaging with exponential weighting that gives recent data points more weight, so the trend estimate tracks recent changes instead of treating every year equally.
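As a rough illustration of the idea (not the code in this PR), an exponentially weighted trend over year-over-year changes could look like the sketch below. The function name `weighted_trend` and the `decay` parameter are assumptions made for this example.

```python
from typing import List, Sequence


def weighted_trend(values: Sequence[float], decay: float = 0.5) -> float:
    """Exponentially weighted mean of consecutive changes.

    `decay` is a hypothetical parameter in (0, 1]. The most recent change
    gets weight 1, the one before it `decay`, then `decay**2`, and so on.
    With decay=1 this reduces to the simple average of the changes.
    """
    if len(values) < 2:
        raise ValueError("need at least two data points to compute a trend")
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")

    # Year-over-year differences, oldest first.
    changes: List[float] = [b - a for a, b in zip(values, values[1:])]
    n = len(changes)
    # Oldest change gets the smallest weight, newest gets weight 1.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)
```

For the series 1.0, 2.0, 4.0 the changes are 1.0 and 2.0; with `decay=0.5` the newer change counts twice as much as the older one, pulling the trend above the unweighted mean of 1.5.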

Outlier Detection: Added z-score based outlier detection to filter unreliable historical data before trend calculation.
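A minimal sketch of z-score filtering, assuming a plain list of historical values and a hypothetical `filter_outliers` helper (the PR's actual implementation may differ, for example by operating on pandas columns):

```python
from statistics import mean, pstdev
from typing import List, Sequence


def filter_outliers(values: Sequence[float], threshold: float = 3.0) -> List[float]:
    """Drop values whose z-score magnitude exceeds `threshold`.

    Hypothetical helper: uses the population standard deviation and
    returns the input unchanged when there is too little data or no spread.
    """
    if len(values) < 3:
        return list(values)
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= threshold]
```

One caveat with short histories: using the population standard deviation, no point in a sample of n values can have a z-score above sqrt(n - 1), so with ten data points a threshold of 3.0 never rejects anything. That is one reason a lower threshold such as 2.5 can make sense for yearly data.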

Uncertainty Measures: Now calculates and stores uncertainty metrics based on trend variance for transparency.
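One simple way to derive an uncertainty figure from trend variance is the sample standard deviation of the year-over-year changes. The sketch below assumes that definition and a hypothetical `trend_uncertainty` helper; the metric stored by the script may be computed differently.

```python
from statistics import stdev
from typing import Optional, Sequence


def trend_uncertainty(values: Sequence[float]) -> Optional[float]:
    """Sample standard deviation of consecutive changes.

    Returns None when fewer than two changes are available, so the caller
    can record the metric as not available rather than inventing a value.
    """
    changes = [b - a for a, b in zip(values, values[1:])]
    if len(changes) < 2:
        return None
    return stdev(changes)
```

A perfectly linear history yields an uncertainty of 0.0, while an erratic one yields a larger value that can be stored alongside the estimate.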

Enhanced Data Validation

Comprehensive Input Validation:

  • Validates all required columns per GSF specification
  • Checks data ranges for CFE (0-1), PUE (≥1.0), and carbon intensity (≥0)
  • Warns about outdated or future-dated data
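The range checks listed above can be sketched as a table of rules applied to each row. The column names `cfe`, `pue`, and `carbon_intensity` are placeholders chosen for this example, not the column names defined by the GSF specification.

```python
from typing import Callable, Dict, List, Mapping, Tuple

# Hypothetical column names mapped to (check, human-readable rule).
RANGE_CHECKS: Dict[str, Tuple[Callable[[float], bool], str]] = {
    "cfe": (lambda v: 0.0 <= v <= 1.0, "must be between 0 and 1"),
    "pue": (lambda v: v >= 1.0, "must be at least 1.0"),
    "carbon_intensity": (lambda v: v >= 0.0, "must be non-negative"),
}


def validate_row(row: Mapping[str, float]) -> List[str]:
    """Return a list of problems found in one row (empty if the row is valid)."""
    problems: List[str] = []
    for column, (is_valid, rule) in RANGE_CHECKS.items():
        if column not in row:
            problems.append(f"{column}: missing required column")
        elif not is_valid(row[column]):
            problems.append(f"{column}: {row[column]} {rule}")
    return problems
```

Collecting all problems before failing, rather than raising on the first one, lets a single run report everything that is wrong with an input file.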

Proper Missing Data Handling: Uses "NA" instead of empty strings to comply with the specification requirement that "attempting to consume a not-available or blank metric should cause any calculations to fail".
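To show why "NA" is safer than an empty string, here is a small sketch of a writer and a reader that follow the quoted requirement. Both helper names are hypothetical.

```python
from typing import Optional


def format_metric(value: Optional[float]) -> str:
    """Write a metric for the CSV, using "NA" when no value is available."""
    return "NA" if value is None else str(value)


def parse_metric(raw: str) -> float:
    """Read a metric, failing loudly on "NA" or blank input.

    Raising here means a downstream calculation cannot silently treat a
    missing value as zero.
    """
    text = raw.strip()
    if text == "" or text.upper() == "NA":
        raise ValueError("metric is not available")
    return float(text)
```

A consumer that calls `parse_metric` on an unavailable value stops with an error instead of producing a misleading result.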

Configurable Constraints

Removed Hard-coded Values: All constraints are now configurable through the EstimationConfig class, eliminating arbitrary assumptions like the us-east1 WUE value.
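The shape of such a configuration class might resemble the dataclass below. Only the class name `EstimationConfig` comes from this PR; the field names mirror the command-line flags shown in the usage examples, and the default values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EstimationConfig:
    """Sketch of a configuration object; fields and defaults are assumed."""

    years: int = 1                  # number of years to estimate
    min_pue: float = 1.0            # lower bound applied to PUE estimates
    outlier_threshold: float = 3.0  # z-score cutoff for outlier filtering

    def __post_init__(self) -> None:
        # Reject configurations that contradict the validation rules.
        if self.years < 1:
            raise ValueError("years must be at least 1")
        if self.min_pue < 1.0:
            raise ValueError("min_pue cannot be below 1.0")
        if self.outlier_threshold <= 0:
            raise ValueError("outlier_threshold must be positive")
```

Keeping the constraints in one immutable object makes it straightforward to record the exact parameters used for a run.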

Business Logic Constraints: Properly categorized constraints for different metric types (carbon intensity, CFE percentages, efficiency metrics).

Production-Ready Features

Comprehensive Logging: Added structured logging for debugging, audit trails, and monitoring estimation quality.

Object-Oriented Design: Organized code into a clean class structure for maintainability and testability.

Command-Line Interface: Full argparse implementation with examples and configurable parameters.
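A parser accepting the flags used in the usage examples of this description could be sketched as follows. The flag names come from those examples; the positional argument name, defaults, and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples (sketch)."""
    parser = argparse.ArgumentParser(
        description="Estimate cloud region metadata from historical data."
    )
    # Positional name is hypothetical; the examples pass a CSV path here.
    parser.add_argument("input_file", help="CSV file with historical metadata")
    parser.add_argument("--years", type=int, default=1,
                        help="number of years to estimate")
    parser.add_argument("--output", default=None,
                        help="path of the CSV file to write")
    parser.add_argument("--min-pue", type=float, default=1.0,
                        help="lower bound for estimated PUE")
    parser.add_argument("--outlier-threshold", type=float, default=3.0,
                        help="z-score cutoff for outlier filtering")
    parser.add_argument("--verbose", action="store_true",
                        help="enable debug logging")
    return parser
```

argparse converts dashes in option names to underscores, so `--min-pue` is read back as `args.min_pue`.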

Metadata Generation: Creates accompanying metadata files documenting estimation methodology, parameters, and data lineage.
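A sidecar metadata file of this kind could be produced along these lines. The JSON format, the key names, and both function names are assumptions for illustration; the PR does not state the actual file layout here.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def build_metadata(source_file: str, parameters: Dict[str, Any],
                   method: str = "exponentially weighted trend") -> Dict[str, Any]:
    """Collect the facts needed to reproduce an estimation run (sketch)."""
    return {
        "source_file": source_file,        # data lineage
        "method": method,                  # estimation methodology
        "parameters": dict(parameters),    # configuration used for the run
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


def write_metadata(path: str, metadata: Dict[str, Any]) -> None:
    """Write the metadata next to the estimates as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
```

Storing a UTC timestamp and the full parameter set makes it possible to tell later which inputs and settings produced a given estimates file.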

Error Handling: Robust error handling with meaningful error messages and graceful failure modes.

Performance & Quality

Type Hints: Full type annotation for better IDE support and code documentation.

Vectorized Operations: More efficient pandas operations for better performance with large datasets.

Precision Handling: Improved decimal precision preservation based on input data characteristics.
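One way to preserve precision is to count the decimal places in the input strings and format estimates to match. The sketch below assumes plain decimal notation (no exponents) and uses hypothetical helper names.

```python
from typing import Iterable


def decimal_places(text: str) -> int:
    """Number of digits after the decimal point in a plain decimal string."""
    _, _, fraction = text.strip().partition(".")
    return len(fraction)


def format_like_inputs(value: float, inputs: Iterable[str]) -> str:
    """Format `value` with as many decimals as the most precise input."""
    places = max((decimal_places(s) for s in inputs), default=0)
    return f"{value:.{places}f}"
```

This avoids reporting an estimate with more digits than the historical data it was derived from.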

Fail-Fast Validation: The input checks described above run before any estimation, so malformed data produces a clear error message instead of a runtime failure partway through.

Key Usage Examples

# Basic estimation for 1 year
python estimate_metadata.py data.csv

# Estimate 2 years with custom output
python estimate_metadata.py data.csv --years 2 --output estimates_2025_2026.csv

# Custom parameters with verbose logging
python estimate_metadata.py data.csv --years 3 --min-pue 1.05 --outlier-threshold 2.5 --verbose

The rewritten code now provides:

  • Statistically sound trend calculations
  • Full GSF specification compliance
  • Production-ready robustness
  • Complete audit trail through metadata
  • Configurable parameters for different use cases
  • Comprehensive validation preventing data quality issues

This version is intended to produce cloud region metadata estimates that are reliable enough to use in carbon footprint calculations and regulatory reporting.

Update estimate_current_region_metadata.py

Signed-off-by: Sean Mcilroy <smcilroy@linuxfoundation.org>