@MohannadAK
Collaborator

Summary

This PR adds time series feature engineering to BigFeat while keeping the existing implementation fully backward compatible. When time series features are disabled (the default), the library behaves identically to the original version.

Motivation

  • Gap in Time Series Support: Original BigFeat lacked temporal awareness for time series data
  • Growing Demand: Time series feature engineering is crucial for financial, IoT, and forecasting applications
  • Preserve Existing Functionality: Ensure zero breaking changes for current users
  • Extend Operator Set: Add powerful temporal operators while maintaining BigFeat's core philosophy

Key Features Added

Time Series Operators (15 New)

  • Rolling Operations: rolling_mean, rolling_std, rolling_min, rolling_max, rolling_median, rolling_sum
  • Temporal Transforms: lag_feature, diff_feature, pct_change, momentum
  • Advanced Analytics: ewm, seasonal_decompose, trend_feature
  • Cyclical Patterns: weekday_mean, month_mean
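
To give reviewers a feel for what these operators compute, here is a minimal pandas sketch of five of them. The function bodies are assumptions written for illustration (the names follow the list above, but the window handling, defaults, and signatures in this PR may differ).

```python
import numpy as np
import pandas as pd


def rolling_mean(s: pd.Series, window: int = 3) -> pd.Series:
    # Trailing window; min_periods=1 averages whatever history exists so far.
    return s.rolling(window, min_periods=1).mean()


def lag_feature(s: pd.Series, periods: int = 1) -> pd.Series:
    # Value observed `periods` rows earlier (NaN where no history exists).
    return s.shift(periods)


def diff_feature(s: pd.Series, periods: int = 1) -> pd.Series:
    # Change relative to the value `periods` rows earlier.
    return s.diff(periods)


def ewm(s: pd.Series, span: int = 3) -> pd.Series:
    # Exponentially weighted mean; adjust=False gives the recursive form.
    return s.ewm(span=span, adjust=False).mean()


def weekday_mean(s: pd.Series) -> pd.Series:
    # Mean of all values sharing the row's weekday. Requires a DatetimeIndex.
    # Note: this uses the whole series, including later observations.
    return s.groupby(s.index.dayofweek).transform("mean")


# Illustrative data only: 14 daily observations starting on a Wednesday.
idx = pd.date_range("2020-01-01", periods=14, freq="D")
sales = pd.Series(np.arange(14, dtype=float), index=idx)

features = pd.DataFrame({
    "rolling_mean_3": rolling_mean(sales, 3),
    "lag_1": lag_feature(sales, 1),
    "diff_1": diff_feature(sales, 1),
    "ewm_3": ewm(sales, 3),
    "weekday_mean": weekday_mean(sales),
})
```

Each function returns a series aligned with its input, so the results can be stacked as new feature columns without re-indexing.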

DateTime-Aware Processing

  • Automatic detection and handling of datetime columns
  • Time-based window operations using pandas Timedelta
  • Support for grouped time series (multiple entities in one dataset)
  • Flexible time period parsing ('7D', '30D', '3M', '1Y', etc.)
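
One detail worth checking in review: day and week periods have a fixed length and map cleanly to a Timedelta, while months and years do not. The sketch below shows one possible way to parse the period strings, using a hypothetical helper `parse_period` (not necessarily how this PR does it) that returns a `pd.Timedelta` for 'D'/'W' and a calendar-aware `pd.DateOffset` for 'M'/'Y'.

```python
import re

import pandas as pd

# Case-sensitive on purpose: lowercase 'm' conventionally means minutes.
_PERIOD_RE = re.compile(r"^(\d+)([DWMY])$")


def parse_period(period: str):
    """Hypothetical parser for strings such as '7D', '2W', '3M', '1Y'."""
    match = _PERIOD_RE.match(period.strip())
    if match is None:
        raise ValueError(f"Unrecognised period: {period!r}")
    count, unit = int(match.group(1)), match.group(2)
    if unit == "D":
        return pd.Timedelta(days=count)
    if unit == "W":
        return pd.Timedelta(weeks=count)
    if unit == "M":
        # Months vary in length, so use a calendar offset.
        return pd.DateOffset(months=count)
    return pd.DateOffset(years=count)


# Fixed-length periods can be used directly as a time-based rolling window.
daily = pd.Series(
    [0.0, 1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)
rolled = daily.rolling(parse_period("3D")).mean()

# Calendar periods are better suited to date arithmetic, e.g. locating a lag.
one_month_before = pd.Timestamp("2020-03-31") - parse_period("1M")
```

Under this scheme, calendar offsets would be used for lags and date lookups, and fixed-length periods for rolling windows, since pandas time-based rolling windows need a fixed frequency.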

Robust Implementation

  • Intelligent fallback mechanisms for all operators
  • Comprehensive error handling and data validation
  • Memory-efficient processing for large time series
  • Clean separation between time series and standard operations

Technical Implementation

New Parameters

BigFeat(
    task_type='classification',          # Original parameter
    enable_time_series=False,            # Enable time series features
    window_sizes=['7D', '30D', '90D'],   # Rolling window sizes
    lag_periods=['1D', '7D', '14D'],     # Lag periods
    datetime_col='timestamp',            # DateTime column name
    groupby_cols=['entity_id'],          # Grouping columns
    verbose=True                         # Progress reporting
)

Smart DataFrame Handling

  • Automatically excludes datetime columns from feature engineering
  • Preserves temporal information for time-based operations
  • Falls back to numpy arrays when a DataFrame is not provided
  • Handles mixed data types gracefully
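
As a rough illustration of the handling described above, the following sketch separates the datetime column from the numeric features and passes plain arrays through untouched. `split_inputs` is a hypothetical helper written for this description, and dropping non-numeric columns is an assumption about how "mixed data types" are treated.

```python
import numpy as np
import pandas as pd


def split_inputs(X, datetime_col=None):
    """Return (numeric feature matrix, timestamps or None)."""
    if not isinstance(X, pd.DataFrame):
        # Non-DataFrame input: no temporal information is available.
        return np.asarray(X, dtype=float), None

    timestamps = None
    if datetime_col is not None and datetime_col in X.columns:
        # Keep the timestamps for time-based operations ...
        timestamps = pd.to_datetime(X[datetime_col])
        # ... but exclude the column from feature engineering.
        X = X.drop(columns=[datetime_col])

    # Assumption: only numeric columns are fed to the operators.
    numeric = X.select_dtypes(include="number")
    return numeric.to_numpy(dtype=float), timestamps


frame = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=3, freq="D"),
    "sales": [1.0, 2.0, 3.0],
    "store": ["a", "b", "c"],
})
values, stamps = split_inputs(frame, datetime_col="timestamp")
array_values, array_stamps = split_inputs([[1, 2], [3, 4]])
```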

Backward Compatibility

Zero Breaking Changes

  • Default behavior: enable_time_series=False
  • All existing methods have identical signatures
  • Same output format and data types
  • Identical results for same inputs with same random seeds

Before/After Comparison

# Original usage (unchanged)
bf = BigFeat(task_type='classification')
features = bf.fit(X, y)

# Enhanced usage (new capabilities)
bf = BigFeat(
    task_type='classification',
    enable_time_series=True,
    datetime_col='timestamp'
)
features = bf.fit(df_with_datetime, y)

Testing Strategy

Regression Testing

  • All original test cases pass unchanged
  • Same random seeds produce identical results (time series disabled)
  • Performance benchmarks maintained for standard usage
  • Memory usage comparable for non-time-series operations

New Feature Testing

  • Time series operators with various window sizes
  • Grouped time series processing
  • Edge cases (missing data, irregular intervals)
  • DataFrame vs numpy array input handling
  • Error handling and fallback mechanisms
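
For the edge cases, tests in the following style may be useful. This is a sketch using a hypothetical `rolling_mean_time` helper built directly on pandas, not the operator from this PR. It checks that a time-based window ignores observations on the far side of a gap and skips missing values.

```python
import numpy as np
import pandas as pd


def rolling_mean_time(values, timestamps, window="7D"):
    # Time-based trailing window: covers (t - window, t] for each timestamp t.
    s = pd.Series(values, index=pd.DatetimeIndex(timestamps)).sort_index()
    return s.rolling(window, min_periods=1).mean()


def test_irregular_intervals():
    stamps = pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-10", "2020-01-11"]
    )
    out = rolling_mean_time([1.0, 3.0, 10.0, 20.0], stamps, window="7D")
    # After the 8-day gap, the window no longer sees the first two rows.
    assert np.allclose(out.to_numpy(), [1.0, 2.0, 10.0, 15.0])


def test_missing_values():
    stamps = pd.date_range("2020-01-01", periods=4, freq="D")
    out = rolling_mean_time([1.0, np.nan, 3.0, np.nan], stamps, window="2D")
    # NaNs are skipped; each mean uses the non-missing values in the window.
    assert np.allclose(out.to_numpy(), [1.0, 1.0, 3.0, 3.0])


test_irregular_intervals()
test_missing_values()
```

A row-count window (for example `rolling(7)`) would behave differently on the first case, which is why the irregular-interval test is worth keeping separate.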

Performance Impact

Standard Operations

  • No performance degradation when enable_time_series=False
  • Comparable memory usage for existing workflows
  • Same computational complexity for original operators

Time Series Operations

  • Efficient pandas-based rolling operations
  • Vectorized computations where possible
  • Lazy evaluation to minimize memory usage
  • Intelligent caching for group operations

Usage Examples

Basic Time Series Enhancement

import numpy as np
import pandas as pd

# Time series data
df = pd.DataFrame({
    'timestamp': pd.date_range('2020-01-01', periods=1000),
    'sales': np.random.randn(1000).cumsum(),
    'price': np.random.randn(1000) + 100
})

# Enhanced BigFeat
bf = BigFeat(
    enable_time_series=True,
    datetime_col='timestamp',
    window_sizes=['7D', '30D', '90D']
)

# target: labels aligned with the rows of df (not defined in this snippet)
features = bf.fit(df, target)

Multi-Entity Time Series

# Multiple time series in one dataset
df = pd.DataFrame({
    'timestamp': pd.date_range('2020-01-01', periods=1000).repeat(3),
    'entity_id': ['A', 'B', 'C'] * 1000,
    'value': np.random.randn(3000)
})

bf = BigFeat(
    enable_time_series=True,
    datetime_col='timestamp',
    groupby_cols=['entity_id']
)

features = bf.fit(df, target)
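
The key property of grouped processing is that a window for one entity never includes rows from another. A minimal sketch of that behaviour, assuming a unique DataFrame index and using a hypothetical `grouped_rolling_mean` helper (the PR's internal method may be structured differently):

```python
import numpy as np
import pandas as pd


def grouped_rolling_mean(df, value_col, datetime_col, groupby_cols, window="7D"):
    """Time-based rolling mean computed separately for each entity."""
    out = pd.Series(np.nan, index=df.index, dtype=float)
    for _, group in df.groupby(list(groupby_cols)):
        # Time-based windows need timestamps in increasing order.
        ordered = group.sort_values(datetime_col)
        rolled = (
            ordered.set_index(datetime_col)[value_col]
            .rolling(window)
            .mean()
        )
        # Write results back to the original row positions.
        out.loc[ordered.index] = rolled.to_numpy()
    return out


demo = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=3, freq="D").repeat(2),
    "entity_id": ["A", "B"] * 3,
    "value": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0],
})
demo["value_rolling_2D"] = grouped_rolling_mean(
    demo, "value", "timestamp", ["entity_id"], window="2D"
)
```

The explicit loop keeps the sketch easy to follow; a vectorized `groupby(...).rolling(...)` call would be the natural optimization for many entities.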

Code Quality

Architecture

  • Clean separation of concerns between time series and standard operations
  • Modular design with dedicated time series utility methods
  • Consistent error handling patterns across all new methods
  • Comprehensive documentation and type hints

Error Handling

  • All time series operations wrapped in try/except blocks
  • Graceful fallbacks to standard operations when time series fails
  • Data validation at multiple stages
  • Informative warning messages for debugging
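
The pattern described above can be sketched as a small wrapper. `with_fallback` and `strict_rolling_mean` are hypothetical names for this illustration; the actual fallback choices and warning text in the PR may differ.

```python
import functools
import warnings

import numpy as np


def with_fallback(operator, fallback):
    """Wrap `operator` so that any failure warns and uses `fallback` instead."""
    @functools.wraps(operator)
    def wrapped(values, *args, **kwargs):
        try:
            result = np.asarray(operator(values, *args, **kwargs), dtype=float)
            # Validate the output before trusting it.
            if result.shape != np.shape(values):
                raise ValueError("operator changed the feature length")
            return result
        except Exception as exc:
            warnings.warn(
                f"{operator.__name__} failed ({exc}); using fallback",
                RuntimeWarning,
            )
            return np.asarray(fallback(values), dtype=float)
    return wrapped


def strict_rolling_mean(values, window=3):
    # Trailing mean that refuses inputs shorter than the window.
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        raise ValueError(f"need at least {window} points")
    # Pad with the first value so the output keeps the input length.
    padded = np.concatenate([np.full(window - 1, values[0]), values])
    return np.convolve(padded, np.ones(window) / window, mode="valid")


# Fallback here is the identity: the raw feature is kept unchanged.
safe_rolling_mean = with_fallback(strict_rolling_mean, fallback=lambda v: v)

ok = safe_rolling_mean([1.0, 2.0, 3.0, 4.0])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    short = safe_rolling_mean([1.0, 2.0])
```

Catching a broad `Exception` keeps feature generation running, at the cost of hiding bugs, so the warning message is what makes failures visible during debugging.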

Documentation

  • Comprehensive docstrings for all new methods
  • Updated README with time series examples
  • Migration guide for existing users
  • API reference for new parameters

Benefits

For Existing Users

  • Zero disruption: Continue using BigFeat exactly as before
  • Optional upgrade path: Enable time series when needed
  • Same performance: No overhead when time series disabled

For Time Series Users

  • Powerful operators: 15 new temporal feature engineering operators
  • Production ready: Robust error handling and performance optimization
  • Flexible configuration: Customizable windows and lag periods
  • Multi-entity support: Handle complex time series datasets

For the Ecosystem

  • Expanded use cases: BigFeat now applicable to time series domains
  • Maintained philosophy: Automatic feature engineering with temporal awareness
  • Research potential: New opportunities for time series feature discovery

Future Enhancements

This implementation provides a solid foundation for future time series enhancements:

  • Additional seasonal decomposition algorithms
  • Frequency domain features (FFT-based)
  • Advanced trend detection methods
  • Cross-series relationship features

Checklist

  • Code implementation complete
  • All existing tests pass
  • New functionality tested
  • Documentation updated
  • Performance benchmarked
  • Backward compatibility verified
  • Error handling implemented
  • Code reviewed internally

Review Focus Areas

  1. Backward Compatibility: Verify that existing functionality is unchanged
  2. Time Series Logic: Review temporal operation implementations
  3. Error Handling: Check robustness of fallback mechanisms
  4. Performance: Ensure no regression for standard operations
  5. Documentation: Confirm clarity of new parameters and usage

This PR transforms BigFeat into a comprehensive feature engineering tool that handles both traditional and time series data while preserving the simplicity and power of the original design.
