Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Interval work into Integration branch #419

Open
wants to merge 94 commits into
base: v0.2-integration
Choose a base branch
from

Conversation

R7L208
Copy link
Contributor

@R7L208 R7L208 commented Mar 13, 2025

Changes

This Pr merges Interval work that splits out functionality between IntervalsDF and other components to facilitate the v0.2-integration work. Development will continue on the v0.2-integration branch once this is merged.

Linked issues

Resolves #..

Functionality

  • added relevant user documentation
  • added a new Class method
  • modified existing Class method: ...
  • added a new function
  • modified existing function: ...
  • added a new test
  • modified existing test: ...
  • added a new example
  • modified existing example: ...
  • added a new utility
  • modified existing utility: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)

Updated .gitignore to include the PyCharm Databricks plugin directory. This change helps to keep unnecessary files from being tracked in the repository. It maintains a cleaner working directory and avoids potential clutter.
Added an IntervalContext class to handle interval-related data effectively. This change centralizes the management of start, end, and metric column data, reducing redundancy in function parameters. The refactoring also enhances code readability and maintenance by encapsulating interval attributes within a single entity, thereby simplifying function signatures and logic.
Replaced start and end timestamp arguments with IntervalContext in all test cases to improve readability and maintainability. This change encapsulates interval details within a single context object, ensuring consistent behavior across interval-related test functions. Additionally, some invalid test cases were commented out to prevent unintended test failures.
Refactors the IntervalContext class to a more streamlined Interval class, enhancing code readability and structure. This change simplifies function signatures by embedding data directly into the Interval class, reducing complexity and redundancy across various interval operations. Corresponding updates have been applied to tests to maintain consistency and ensure functionality.
This refactor improves the interval overlap resolution logic by modularizing each scenario into separate functions for better readability and maintainability. The helper functions validate input intervals, handle specific overlapping scenarios, and ensure that the intervals are processed in the correct order. Additionally, this update removes optional interval parameters and streamlines parameter naming for clarity.
Removed redundant and overly complex argument handling in `add_as_disjoint`, simplifying its use and improving readability. Added a `boundaries` property to `Interval` for cleaner boundary access. Updated relevant tests to match the function interface changes.
Introduce a class method `Interval.create` for cleaner initialization. Enhance `make_disjoint_wrap` by adding detailed documentation and handling for empty DataFrames. Refine test coverage to validate column integrity in disjoint interval results.
The `create` method was redundant and no longer utilized in the codebase. Its removal simplifies the class and avoids unnecessary maintenance. All impacted references have been updated accordingly.
…erformance.

Replaces the `itertools.islice`-based loop with a cleaner, function-based approach using `DataFrame.apply`. This improves readability and makes the code more concise by encapsulating logic in a dedicated function.
Reorganized interval logic into an `OverlapResolver` class for better modularity and added several validation checks. Introduced methods to handle cases for overlap resolution, boundary equivalence, and metric column merging. Simplified interval comparisons and ensured intervals are processed consistently.
Moved `update_interval_boundary` functionality into the `OverlapResolver` class and introduced `update_other_boundary` for handling `other` intervals. Updated all calls and tests to use the new instance methods. This improves encapsulation and simplifies interval management.
Reordered arguments in merge_metrics calls to ensure consistency and improve code readability. This change affects multiple sections where new_interval and new_other are swapped for clarity and logical alignment.
Encapsulated interval utility functions into the `IntervalsUtils` class for better modularity and reusability. Adjusted the logic in related methods and tests to leverage the new class structure while preserving existing functionality and test coverage.
Replaced `identify_interval_overlaps` with a more modular implementation using `_calculate_all_overlaps` and `find_overlaps`. Moved `resolve_all_overlaps` into the `IntervalsUtils` class for better encapsulation. Updated tests to reflect these changes, ensuring consistent functionality.
Introduced abstract classes and specific implementations for overlap checking, resolution, and metric merging strategies. This modular approach replaces hardcoded conditional checks, enabling better scalability and readability. Simplified `IntervalTransformer` by delegating responsibilities to newly added checkers and resolvers.
Replaced string-based overlap types with an Enum for improved readability and maintainability. This standardizes overlap type references and simplifies their usage across checkers and resolvers.
Introduced `IntervalValidator` for modular validation logic and replaced generic exceptions with specific error classes like `InvalidTimestampError`. Updated the `Interval` class to use the new validator, simplifying initialization. Tests were updated to reflect the new validation structure and error handling.
Introduce `IntervalOverlapDetector` class to streamline overlap detection using a strategy pattern. Replace specific overlap types (e.g., `START_BOUNDARY_EQUAL`, `END_BOUNDARY_EQUAL`) with generalized `COMMON_START` and `COMMON_END` types. This refactoring simplifies checker and resolver management, making the code more extensible and maintainable.
Renames `IntervalOverlapDetector` to `OverlapDetector` for clarity and updates checker ordering to improve logic flow. Adds `PartialOverlapChecker` to enhance overlap detection and integrates it into detection and resolution workflows. Reorganizes overlap type handling for better maintainability and readability.
Simplified the resolution process by delegating overlap detection and resolver selection to the `OverlapDetector`. Removed redundant conditional checks and streamlined the workflow to ensure maintainability and better error handling. Updated the `EquivalentMetricsResolver` return type for consistency.
The docstring was unnecessary as the method names and logic in the code are self-explanatory. This simplification improves code readability and avoids duplication of information.
Eliminated the `self.checkers` attribute, as it was unused in the code. This improves code maintainability by removing unnecessary complexity and reducing potential confusion.
Introduced ResolutionManager and ResolutionResult classes to manage overlap resolution strategies. Updated IntervalTransformer to utilize the new ResolutionManager for resolving overlaps. This improves modularity and extensibility for interval processing.
The `resolvers` dictionary was unnecessary and not referenced anywhere in the code. Its removal simplifies the `intervals.py` file by eliminating unused code, improving maintainability and clarity.
Changed `start_ts` and `end_ts` parameters to be required instead of optional in the interval class. Also removed unused resolver lookup code, simplifying overlap resolution logic.
Changed `start_ts` and `end_ts` parameters to be required instead of optional in the interval class. Also removed unused resolver lookup code, simplifying overlap resolution logic.
R7L208 added 4 commits March 13, 2025 15:13
Replaced row-wise DataFrame operations with numpy-based processing to improve performance. Simplified logic for creating and handling disjoint intervals, and ensured better compatibility by using pandas Series. Updated imports for necessary type hints and utilities.
Ensure `series_ids` is correctly handled when None is passed by initializing it as an empty list before validation. Also, improve type hints and streamline the `_validate_series_ids` method for better clarity and robustness.
@R7L208 R7L208 requested a review from Copilot March 13, 2025 22:49
@R7L208 R7L208 self-assigned this Mar 13, 2025
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR merges major changes to the interval handling system—including overlap resolution, datetime format inference, metric merging, boundary conversion, interval validation, and improvements to as‐of joins. Key changes include:

  • Implementation of multiple overlap resolver and checker classes for intervals.
  • New utilities for datetime inference, metric merging, boundaries conversion, and interval DataFrame operations.
  • Enhancements and refactoring in core interval classes, exceptions, and as‑of join strategies.

Reviewed Changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated no comments.

Show a summary per file
File Description
python/tempo/intervals/overlap/resolution.py New overlap resolver classes for different overlap scenarios
python/tempo/intervals/datetime/utils.py Implements datetime format inference with multiple regex replacements
python/tempo/intervals/metrics/merger.py Adds metric merger base class and default implementation
python/tempo/intervals/core/boundaries.py Introduces boundary conversion and helper classes for intervals
python/tempo/intervals/overlap/detection.py New overlap detection strategies for various interval relationships
python/tempo/intervals/core/validation.py Adds interval data and column validation helpers
python/tempo/intervals/metrics/operations.py New configuration and operations for metric merge strategies
python/tempo/intervals/core/intervals_df.py Enhancements for wrapping Spark DataFrames for intervals
python/tempo/intervals/core/exceptions.py Centralizes interval-related exception definitions and error messages
python/tempo/intervals/core/utils.py New utilities for handling and resolving overlapping intervals
python/tempo/intervals/core/interval.py Core interval class with operations for updating boundaries and merging metrics
python/tempo/intervals/core/types.py Defines interval boundary and metric value types
python/tempo/as_of_join.py Refactors as‑of join strategy implementations and utility functions
Comments suppressed due to low confidence (2)

python/tempo/intervals/core/utils.py:236

  • [nitpick] Using a nonlocal variable and reassigning it within the nested function 'resolve_and_add' can make the code harder to follow and maintain. Consider refactoring this logic to use an accumulator or return a new DataFrame without modifying a nonlocal variable.
nonlocal disjoint_intervals

python/tempo/as_of_join.py:63

  • [nitpick] Consider using the '&' operator (e.g. set(left.columns) & set(right.columns)) for set intersections to improve readability and conciseness.
return set(left.columns).intersection(set(right.columns)) - self.commonSeriesIDs(left, right)

@R7L208 R7L208 requested a review from kwang-databricks March 20, 2025 17:25
Copy link
Collaborator

@rportilla-databricks rportilla-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the few areas of mention.

leading_margin: int = 1,
lagging_margin: int = 0
lagging_margin: int = 0,
) -> TSDF:
"""
Interpolate missing values in a time series column.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add docstrings for the parameters of this method. This is a major function.

num_intervals: Optional[int] = None,
ts_colname: str = "ts",
include_interval_ends: bool = False,
) -> DataFrame:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a docstring here to describe exactly what the start/end are, optional parameters, etc.

# return only the rows that were missing (others are margins)
return pdf[any_interpol_mask]

return interpolator_fn


def interpolate(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function now handles a lot more than interpolation — includes time alignment, data selection, and now interval logic. It’s worth breaking into sub-methods or even a utility class.

from tempo.intervals.metrics.operations import MetricMergeConfig


class Interval:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add optional logging or .explain() methods in the Interval class to allow devs to trace how intervals are merged, discarded, or fail validation.

Copy link
Collaborator

@rportilla-databricks rportilla-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update tests. Right now, imports from tempo are broken (from tempo import * to run a basic test).

Tests appear to be failing here also - https://github.com/databrickslabs/tempo/actions/runs/14095291765/job/39481313776

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants