-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Interval work into Integration branch #419
base: v0.2-integration
Are you sure you want to change the base?
Conversation
Updated .gitignore to include the PyCharm Databricks plugin directory. This change helps to keep unnecessary files from being tracked in the repository. It maintains a cleaner working directory and avoids potential clutter.
Added an IntervalContext class to handle interval-related data effectively. This change centralizes the management of start, end, and metric column data, reducing redundancy in function parameters. The refactoring also enhances code readability and maintenance by encapsulating interval attributes within a single entity, thereby simplifying function signatures and logic.
Replaced start and end timestamp arguments with IntervalContext in all test cases to improve readability and maintainability. This change encapsulates interval details within a single context object, ensuring consistent behavior across interval-related test functions. Additionally, some invalid test cases were commented out to prevent unintended test failures.
Refactors the IntervalContext class to a more streamlined Interval class, enhancing code readability and structure. This change simplifies function signatures by embedding data directly into the Interval class, reducing complexity and redundancy across various interval operations. Corresponding updates have been applied to tests to maintain consistency and ensure functionality.
This refactor improves the interval overlap resolution logic by modularizing each scenario into separate functions for better readability and maintainability. The helper functions validate input intervals, handle specific overlapping scenarios, and ensure that the intervals are processed in the correct order. Additionally, this update removes optional interval parameters and streamlines parameter naming for clarity.
Removed redundant and overly complex argument handling in `add_as_disjoint`, simplifying its use and improving readability. Added a `boundaries` property to `Interval` for cleaner boundary access. Updated relevant tests to match the function interface changes.
Introduce a class method `Interval.create` for cleaner initialization. Enhance `make_disjoint_wrap` by adding detailed documentation and handling for empty DataFrames. Refine test coverage to validate column integrity in disjoint interval results.
The `create` method was redundant and no longer utilized in the codebase. Its removal simplifies the class and avoids unnecessary maintenance. All impacted references have been updated accordingly.
…erformance. Replaces the `itertools.islice`-based loop with a cleaner, function-based approach using `DataFrame.apply`. This improves readability and makes the code more concise by encapsulating logic in a dedicated function.
Reorganized interval logic into an `OverlapResolver` class for better modularity and added several validation checks. Introduced methods to handle cases for overlap resolution, boundary equivalence, and metric column merging. Simplified interval comparisons and ensured intervals are processed consistently.
Moved `update_interval_boundary` functionality into the `OverlapResolver` class and introduced `update_other_boundary` for handling `other` intervals. Updated all calls and tests to use the new instance methods. This improves encapsulation and simplifies interval management.
Reordered arguments in merge_metrics calls to ensure consistency and improve code readability. This change affects multiple sections where new_interval and new_other are swapped for clarity and logical alignment.
Encapsulated interval utility functions into the `IntervalsUtils` class for better modularity and reusability. Adjusted the logic in related methods and tests to leverage the new class structure while preserving existing functionality and test coverage.
Replaced `identify_interval_overlaps` with a more modular implementation using `_calculate_all_overlaps` and `find_overlaps`. Moved `resolve_all_overlaps` into the `IntervalsUtils` class for better encapsulation. Updated tests to reflect these changes, ensuring consistent functionality.
Introduced abstract classes and specific implementations for overlap checking, resolution, and metric merging strategies. This modular approach replaces hardcoded conditional checks, enabling better scalability and readability. Simplified `IntervalTransformer` by delegating responsibilities to newly added checkers and resolvers.
Replaced string-based overlap types with an Enum for improved readability and maintainability. This standardizes overlap type references and simplifies their usage across checkers and resolvers.
Introduced `IntervalValidator` for modular validation logic and replaced generic exceptions with specific error classes like `InvalidTimestampError`. Updated the `Interval` class to use the new validator, simplifying initialization. Tests were updated to reflect the new validation structure and error handling.
Introduce `IntervalOverlapDetector` class to streamline overlap detection using a strategy pattern. Replace specific overlap types (e.g., `START_BOUNDARY_EQUAL`, `END_BOUNDARY_EQUAL`) with generalized `COMMON_START` and `COMMON_END` types. This refactoring simplifies checker and resolver management, making the code more extensible and maintainable.
Renames `IntervalOverlapDetector` to `OverlapDetector` for clarity and updates checker ordering to improve logic flow. Adds `PartialOverlapChecker` to enhance overlap detection and integrates it into detection and resolution workflows. Reorganizes overlap type handling for better maintainability and readability.
Simplified the resolution process by delegating overlap detection and resolver selection to the `OverlapDetector`. Removed redundant conditional checks and streamlined the workflow to ensure maintainability and better error handling. Updated the `EquivalentMetricsResolver` return type for consistency.
The docstring was unnecessary as the method names and logic in the code are self-explanatory. This simplification improves code readability and avoids duplication of information.
Eliminated the `self.checkers` attribute, as it was unused in the code. This improves code maintainability by removing unnecessary complexity and reducing potential confusion.
Introduced ResolutionManager and ResolutionResult classes to manage overlap resolution strategies. Updated IntervalTransformer to utilize the new ResolutionManager for resolving overlaps. This improves modularity and extensibility for interval processing.
The `resolvers` dictionary was unnecessary and not referenced anywhere in the code. Its removal simplifies the `intervals.py` file by eliminating unused code, improving maintainability and clarity.
Changed `start_ts` and `end_ts` parameters to be required instead of optional in the interval class. Also removed unused resolver lookup code, simplifying overlap resolution logic.
Changed `start_ts` and `end_ts` parameters to be required instead of optional in the interval class. Also removed unused resolver lookup code, simplifying overlap resolution logic.
Replaced row-wise DataFrame operations with numpy-based processing to improve performance. Simplified logic for creating and handling disjoint intervals, and ensured better compatibility by using pandas Series. Updated imports for necessary type hints and utilities.
Ensure `series_ids` is correctly handled when None is passed by initializing it as an empty list before validation. Also, improve type hints and streamline the `_validate_series_ids` method for better clarity and robustness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR merges major changes to the interval handling system—including overlap resolution, datetime format inference, metric merging, boundary conversion, interval validation, and improvements to as‐of joins. Key changes include:
- Implementation of multiple overlap resolver and checker classes for intervals.
- New utilities for datetime inference, metric merging, boundaries conversion, and interval DataFrame operations.
- Enhancements and refactoring in core interval classes, exceptions, and as‑of join strategies.
Reviewed Changes
Copilot reviewed 55 out of 55 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
python/tempo/intervals/overlap/resolution.py | New overlap resolver classes for different overlap scenarios |
python/tempo/intervals/datetime/utils.py | Implements datetime format inference with multiple regex replacements |
python/tempo/intervals/metrics/merger.py | Adds metric merger base class and default implementation |
python/tempo/intervals/core/boundaries.py | Introduces boundary conversion and helper classes for intervals |
python/tempo/intervals/overlap/detection.py | New overlap detection strategies for various interval relationships |
python/tempo/intervals/core/validation.py | Adds interval data and column validation helpers |
python/tempo/intervals/metrics/operations.py | New configuration and operations for metric merge strategies |
python/tempo/intervals/core/intervals_df.py | Enhancements for wrapping Spark DataFrames for intervals |
python/tempo/intervals/core/exceptions.py | Centralizes interval-related exception definitions and error messages |
python/tempo/intervals/core/utils.py | New utilities for handling and resolving overlapping intervals |
python/tempo/intervals/core/interval.py | Core interval class with operations for updating boundaries and merging metrics |
python/tempo/intervals/core/types.py | Defines interval boundary and metric value types |
python/tempo/as_of_join.py | Refactors as‑of join strategy implementations and utility functions |
Comments suppressed due to low confidence (2)
python/tempo/intervals/core/utils.py:236
- [nitpick] Using a nonlocal variable and reassigning it within the nested function 'resolve_and_add' can make the code harder to follow and maintain. Consider refactoring this logic to use an accumulator or return a new DataFrame without modifying a nonlocal variable.
nonlocal disjoint_intervals
python/tempo/as_of_join.py:63
- [nitpick] Consider using the '&' operator (e.g. set(left.columns) & set(right.columns)) for set intersections to improve readability and conciseness.
return set(left.columns).intersection(set(right.columns)) - self.commonSeriesIDs(left, right)
* making contributing.md slightly more clear * Remove `Analyze` job from test.yml (#423) - This job called CodeQL which is broken due to new firewall rules --------- Co-authored-by: Lorin Dawson <22798188+R7L208@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the few areas of mention.
leading_margin: int = 1, | ||
lagging_margin: int = 0 | ||
lagging_margin: int = 0, | ||
) -> TSDF: | ||
""" | ||
Interpolate missing values in a time series column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add docstrings for the parameters of this method. This is a major function.
num_intervals: Optional[int] = None, | ||
ts_colname: str = "ts", | ||
include_interval_ends: bool = False, | ||
) -> DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a docstring here to describe exactly what the start/end are, optional parameters, etc.
# return only the rows that were missing (others are margins) | ||
return pdf[any_interpol_mask] | ||
|
||
return interpolator_fn | ||
|
||
|
||
def interpolate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function now handles a lot more than interpolation — includes time alignment, data selection, and now interval logic. It’s worth breaking into sub-methods or even a utility class.
from tempo.intervals.metrics.operations import MetricMergeConfig | ||
|
||
|
||
class Interval: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add optional logging or .explain() methods in the Interval class to allow devs to trace how intervals are merged, discarded, or fail validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update tests. Right now, imports from tempo are broken (from tempo import * to run a basic test).
Tests appear to be failing here also - https://github.com/databrickslabs/tempo/actions/runs/14095291765/job/39481313776
Changes
This Pr merges Interval work that splits out functionality between IntervalsDF and other components to facilitate the v0.2-integration work. Development will continue on the
v0.2-integration
branch once this is merged.Linked issues
Resolves #..
Functionality
...
...
...
...
...
Tests