Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Clarify Boolean Interval handling and verify it with a test #7885

Closed
wants to merge 6 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
145 changes: 135 additions & 10 deletions datafusion/physical-expr/src/intervals/interval_aritmetic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,23 @@ impl Display for IntervalBound {
/// called *unbounded* endpoint and represented using a `NULL` and written using
/// `∞`.
///
/// # Boolean Handling
///
/// Boolean values require special handling. Boolean Intervals NEVER have open
/// bounds. If you try to and construct one with open bounds,
/// [`Interval::new`] will remap them to one of the three valid values.
///
/// Given there are only two boolean values, and they are ordered such that
/// `false` is less than `true`, there are only three possible valid intervals
/// for a boolean:
///
/// 1. `[false, false]`: definitely `false`
/// 2. `[false, true]`: either `false` or `true`
/// 3. `[true, true]`: definitely `true`
///
/// [`Interval::new`] takes this into account and adjusts any arguments as
/// needed.
///
/// # Examples
///
/// A `Int64` `Interval` of `[10, 20)` represents the values `10, 11, ... 18,
Expand Down Expand Up @@ -235,18 +252,16 @@ impl Display for Interval {
impl Interval {
/// Creates a new interval object using the given bounds.
///
/// # Boolean intervals need special handling
/// As explained in [`Interval]` boolean `Interval`s are special and this
Copy link
Contributor Author

@alamb alamb Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there are only three valid boolean intervals, and Interval::new() normalizes any provided into into one of those, it might make sense to make the fields of Interval non pub so that it is not possible to construct an invalid Interval

What do you think @metesynnada / @berkaysynnada / @ozankabak ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, that's exactly what I'm working on. I have planned to set the fields to private and create intervals with only try_new. If an interval can be created with the given parameters, it creates that interval. If the parameters cannot construct a valid interval (like lower: [true - upper: false] for a boolean interval), it returns an error.

/// function ensures they are one of the three valid values:
///
/// For boolean intervals, having an open false lower bound is equivalent to
/// having a true closed lower bound. Similarly, open true upper bound is
/// equivalent to having a false closed upper bound. Also for boolean
/// intervals, having an unbounded left endpoint is equivalent to having a
/// false closed lower bound, while having an unbounded right endpoint is
/// equivalent to having a true closed upper bound. Therefore; input
/// parameters to construct an Interval can have different types, but they
/// all result in `[false, false]`, `[false, true]` or `[true, true]`.
/// 1. An open `false` lower bound is mapped to a `true` closed lower bound.
/// 2. An open `true` upper bound is mapped to `false` closed
/// upper bound.
/// 3. An unbounded lower endpoints is mapped to a `false` closed lower
/// bound
/// 4. An unbounded upper endpoint is mapped to a `true` closed upper bound.
pub fn new(lower: IntervalBound, upper: IntervalBound) -> Interval {
// Boolean intervals need a special handling.
if let ScalarValue::Boolean(_) = lower.value {
let standardized_lower = match lower.value {
ScalarValue::Boolean(None) if lower.open => {
Expand Down Expand Up @@ -1079,6 +1094,116 @@ mod tests {
Interval::make(lower, upper, (false, false))
}

#[test]
fn boolean_interval_test() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Demonstrate / document how boolean Intervals work
#[derive(Debug)]
struct TestCase {
lower: bool,
upper: bool,
/// expected bounds when lower is open, upper is open
expected_open_open: (bool, bool),
/// expected bounds when lower is open, upper is closed
expected_open_closed: (bool, bool),
/// expected bounds when lower is closed, upper is open
expected_closed_open: (bool, bool),
// Not: closed/closed is the same as lower/upper
}

let cases = vec![
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was really confused about what the expected Intervals were, so I added test cases. it would be great if someone could double check these

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they look good according to the rules added above. But unbounded rules seem not tested?

Also an open false upper bound and an open true lower bound are not in the rules. From the test cases, looks like they are not mapped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But unbounded rules seem not tested?

This is a good point, I will add tests for them

TestCase {
lower: false,
upper: false,
expected_open_open: (true, false), // whole range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An open false lower bound is mapped according to rule 1.

An open false upper bound is not mapped, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is what the tests show, and I did not change the behavior. However, I don't know if this is correct or not. Perhaps @berkaysynnada / @metesynnada have some insight

Copy link
Contributor

@berkaysynnada berkaysynnada Oct 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second output (expected_open_closed) of this TestCase is (false, false] is the same as [true, false] which is an invalid interval. As I mentioned there, I think these cases should give an error. Maybe we can wait until I propose the PR which I am working on and planning to submit in a few days. It will explain these initialization issues neatly.

expected_open_closed: (true, false),
expected_closed_open: (false, false),
},
TestCase {
lower: false,
upper: true,
expected_open_open: (true, false), // whole range
expected_open_closed: (true, true),
expected_closed_open: (false, false),
},
TestCase {
lower: true,
upper: false,
expected_open_open: (true, false), // whole range
expected_open_closed: (true, false),
expected_closed_open: (true, false),
},
TestCase {
lower: true,
upper: true,
expected_open_open: (true, false), // whole range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An open true upper bound is mapped according to rule 2.

An open true lower bound is not mapped, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact when I look at this test now it doesn't make sense as (true, false) isn't a valid Interval according to my understanding -- the interval should be (false, true) to represent the entire range 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lower bound cannot be a true - open, and vice versa an upper bound cannot be a false - open. A boolean interval representing the entire range only can be [false, true].

Copy link
Contributor Author

@alamb alamb Oct 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada -- this was my understanding as well. So given that this test passes with the current implementation, that suggestes to me it is a bugwhat do you suggest we do?

  1. I wait for your refactoring PR
  2. Fix the code to not create invalid intervals?
  3. Something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will be clarifying these issues and introducing a solid API in my PR. Is it possible to keep you waiting for a while, because they will all cause conflict for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll handle any conflicts after your new PR lands -- no worries!

expected_open_closed: (true, true),
expected_closed_open: (true, false),
},
];

// Asserts that the given interval has the expected (closed) bounds
fn assert_bool_interval(
name: &str,
interval: Interval,
expected_lower: bool,
expected_upper: bool,
) {
let open = false;
let expected_interval = Interval {
// boolean bounds are always closed
lower: IntervalBound {
value: ScalarValue::Boolean(Some(expected_lower)),
open,
},
upper: IntervalBound {
value: ScalarValue::Boolean(Some(expected_upper)),
open,
},
};
assert_eq!(
interval, expected_interval,
"{name} failure\n\nActual: {interval:#?}\n\nExpected: {expected_interval:#?}"
);
}

for case in cases {
let TestCase {
lower,
upper,
expected_open_open,
expected_open_closed,
expected_closed_open,
} = case;

assert_bool_interval(
"open_open",
open_open(Some(lower), Some(upper)),
expected_open_open.0,
expected_open_open.1,
);
assert_bool_interval(
"open_closed",
open_closed(Some(lower), Some(upper)),
expected_open_closed.0,
expected_open_closed.1,
);
assert_bool_interval(
"closed_open",
closed_open(Some(lower), Some(upper)),
expected_closed_open.0,
expected_closed_open.1,
);
assert_bool_interval(
"closed_closed",
closed_closed(Some(lower), Some(upper)),
lower,
upper,
);
}

Ok(())
}

#[test]
fn intersect_test() -> Result<()> {
let possible_cases = vec![
Expand Down