Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Apr 4, 2023

Which issue does this PR close?

re #5772

Rationale for this change

As I mentioned in #5772 (review) , I found the cognitive load of dealing with Vec<Option<Vec<PhysicalSortRequirements>>>> challenging and the fact that PhysicalSortRequirements had another Option inside it was also challenging when reading the code .

Also, it appears that the make_sort_requirements_from_exprs function was not public outside of the datafusion-physical-expr crate (so I could not used it while updating IOx -- see https://github.com/influxdata/influxdb_iox/pull/7440). It isn't hard to rewrite but I wanted to help others potentially upgrading by adding more comments and hints of how to create this type of structure.

I also think having additional documentation (and method names that are self documenting) adding methods makes the intent clearer when reading.

What changes are included in this PR?

  1. Make fields on PhysicalSortRequirement private
  2. Add doc strings and examples of what the different options are, and the relevant constructors
  3. Add From impls

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

@alamb alamb marked this pull request as draft April 4, 2023 13:27
@github-actions github-actions bot added core Core DataFusion crate physical-expr Changes to the physical-expr crates labels Apr 4, 2023
options,
}
} else {
PhysicalSortExpr {
Copy link
Contributor Author

@alamb alamb Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved into PhysicalSortExpr::into_sort_expr() where I think it is easier to find and understand (turns out converting back / forth from PhysicalSortExpr and PhysicalSortRequirement is common


fn required_input_ordering(&self) -> Vec<Option<Vec<PhysicalSortRequirement>>> {
vec![Some(make_sort_requirements_from_exprs(&self.expr))]
vec![Some(PhysicalSortRequirement::exact_from_exprs(&self.expr))]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am imagining that when a user upgrades DataFusion they will and see that the type signature changed from PhysicalSortExpr to PhysicalSortRequirement. Then when they look at PhysicalSortRequirement they can find a function to convert the existing PhysicalSortExpr to PhysicalSortRequirement -- PhysicalSortRequirement::exact_from_exprs

make_sort_requirements_from_exprs was both not documented as well as lost in the list of functions

@ozankabak
Copy link
Contributor

ozankabak commented Apr 4, 2023

This LGTM. However, let's discuss timing. We have a type alias PR that is waiting for #5661. IMO we should do this after the alias PR as de-conflicting that may be much harder than de-conflicting this.

@alamb
Copy link
Contributor Author

alamb commented Apr 4, 2023

IMO we should do this after the alias PR as de-conflicting that may be much harder than de-conflicting this.

I agree -- this one is a nice to have and has no particular urgency. Lets get your PRs merged first

@alamb alamb force-pushed the alamb/easier_physical_expr branch from 9f111b5 to 0602ae4 Compare April 4, 2023 14:49
@ozankabak
Copy link
Contributor

The alias PR is #5867. I think apart from a simple name change of make_sort_requirements_from_exprs, it should be relatively easy to merge this with that. If we get these two merged up, the state of affairs on this should improve somewhat

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor comments but LGTM in general

/// Option to specify how the given column should be sorted.
/// If unspecified, there is no constraint on sort options.
pub options: Option<SortOptions>,
/// If unspecified, there are constraints on sort options.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems incorrect, did you mean to start with "If specified, ..."?

Copy link
Contributor Author

@alamb alamb Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. I will fix it (restore to the original)

Update: fixed in a5cc4c2

Comment on lines 150 to 168
/// Creates a new `exact` requirement, which must match the
/// required options and expression. See
/// [`PhysicalSortRequirement`] for examples.
pub fn new_exact(expr: Arc<dyn PhysicalExpr>, options: SortOptions) -> Self {
Self {
expr,
options: Some(options),
}
}

/// Creates a new `expr_only` requirement, which must match the
/// required expression. See [`PhysicalSortRequirement`] for
/// examples.
pub fn new_expr_only(expr: Arc<dyn PhysicalExpr>) -> Self {
Self {
expr,
options: None,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have a single new and the caller can pass the second argument as None if they don't want to constrain sort options. I get the sense that you have two news because you think it is more readable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - that is precisely the reason.

For example, I thought it was clearer that the goal was to add a relaxed sort requirement in datafusion/core/src/physical_plan/windows/mod.rs with:

 for partition_by in partition_by_exprs {
        sort_reqs.push(PhysicalSortRequirement::new_expr_only(partition_by.clone()))
    }

Rather than

 for partition_by in partition_by_exprs {
       sort_reqs.push(PhysicalSortRequirement {
            expr: partition_by.clone(),
            options: None,
        });
    }

Of course they both do the same thing eventually and I agree it is a preference thing about what makes for more readable code

Copy link
Contributor

@ozankabak ozankabak Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative I had in mind was

for partition_by in partition_by_exprs {
    sort_reqs.push(PhysicalSortRequirement::new(partition_by.clone(), None));
    // sort_reqs.push(PhysicalSortRequirement::new(partition_by.clone(), Some(sort_options)));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make that change if you prefer it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 8a073c4

pub fn into_sort_expr(self) -> PhysicalSortExpr {
let Self { expr, options } = self;

let options = options.unwrap_or(SortOptions {
Copy link
Contributor

@ozankabak ozankabak Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest unwrap_or_else to explicitly avoid constructing a potentially unused object. The compiler will likely take care of this, but I find making intent visible in code structure is good practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed -- I agree with you. I originally had unwrap_or_else for precisely the reasons you mentioned, but clippy told me to use unwrap_or instead 😭

error: unnecessary closure used to substitute value for `Option::None`
   --> datafusion/physical-expr/src/sort_expr.rs:185:23
    |
185 |           let options = options.unwrap_or_else(|| SortOptions {
    |  _______________________^
186 | |             descending: false,
187 | |             nulls_first: false,
188 | |         });
    | |__________^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_lazy_evaluations
    = note: `-D clippy::unnecessary-lazy-evaluations` implied by `-D warnings`
help: use `unwrap_or(..)` instead
    |
185 ~         let options = options.unwrap_or(SortOptions {
186 +             descending: false,
187 +             nulls_first: false,
188 ~         });
    |

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review @ozankabak

Not sure if either @mustafasrepo or @mingmwang are interested in this PR either, given they have been working in this area

Comment on lines +213 to +215
pub fn to_sort_exprs(
requirements: impl IntoIterator<Item = PhysicalSortRequirement>,
) -> Vec<PhysicalSortExpr> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use

pub fn to_sort_exprs<'a>(
        requirements: impl IntoIterator<Item = &'a PhysicalSortRequirement>,
    ) -> Vec<PhysicalSortExpr> 

as signature for to_sort_exprs as in from_sort_exprs. This would help remove a couple clones in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My rationale for taking PhysicalSortRequirement rather than &PhysicalSortRequirement was actually to avoid a clone in the cases where the caller already had an owned PhysicalSortRequirement. Basically the clone() is moved from inside this function to the callsite

Specifically in datafusion/core/src/physical_optimizer/sort_enforcement.rs

                    // Make sure we preserve the ordering requirements:
                    update_child_to_remove_unnecessary_sort(child, sort_onwards, &plan)?;
                    let sort_expr =
                        PhysicalSortRequirement::to_sort_exprs(required_ordering);
                    add_sort_above(child, sort_expr)?;
                    if is_sort(child) {

This signature can avoid a clone.

If the signature looks like

pub fn to_sort_exprs<'a>(
        requirements: impl IntoIterator<Item = &'a PhysicalSortRequirement>,
    ) -> Vec<PhysicalSortExpr> 

I think it would require always clone()ing the expr and options internally to create an owned PhysicalSortExpr in the output

Of course, we are talking about cloning an Arc and some SortOptions which presumably isn't all that expensive , so maybe it doesn't matter 🤔

Copy link
Contributor

@ozankabak ozankabak Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling there may be a way to do this with a Borrow trick to implement a single function that can deal with both owned and referenced arguments without duplicating code 🤔

Copy link
Contributor

@ozankabak ozankabak Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb, it seems like using

pub fn to_sort_exprs<T: Borrow<PhysicalSortRequirement>>(
    requirements: impl IntoIterator<Item = T>,
) -> Vec<PhysicalSortExpr> {
    requirements
        .into_iter()
        .map(|e| PhysicalSortExpr::from(e.borrow()))
        .collect()
}

pub fn from_sort_exprs<T: Borrow<PhysicalSortExpr>>(
    ordering: impl IntoIterator<Item = T>,
) -> Vec<PhysicalSortRequirement> {
    ordering
        .into_iter()
        .map(|e| PhysicalSortRequirement::from(e.borrow()))
        .collect()
}

lets us get rid of a bunch of clones and results in simpler call site code. AFAIK, this is a zero-cost mechanism. If you concur, we can use this pattern.

I created #11 on your repo so you can easily check it out and incorporate (assuming I'm not overlooking something and this is indeed zero cost obviously).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my point was that I think this call:

PhysicalSortRequirement::from(e.borrow()

Invokes this From impl:

impl From<&PhysicalSortRequirement> for PhysicalSortExpr {
    fn from(value: &PhysicalSortRequirement) -> Self {
        value.clone().into_sort_expr()
    }
}

Which has a clone of the value

I have removed the From<&PhysicalSortRequirement> traits in 12cecdb because it is obscuring the locations of the cloning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I thought you wanted to make to_sort_exprs "smart" in the sense that it can take work with an owned value or a reference, using the owned or reference From impl depending on the argument -- making a clone only when necessary.

Anyway I got curious and tried to verify whether the borrow mechanism was zero-cost for this purpose; it turns out it is not. It always calls the reference path. I also found how to do this -- here is an example for future reference.

For posterity, had we wanted to go this route, the smart to_sort_exprs would be:

pub fn to_sort_exprs<T>(
    requirements: impl IntoIterator<Item = T>,
) -> Vec<PhysicalSortExpr>
where
    PhysicalSortExpr: From<T>,
{
    requirements
        .into_iter()
        .map(|e| PhysicalSortExpr::from(e))
        .collect()
}

which would route to the right PhysicalSortExpr::from depending on the argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool -- I did not know that. If you think that is a worthwhile change, I would be happy to make a follow on PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be a good change, I can already see some use cases for it in some upcoming code (e.g. streaming group bys).

I updated the mini-PR I created for experimentation so you can get the changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- I tried to pull the commits into #5918 but they still seem to be trying to use this impl:

impl From<&PhysicalSortRequirement> for PhysicalSortExpr {

I may have messed up the merge somehow

Let's keep collaborating on #5918

Comment on lines 83 to 91
/// If the requirement is *exact*
/// ([`PhysicalSortRequirement::new_exact`]), then the sort
/// requirement will only be satisfied if it matches both the
/// expression *and* the sort options.
///
/// If the requirement is *`expr_only`
/// ([`PhysicalSortRequirement::new_expr_only`]) then only the expr
/// must match, to satisfy the requirement.
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we now use PhysicalSortRequirement::new for initialization for both cases. These comments seem remainder for old API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 that is a good spot -- fixed in 1db636a

@mustafasrepo
Copy link
Contributor

I have made some suggestions. Besides these suggestions, LGTM!. Thanks @alamb.

@alamb alamb changed the title Encapsulate PhysicalSortRequrement and add more doc comment Encapsulate PhysicalSortRequrement and add more doc comments Apr 7, 2023
@alamb alamb mentioned this pull request Apr 7, 2023
@alamb alamb merged commit 10b9352 into apache:main Apr 7, 2023
@alamb
Copy link
Contributor Author

alamb commented Apr 7, 2023

Thank you for the reviews and comments @mustafasrepo and @ozankabak

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants