Skip to content

Predicate join operation #434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 9, 2023
Merged

Predicate join operation #434

merged 10 commits into from
Aug 9, 2023

Conversation

koperagen
Copy link
Collaborator

Main purpose: joining data on operations other than equals

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Aug 1, 2023

I see you also did some stuff with HTML but I don't really understand what you did there. It seems to have added some extra spacing to each snippet file. Not too important but relevant enough to mention

@@ -10,7 +10,9 @@ public fun <A, B> DataFrame<A>.join(
other: DataFrame<B>,
type: JoinType = JoinType.Inner,
selector: JoinColumnsSelector<A, B>? = null
): DataFrame<A> = joinImpl(other, type, true, selector)
): DataFrame<A> {
return joinImpl(other, type, addNewColumns = type.addNewColumns, selector)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably provide name for each argument in the function, it looks odd like this :). Also, I'm fine with it being an expression function if it's just a one line thing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it look odd too? plus, it's not even me who wrote this initially :D

joinImpl(
  other = other, 
  type = type, 
  addNewColumns = type.addNewColumns,
  selector = selector
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no not at all XD I write functions always like that. IMO positional arguments are a nightmare for refactoring and I think originally they were only allowed áfter the positional ones so mixing them looks extra odd to me.

type: JoinType = JoinType.Inner,
joinExpression: JoinExpression<A, B>
): DataFrame<A> {
return predicateJoinImpl(right, type, addNewColumns = type.addNewColumns, joinExpression)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as in join, I'd make it an expression function and provide named arguments for all

public fun <A, B> DataFrame<A>.filterPredicateJoin(
right: DataFrame<B>,
joinExpression: JoinExpression<A, B>
): DataFrame<A> = predicateJoinImpl(right, JoinType.Inner, addNewColumns = false, joinExpression)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argument names :)

public fun <A, B> DataFrame<A>.predicateJoin(
right: DataFrame<B>,
type: JoinType = JoinType.Inner,
joinExpression: JoinExpression<A, B>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing commas everywhere please :)

override val right: DataRow<B> = DataRowImpl(index1, rightOwner)
}

internal fun <A, B> DataFrame<A>.predicateJoinImpl(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments would be nice XD, bit hard to check it like this


Joins two [`DataFrames`](DataFrame.md) by a join expression.

```kotlin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code block contains non-kotlin code, so the highlighting goes wrong

Copy link
Collaborator Author

@koperagen koperagen Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these blocks use pseudocode intentionally everywhere we have them. So should be ok to use here as well. As for highlighting, with kotlin at least valid parts are highlighted. Like interface :) There are no errors on the website that you see in the preview, if you're talking about it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also split it up in multiple blocks if you want pseudocode :)
Or do like I usually do:

// some pseudo code () -> Something
fun kotlinCode() {}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think that will work, you can see that even inside kotlin code there are symbols like [ ] for optional parts or | for enumeration, etc

}
```

Unlike [join](join.md) similar to one from relational databases, where relations between tables are based on IDs and usually equality between values, predicate join allows any operations that returns Boolean value. That gives you the ability to join data based on various logical operations and heuristics. This can also be useful when dealing with data from disparate sources that are connected, but lack common identifiers or structured relations typical of SQL tables with indices.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fold the line so I don't have to scroll horizontally to read it, haha

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this sentence should probably be broken up into multiple and have some extra commas for readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i rewrote it completely in the end

Copy link
Collaborator

@Jolanrensen Jolanrensen Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better :) Just some small notes above

@@ -94,6 +94,7 @@ df.join(other)

Supported join types:
* `Inner` (default) — only matched rows from left and right [`DataFrames`](DataFrame.md)
* `Filter` — only matched rows from left [`DataFrame`](DataFrame.md)
Copy link
Collaborator

@Jolanrensen Jolanrensen Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably add filterJoin below to the joinSpecials as well

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Aug 2, 2023

the name predicateJoin might actually be a bit long and cumbersome, especially if expanded like innerPredicateJoin. Maybe it could be shortened like joinWhere or even better, be combined with join:

campaigns.innerJoin(visits).where {
    right.date in startDate..endDate
}

or in the join dsl:

campaigns.innerJoin(visits) {
    where { right.date in startDate..endDate }
    // or
    withPredicate {}
}

@koperagen
Copy link
Collaborator Author

the name predicateJoin might actually be a bit long and cumbersome, especially if expanded like innerPredicateJoin. Maybe it could be shortened like joinWhere or even better, be combined with join:

campaigns.innerJoin(visits).where {
    right.date in startDate..endDate
}

or in the join dsl:

campaigns.innerJoin(visits) {
    where { right.date in startDate..endDate }
    // or
    withPredicate {}
}

I agree that predicateInnerJoin is too long. I'd prefer to have original join renamed to joinBy and have this to be named join. Another option i can suggest is joinWith. Combining this operation with column join is a stretch in my opinion. Columnar join has default join statement (columns with the same names from both dataframes). So, those 2 samples will have very different meaning

 campaigns.innerJoin(visits).where {
     right.date in startDate..endDate
 }
 campaigns.innerJoin(visits)

@koperagen
Copy link
Collaborator Author

I see you also did some stuff with HTML but I don't really understand what you did there. It seems to have added some extra spacing to each snippet file. Not too important but relevant enough to mention

Yes, spacing changed because it changed in Kotlin sources. trimMargin didn't work because String inserted with interpolation doesn't have any spacing, so common spacing is 0. Maybe i'll find another way to build HTML files

}
```

Unlike [join](join.md) similar to one from relational databases, where relations between tables are based on IDs and usually equality between values, predicate join allows any operations that returns Boolean value. That gives you the ability to join data based on various logical operations and heuristics. This can also be useful when dealing with data from disparate sources that are connected, but lack common identifiers or structured relations typical of SQL tables with indices.
Copy link
Collaborator

@Jolanrensen Jolanrensen Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better :) Just some small notes above

@koperagen koperagen merged commit 30c4b09 into master Aug 9, 2023
@koperagen
Copy link
Collaborator Author

koperagen commented Aug 9, 2023

@Jolanrensen Thank you!

@Jolanrensen Jolanrensen deleted the predicate-join branch August 9, 2023 12:49
@zaleslaw zaleslaw added this to the 0.12.0 milestone Oct 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants