-
Notifications
You must be signed in to change notification settings - Fork 73
Add Compiler Plugin support for statistics on GroupBy #1077
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Introduces the `GroupBySumOf` interpreter for aggregation, enabling the calculation of column sums with customizable expressions and result names in grouped DataFrames. Adds tests and updates APIs to support and validate this feature.
Updated statistical aggregation functions for GroupBy with comments addressing open questions. Added comprehensive tests to verify behavior across various statistics (sum, mean, median, std, min, and max), replacing older test cases for cleaner coverage.
# Conflicts: # plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/loadInterpreter.kt
val resolvedColumns = receiver.groups.columns() | ||
.filter { | ||
it is SimpleDataColumn | ||
&& it.type.type.isSubtypeOfComparable(session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@koperagen here, I am trying to check if types if comparable to emulate the interComparable() internal function behaviour for median/min/max functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean this one?
dataframe/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/DataColumnType.kt
Line 74 in d8c6c4b
public fun AnyCol.valuesAreComparable(): Boolean = |
It's indeed a tad trickier, because Comparable
has a contravariant type argument. They way it's checked at runtime is by first checking whether the type of the column is Comparable
, then we get the common Comparable
type (this is the tricky part). If the result is Comparable<Nothing>
or Comparable<*>
we know the values are not comparable to each other.
This works because Comparable<in Pet> + Comparable<in Dog> -> Comparable<in Dog>
, but Comparable<in Dog> + Comparable<in Cat> -> Comparable<in Nothing>
(aka, you can compare dogs with other pets as long as they're also dogs, but you cannot compare dogs and cats).
So, you'd need to check whether you can do the same with the types at the compiler plugin level, or if you can get a KType, extract the logic to KTypes from :core :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valuesAreComparable should be impemented like this, to follow public fun <T : Comparable<T>> Iterable<T>.maxOrNull(): T?
signature.
public fun AnyCol.valuesAreComparable(): Boolean {
val other = Comparable::class.createType(listOf(KTypeProjection(KVariance.IN, type)))
return type.isSubtypeOf(other)
}
Now it's possible to break it with this code:
val i = 1
val i1: Comparable<Int> = object : Comparable<Int> {
override fun compareTo(other: Int): Int {
return other
}
}
val col by columnOf(i, i1)
dataFrameOf(col).max().print()
Because indeed, Comparable<Int>
does not implement Comparable<Comparable<Int>>
It's not possible to call listOf(i, i1).maxOrNull()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for plugin:
fun isInterComparable(col: SimpleDataColumn): Boolean {
val comparable = StandardClassIds.Comparable.constructClassLikeType(arrayOf(col.type.type))
return col.type.type.isSubtypeOf(comparable, session)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@koperagen I created a branch with this suggested fix, but it doesn't produce the expected results either https://github.com/Kotlin/dataframe/tree/values-are-comparable-fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@koperagen actually, looking better at it I think you're right and your fix is too (rewritten to this to account for nullability:
public fun AnyCol.valuesAreComparable(): Boolean =
isSubtypeOf(
Comparable::class.createType(
arguments = listOf(KTypeProjection(KVariance.IN, type.withNullability(false))),
nullable = hasNulls(),
),
)
)
There's some tests that need to be rewritten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, this should be the plugin version then:
fun isIntraComparable(col: SimpleDataColumn): Boolean {
val comparable = StandardClassIds.Comparable.constructClassLikeType(
typeArguments = arrayOf(col.type.type.withNullability(ConeNullability.NOT_NULL, session.typeContext)),
isNullable = col.type.type.isNullable,
)
return col.type.type.isSubtypeOf(comparable, session)
}
(didn't test this yet)
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/groupBy.kt
Outdated
Show resolved
Hide resolved
@Jolanrensen @koperagen please have a look at the supported scenarious, I want to merge it before #1078 changes |
val Arguments.columns: ColumnsResolver? by arg() | ||
|
||
override fun Arguments.interpret(): PluginDataFrameSchema { | ||
if (name == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a test case here:
personsDf.groupBy { city }.mean { age and yearsToRetirement }.compareSchemas()
Runtime:
city: String
mean: Double
Compile:
age: Double
yearsToRetirement: Double
city: String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we agreed that support of multiple columns is under the question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, at least until after the statistics types are set in stone. That said, for mean
, the result will always be Double
:)
@@ -98,9 +100,12 @@ public inline fun <T, reified D : Number> DataFrame<T>.meanOf( | |||
// endregion | |||
|
|||
// region GroupBy | |||
|
|||
@Refine | |||
@Interpretable("GroupByMean1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call it GroupByMeanFor*
to distinguish from GroupBy.mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, don't understand, GroupByMeanFor will be under mean/meanFor and GroupByMean0 will be under meanFor at that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest GroupByMeanDefault
for GroupBy.mean()
, GroupByMeanFor
for GroupBy.meanFor
and GroupByMean
for mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, agree, will do some renaming/refactoring in next PR in that area
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/groupBy.kt
Outdated
Show resolved
Hide resolved
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/groupBy.kt
Outdated
Show resolved
Hide resolved
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/groupBy.kt
Outdated
Show resolved
Hide resolved
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/groupBy.kt
Outdated
Show resolved
Hide resolved
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/groupBy.kt
Outdated
Show resolved
Hide resolved
Replaced direct subtype checks with `isIntraComparable` to improve type safety when resolving columns. Updated documentation syntax for better consistency and clarity. Added schema comparison in test to validate grouping behavior.
Revised `GroupBy` aggregation logic by restructuring classes, improving naming consistency, and refining comments/documentation. Updated test cases to address initializer type mismatches and better handle scenarios involving multiple columns. Added relevant TODOs for unresolved cases linked to issue #1090.
Merged a failing build |
PR contains:
Tests for mean/std with results in Double doesn't work well (regarding incorrect resulting types), but please review the logic and keep test muted for a while
Closes #1061 #1074