Skip to content

Add Compiler Plugin support for statistics on GroupBy #1077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Mar 11, 2025
Merged

Conversation

zaleslaw
Copy link
Collaborator

@zaleslaw zaleslaw commented Feb 25, 2025

PR contains:

  1. test scenarious for sum*/min/max/std/mean/median family on GroupBy
  2. ported test scenarious for Compiler Plugin
  3. Changes in Compiler Plugin

Tests for mean/std with results in Double doesn't work well (regarding incorrect resulting types), but please review the logic and keep test muted for a while

Closes #1061 #1074

Introduces the `GroupBySumOf` interpreter for aggregation, enabling the calculation of column sums with customizable expressions and result names in grouped DataFrames. Adds tests and updates APIs to support and validate this feature.
@zaleslaw zaleslaw requested a review from koperagen February 25, 2025 19:47
Updated statistical aggregation functions for GroupBy with comments addressing open questions. Added comprehensive tests to verify behavior across various statistics (sum, mean, median, std, min, and max), replacing older test cases for cleaner coverage.
@zaleslaw zaleslaw changed the title Add Compiler Plugin support for sum function Add Compiler Plugin support for statistics on GroupBy Feb 27, 2025
@zaleslaw zaleslaw marked this pull request as ready for review February 27, 2025 17:59
zaleslaw added 2 commits March 3, 2025 17:32
# Conflicts:
#	plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/loadInterpreter.kt
val resolvedColumns = receiver.groups.columns()
.filter {
it is SimpleDataColumn
&& it.type.type.isSubtypeOfComparable(session)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koperagen here, I am trying to check if types if comparable to emulate the interComparable() internal function behaviour for median/min/max functions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean this one?

public fun AnyCol.valuesAreComparable(): Boolean =

It's indeed a tad trickier, because Comparable has a contravariant type argument. They way it's checked at runtime is by first checking whether the type of the column is Comparable, then we get the common Comparable type (this is the tricky part). If the result is Comparable<Nothing> or Comparable<*> we know the values are not comparable to each other.

This works because Comparable<in Pet> + Comparable<in Dog> -> Comparable<in Dog>, but Comparable<in Dog> + Comparable<in Cat> -> Comparable<in Nothing> (aka, you can compare dogs with other pets as long as they're also dogs, but you cannot compare dogs and cats).

So, you'd need to check whether you can do the same with the types at the compiler plugin level, or if you can get a KType, extract the logic to KTypes from :core :).

Copy link
Collaborator

@koperagen koperagen Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valuesAreComparable should be impemented like this, to follow public fun <T : Comparable<T>> Iterable<T>.maxOrNull(): T? signature.

public fun AnyCol.valuesAreComparable(): Boolean {
    val other = Comparable::class.createType(listOf(KTypeProjection(KVariance.IN, type)))
    return type.isSubtypeOf(other)
}

Now it's possible to break it with this code:

val i = 1
    val i1: Comparable<Int> = object : Comparable<Int> {
        override fun compareTo(other: Int): Int {
            return other
        }
    }
    val col by columnOf(i, i1)
    dataFrameOf(col).max().print()

Because indeed, Comparable<Int> does not implement Comparable<Comparable<Int>>
It's not possible to call listOf(i, i1).maxOrNull()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for plugin:

fun isInterComparable(col: SimpleDataColumn): Boolean {
      val comparable = StandardClassIds.Comparable.constructClassLikeType(arrayOf(col.type.type))
      return col.type.type.isSubtypeOf(comparable, session)
  }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koperagen I created a branch with this suggested fix, but it doesn't produce the expected results either https://github.com/Kotlin/dataframe/tree/values-are-comparable-fix

Copy link
Collaborator

@Jolanrensen Jolanrensen Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koperagen actually, looking better at it I think you're right and your fix is too (rewritten to this to account for nullability:

public fun AnyCol.valuesAreComparable(): Boolean =
    isSubtypeOf(
        Comparable::class.createType(
            arguments = listOf(KTypeProjection(KVariance.IN, type.withNullability(false))),
            nullable = hasNulls(),
        ),
    )

)
There's some tests that need to be rewritten.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, this should be the plugin version then:

fun isIntraComparable(col: SimpleDataColumn): Boolean {
    val comparable = StandardClassIds.Comparable.constructClassLikeType(
        typeArguments = arrayOf(col.type.type.withNullability(ConeNullability.NOT_NULL, session.typeContext)),
        isNullable = col.type.type.isNullable,
    )
    return col.type.type.isSubtypeOf(comparable, session)
}

(didn't test this yet)

@zaleslaw
Copy link
Collaborator Author

zaleslaw commented Mar 5, 2025

@Jolanrensen @koperagen please have a look at the supported scenarious, I want to merge it before #1078 changes

val Arguments.columns: ColumnsResolver? by arg()

override fun Arguments.interpret(): PluginDataFrameSchema {
if (name == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a test case here:
personsDf.groupBy { city }.mean { age and yearsToRetirement }.compareSchemas()

Runtime:
city: String
mean: Double

Compile:
age: Double
yearsToRetirement: Double
city: String

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we agreed that support of multiple columns is under the question?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, at least until after the statistics types are set in stone. That said, for mean, the result will always be Double :)

@@ -98,9 +100,12 @@ public inline fun <T, reified D : Number> DataFrame<T>.meanOf(
// endregion

// region GroupBy

@Refine
@Interpretable("GroupByMean1")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it GroupByMeanFor* to distinguish from GroupBy.mean

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, don't understand, GroupByMeanFor will be under mean/meanFor and GroupByMean0 will be under meanFor at that way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest GroupByMeanDefault for GroupBy.mean(), GroupByMeanFor for GroupBy.meanFor and GroupByMean for mean

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, agree, will do some renaming/refactoring in next PR in that area

zaleslaw and others added 3 commits March 10, 2025 12:38
Replaced direct subtype checks with `isIntraComparable` to improve type safety when resolving columns. Updated documentation syntax for better consistency and clarity. Added schema comparison in test to validate grouping behavior.
Revised `GroupBy` aggregation logic by restructuring classes, improving naming consistency, and refining comments/documentation. Updated test cases to address initializer type mismatches and better handle scenarios involving multiple columns. Added relevant TODOs for unresolved cases linked to issue #1090.
@zaleslaw zaleslaw requested a review from koperagen March 10, 2025 18:30
@zaleslaw zaleslaw merged commit b60d6fc into master Mar 11, 2025
4 of 6 checks passed
@Jolanrensen
Copy link
Collaborator

Merged a failing build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add compiler plugin support for cumSum
3 participants