C (compression) and M (maxDiscrete) parameters return the same results #16

Open
@velwu

Greetings, Mr. Erlandson (@erikerlandson),

The team I represent recently started using this solution, and we have been trying to tune accuracy and performance by varying the C (compression) and M (maxDiscrete) parameters.

In our case, we have a DataFrame called "originDf" with 3 columns ("id": String, "fruit_type": String, "count": Integer):

+---+------------+-----+
|id |fruit_type  |count|
+---+------------+-----+
|001|Apples      |2    |
|002|Apricots    |79   |
|001|Avocados    |4    |
|003|Watermelon  |13   |
|007|Blueberries |5    |
|007|Cherries    |6    |
|007|Clementine  |41   |
|007|Cucumbers   |5    |
|007|Elderberry  |3    |
|007|Eggfruit    |1    |
|008|Eggfruit    |19   |
|012|Clementine  |61   |
|013|Blueberries |21   |
|014|Blueberries |4    |
|...|Lime        |4    |
|...|Rambutan    |3    |
|...|Strawberries|6    |
|...|Watermelon  |5    |
|...|Tangerine   |3    |
|...|Tangerine   |6    |
+---+------------+-----+
This dataframe has roughly 0.2 billion rows in total.

We then did the following:

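import org.apache.spark.sql.functions.col
import org.isarnproject.sketches.java.TDigest

val t_digest_udf1 = TDigestAggregator.udf[Double](compression = C, maxDiscrete = M)
val groupDf1 = originDf.groupBy(col("fruit_type")).agg(t_digest_udf1(col("count")) as "t_digests")
val udf1 = groupDf1.first()  // first() returns only the first row, i.e. one fruit_type group
val t1 = udf1.getAs[TDigest]("t_digests")  // read the digest from the aggregated column by name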

where (C, M) was set in turn to (100, 100), (0.01, 100), and (100, 10000). The single TDigest obtained from each setting is shown below as t1, t2, and t3, respectively:

t1: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t2: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t3: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
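For reading these printouts: each entry appears to have the form center -> (mass, cumulative mass), since the second number in each pair is the running total of the first (for example, 22240 + 6509 = 28749). Every center shown is also an exact integer value of "count". The following is a minimal sketch in plain Scala, with no Spark or isarn dependency, of the exact table one would get if every distinct value kept its own entry. The object name `ExactTable`, the helper `exactTable`, and the toy sample are hypothetical and are not part of the library or of originDf.

```scala
// Plain Scala sketch: builds an exact value -> (count, cumulative count) table.
// ExactTable and exactTable are hypothetical names, not part of isarn-sketches.
object ExactTable {
  // Returns one entry per distinct value, sorted by value in ascending order.
  def exactTable(xs: Seq[Double]): Seq[(Double, (Double, Double))] = {
    val counts = xs
      .groupBy(identity)
      .map { case (v, group) => (v, group.size.toDouble) }
      .toSeq
      .sortBy(_._1)
    // Running total of the counts, aligned with the sorted values.
    val cumulative = counts.map(_._2).scanLeft(0.0)(_ + _).tail
    counts.zip(cumulative).map { case ((v, c), cum) => (v, (c, cum)) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy sample, not taken from originDf.
    val sample = Seq(1.0, 1.0, 1.0, 2.0, 2.0, 3.0)
    exactTable(sample).foreach { case (v, (c, cum)) =>
      println(s"$v -> ($c, $cum)")
    }
    // prints:
    // 1.0 -> (3.0, 3.0)
    // 2.0 -> (2.0, 5.0)
    // 3.0 -> (1.0, 6.0)
  }
}
```

Running the same computation on the "count" values of a single fruit_type group would give a ground-truth table to hold against the printed digest for that group.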

It turns out that t1, t2, and t3 (all instances of the org.isarnproject.sketches.java.TDigest class) look exactly the same value-wise.
However, Boolean checks such as t1.equals(t2) all return false, which suggests that these are somehow different TDigest objects.
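An equality check can return false for reasons unrelated to the stored clusters (for instance, if it also compares configuration fields or falls back to object identity; neither is confirmed here for this class). A more direct check is to compare what the digests compute. The sketch below is plain Scala and only assumes that each digest can be wrapped as a function from a value to its CDF. The names `CompareCdfs`, `maxCdfGap`, and `empiricalCdf` are hypothetical, and `empiricalCdf` is a stand-in for a real digest.

```scala
// Plain Scala sketch: compares two distributions through their CDFs.
// All names here are hypothetical and not part of isarn-sketches.
object CompareCdfs {
  // Largest absolute difference between two CDFs over the probe points.
  // The probe sequence must be non-empty.
  def maxCdfGap(cdfA: Double => Double, cdfB: Double => Double, probes: Seq[Double]): Double =
    probes.map(x => math.abs(cdfA(x) - cdfB(x))).max

  // Stand-in for a digest's CDF: the empirical CDF of a small sample.
  def empiricalCdf(sample: Seq[Double]): Double => Double =
    x => sample.count(_ <= x).toDouble / sample.size

  def main(args: Array[String]): Unit = {
    val a = empiricalCdf(Seq(1.0, 1.0, 2.0, 3.0))
    val b = empiricalCdf(Seq(1.0, 1.0, 2.0, 3.0))
    val c = empiricalCdf(Seq(1.0, 2.0, 2.0, 3.0))
    val probes = Seq(0.0, 1.0, 2.0, 3.0, 4.0)
    println(maxCdfGap(a, b, probes)) // 0.0  (identical distributions)
    println(maxCdfGap(a, c, probes)) // 0.25 (they differ at x = 1.0)
  }
}
```

With the real digests, one could pass something like `x => t1.cdf(x)` and `x => t2.cdf(x)`, assuming the TDigest class exposes a `cdf` method with that shape; that signature is an assumption to verify against the library before relying on it.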

Do you see anything off in this usage? Did we use the TDigest correctly, and more importantly, is our understanding of the algorithm correct?

Thank you and we look forward to your responses.
