C (compression) and M (maxDiscrete) parameters return the same results

Greetings, Mr. Erlandson (@erikerlandson),

The team I represent recently started using this solution and we've been trying to tune accuracy and performance via 
different C (compression) and M (maxDiscrete) parameters.

In our case, we had a dataframe called "originDf" which has 3 columns ("id": String, "fruit_type": String, "count": Integer)

+------+--------------------+-----------+             
|id       |fruit_type			|count		|
+------+--------------------+-----------+
|001|Apples			|2          	|
|002|Apricots         |79        	|
|001|Avocados      |4          	|
|003|Watermelon	|13        	|
|007|Blueberries    |5          	|
|007|Cherries			|6          	|
|007|Clementine	|41        	|
|007|Cucumbers	|5          	|
|007|Elderberry      |3          	|
|007|Eggfruit    		|1          	|
|008|Eggfruit    		|19        	|
|012|Clementine    |61        	|
|013|Blueberries	|21        	|
|014|Blueberries    |4          	|
|...|Lime       		|4          	|
|...|Rambutan		|3          	|
|...|Strawberries  |6          	|
|...|Watermelon   |5          	|
|...|Tangerine      |3          	|
|...|Tangerine      |6          	|
+------+--------------------+-----------+
This dataframe has roughly 0.2 billion rows in total.

We then did the following:

val t_digest_udf1= TDigestAggregator.udf[Double](compression = C, maxDiscrete = M) 
val groupDf1 = originDf.groupBy(col("fruit_type")).agg(t_digest_udf1(col("count")) as "t_digests",) 
val udf1 = groupDf1.first()
val t1 = udf1.getAs[TDigest](1)

where (C, M) is respectively (100, 100), (0.01, 100), (100, 10000), and the resulting single TDigest from each set as t1, t2, and t3

t1: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t2: org.isarnproject.sketches.java.TDigest =  TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

t3: org.isarnproject.sketches.java.TDigest =  TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)

It turns out t1, t2 & t3 (all of the org.isarnproject.sketches.java.TDigest class) look exactly the same value-wise.
Though Boolean checks such as t1.equal(t2) all return false, which indicates these are different TDigest entities somehow.

Do you see anything off in this usage? Did we use the TDigest correctly, and more importantly, is our understanding of the algorithm correct?

Thank you and we look forward to your responses.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

C (compression) and M (maxDiscrete) parameters return the same results #16

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

C (compression) and M (maxDiscrete) parameters return the same results #16

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions