sum_of_squares calculation and docs don't align #50416

Closed
@jmceniery

Either a link in the docs to a Wikipedia article is incorrect and should be removed, or the function that calculates the sum of squares does not match its intended usage and needs to be updated.

Describe the feature:
Expecting a sum_of_squares output in the Extended Stats Aggregation to align with the formula provided by Wikipedia and many other statistical sources.

The Extended Stats Aggregation currently calculates sum_of_squares with the following statement: sumOfSquares += value * value; The docs reference a Wikipedia article that gives a different formula.

Elasticsearch version: Tested on 6.5.4, 7.5.0

Description of the problem including expected versus actual behavior:
The current calculation of sum_of_squares does not align with the statistical definition of the sum of squares.

Sum of squares is a statistical technique used in regression analysis to determine the dispersion of data points. In a regression analysis, the goal is to determine how well a data series can be fitted to a function that might help to explain how the data series was generated. Sum of squares is used as a mathematical way to find the function that best fits (varies least from) the data.
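
For reference, the two quantities being contrasted in this issue can be written side by side. The symbols $x_i$, $\bar{x}$ (the mean), $n$ (the number of values), and the labels "raw" and "dev" are notation introduced here for illustration; they are not taken from the Elasticsearch docs.

```latex
\begin{align}
SS_{\text{raw}} &= \sum_{i=1}^{n} x_i^{2} \\
SS_{\text{dev}} &= \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^{2}
                 = SS_{\text{raw}} - n\,\bar{x}^{2}
\end{align}
```

The first line corresponds to accumulating value * value; the second is the dispersion measure described above.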

Many sum of squares calculators do not align to the way the sum of
Steps to reproduce:

List of Numbers: 74.01,74.77,73.94,73.61,73.40
Expected outcome:

SS = (74.01 - 73.95)^2 + (74.77 - 73.95)^2 + (73.94 - 73.95)^2 + (73.61 - 73.95)^2 + (73.40 - 73.95)^2
SS = (0.06)^2 + (0.82)^2 + (-0.01)^2 + (-0.34)^2 + (-0.55)^2
SS = 1.0942

Actual Outcome:
Elasticsearch appears to use the following formula to calculate sum_of_squares:

SS = (74.01)^2 + (74.77)^2 + (73.94)^2 + (73.61)^2 + (73.40)^2
SS = 5477.4801 + 5590.5529 + 5467.1236 + 5418.4321 + 5387.56
SS = 27341.1487
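
The two hand calculations can be checked with a short script. This is only a sketch for comparison: the variable names are made up for illustration, and it does not use any Elasticsearch code. It also shows that the expected value of 1.0942 relies on the mean being rounded to 73.95; with the unrounded mean (73.946) the result is slightly different.

```python
grades = [74.01, 74.77, 73.94, 73.61, 73.40]

# Raw sum of squares: what accumulating `value * value` produces.
raw_sum_of_squares = sum(v * v for v in grades)

# Sum of squared deviations from the (unrounded) mean.
mean = sum(grades) / len(grades)
deviation_sum_of_squares = sum((v - mean) ** 2 for v in grades)

# Same quantity with the mean rounded to two decimals (73.95),
# which is what the hand calculation in this issue uses.
rounded_mean = round(mean, 2)
rounded_deviation_sum_of_squares = sum((v - rounded_mean) ** 2 for v in grades)

print(f"{raw_sum_of_squares:.4f}")                # 27341.1487
print(f"{deviation_sum_of_squares:.5f}")          # 1.09412
print(f"{rounded_deviation_sum_of_squares:.4f}")  # 1.0942
```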

Recreate:
Created an index:

PUT /sum_of_squares_test_2
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 1
        }
    }
}

Add some docs:

POST /sum_of_squares_test_2/_doc/
{
  "grade": 74.01
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 74.77
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 73.94
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 73.61
}

POST /sum_of_squares_test_2/_doc
{
  "grade": 73.40
}

Search the index:

GET sum_of_squares_test_2/_search
{
  "size":0,
  "aggs":{
    "grade_stats":{
      "extended_stats":{
        "field":"grade"
      }
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "grade_stats" : {
      "count" : 5,
...
      "sum_of_squares" : 27341.149189099146,
...
      }
    }
  }
}

Search the index using SQL:

POST /_xpack/sql?format=txt
{
    "query": "SELECT SUM_OF_SQUARES(grade) AS sumsq FROM sum_of_squares_test_2"
}

Response:

      sumsq       
------------------
27341.149189099146

If the current method is working as intended, can the statistical method also be added? In that case the link in the docs will need to be removed. I am happy to put in the PR once I have clarification.
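
In the meantime, the deviation-based value can probably be derived client-side from the fields the aggregation already returns. The sketch below assumes the extended_stats result exposes `count`, `sum`, and `sum_of_squares`; the response above shows `count` and `sum_of_squares`, and `sum` is assumed to be among the elided fields. The function name and the sample dict are hypothetical, and the sample values were computed by hand from the five grades rather than copied from an Elasticsearch response.

```python
def deviation_sum_of_squares(stats: dict) -> float:
    """Derive the sum of squared deviations from the mean.

    Uses the identity: sum((x - mean)^2) = sum(x^2) - (sum(x))^2 / n.
    `stats` is assumed to look like an extended_stats result with
    "count", "sum" and "sum_of_squares" keys.
    """
    count = stats["count"]
    if count == 0:
        return 0.0
    # Note: this subtraction loses precision when the mean is large
    # relative to the spread of the values.
    return stats["sum_of_squares"] - stats["sum"] ** 2 / count


# Hypothetical input, computed by hand from the five grades in this issue.
sample_stats = {"count": 5, "sum": 369.73, "sum_of_squares": 27341.1487}
result = deviation_sum_of_squares(sample_stats)
print(f"{result:.4f}")
```

With the unrounded mean this gives roughly 1.0941 rather than 1.0942, for the same reason as in the hand calculation: the expected outcome above uses a mean rounded to 73.95.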
