“fix#829-PreAgg-field_doc_count” #839

cwillum · 2022-08-02T18:56:33Z

Signed-off-by: cwillum <cwmmoore@amazon.com>

Fixes #829

Description

Add documentation to Bucket Aggregation describing the use of the _doc_count field for computing documents that store pre-aggregated data.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: cwillum <cwmmoore@amazon.com>

cwillum · 2022-08-02T18:59:33Z

@petardz Thanks for your input on documentation for this issue. Could you review this content for technical accuracy? Thanks.

kolchfa-aws · 2022-08-02T19:22:42Z

_opensearch/bucket-agg.md

@@ -74,6 +74,98 @@ The `terms` aggregation requests each shard for its top 3 unique terms. The coor

 This is especially true if `size` is set to a low number. Because the default size is 10, an error is unlikely to happen. If you don’t need high accuracy and want to increase the performance, you can reduce the size.

+### Account for pre-aggregated data
+
+While the `doc_count` field provides a representation of the number of individual documents aggregated in a bucket, the field by itself does not have a way to account for documents that store pre-aggregated data, such as `histogram`. To account for pre-aggregated data and accurately calculate the number of documents in a bucket, you can use the `_doc_count` field to add the number of documents in a single summary field. When a document includes the `_doc_count` field, all bucket aggregations recognize its value and increase the bucket `doc_count` proportionately. Keep these considerations in mind when using the `_doc_count` field:


Can we rephrase this to simplify?
"While the `doc_count` field represents the number of individual documents aggregated in a bucket, the field by itself does not account for documents that store pre-aggregated data, such as `histogram`."

Also,
"When a document includes the `_doc_count` field, all bucket aggregations increase the bucket `doc_count` by the value of `_doc_count`."

Naarcha-AWS · 2022-08-02T19:25:04Z

@petardz Thanks for your input on documentation for this issue. Could you review this content for technical accuracy? Will this be backported to v1.3, v2.0, and v2.1? Thanks.

@cwillum: Based on the conversation in this PR (opensearch-project/OpenSearch#3985), it seems like this code wasn't included in OpenSearch 1.0. Furthermore, the PR only contains a "backport 2.x" label, meaning that the change will be backported to the latest minor version of OpenSearch. We shouldn't need to backport on this one. Once the PR is approved and has gone through editorial review, add the "5- Done and waiting to merge" label to this PR.

Signed-off-by: cwillum <cwmmoore@amazon.com>

petardz · 2022-08-03T20:04:53Z

@cwillum I'm sorry, the link with the example I provided earlier is not compatible with OpenSearch. Here is a synthetic example:

Create index:

PUT my_index
{
  "settings": {
    "number_of_replicas": 0
  }, 
  "mappings" : {
    "properties" : {
      "str" : {
        "type" : "keyword"
      },
      "number" : {
        "type" : "integer"
      }
    }
  }
}

Add a few documents:

POST my_index/_doc
{
  "_doc_count": 10,
  "str": "abc",
  "number" : 500
}

POST my_index/_doc
{
  "_doc_count": 5,
  "str": "xyz",
  "number" : 100
}

POST my_index/_doc
{
  "_doc_count": 7,
  "str": "foo",
  "number" : 100
}

Perform an aggregation:

POST my_index/_search
{ 
  "size" : 0,
  "aggs" : { 
    "num_terms" : { 
      "terms" : { 
        "field" : "number" 
      }
    }
  }
}

Response:

 {
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "num_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 100,
          "doc_count" : 12
        },
        {
          "key" : 500,
          "doc_count" : 10
        }
      ]
    }
  }
}

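Notice how `_doc_count` was used when calculating the `doc_count` of each bucket: the bucket for key 100 has a `doc_count` of 12 (5 + 7), and the bucket for key 500 has a `doc_count` of 10, even though only three documents were indexed.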
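
To make the arithmetic in this response explicit, here is a minimal Python sketch. It is not OpenSearch code, and `terms_doc_counts` is a hypothetical helper written only for illustration. It sums `_doc_count` per distinct value of `number` for the three documents above and, following the proposed doc text, counts a document without `_doc_count` as 1:

```python
from collections import defaultdict


def terms_doc_counts(docs, field):
    """Hypothetical helper: bucket documents by `field` and add each
    document's _doc_count (or 1 if the field is absent) to its bucket."""
    counts = defaultdict(int)
    for doc in docs:
        counts[doc[field]] += doc.get("_doc_count", 1)
    return dict(counts)


# The three documents indexed in the example above.
docs = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
]

buckets = terms_doc_counts(docs, "number")
print(buckets)  # {500: 10, 100: 12}
```

The totals match the `doc_count` values in the response (12 for key 100, 10 for key 500). The sketch ignores sharding, bucket ordering, and the `size` limit, all of which a real `terms` aggregation applies.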

kolchfa-aws · 2022-08-05T15:56:20Z

@cwillum: please use the following example:

PUT /my_index/_doc/1
{
  "response_code": 404,
  "date":"2022-08-05",
  "_doc_count": 20
}

PUT /my_index/_doc/2
{
  "response_code": 404,
  "date":"2022-08-06",
  "_doc_count": 10
}

PUT /my_index/_doc/3
{
  "response_code": 200,
  "date":"2022-08-06",
  "_doc_count": 300
}

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field" : "response_code"
      }
    }
  }
}

Response:

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "response_codes" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 200,
          "doc_count" : 300
        },
        {
          "key" : 404,
          "doc_count" : 30
        }
      ]
    }
  }
}

Signed-off-by: cwillum <cwmmoore@amazon.com>

kolchfa-aws

LGTM

natebower

@cwillum Just one comment for you. Let me know if you have any questions. Thanks!

natebower · 2022-08-08T13:28:02Z

_opensearch/bucket-agg.md

+* The field does not support nested arrays; only positive integers can be used.
+* If a document does not contain the `_doc_count` field, aggregation uses the document to increase the count by 1.
+
+OpenSearch features that rely on an accurate document count illustrate the importance of using the `_doc_count` field. To get a better sense for how this field can support other search functionality, see [Index rollups](https://opensearch.org/docs/latest/im-plugin/index-rollups/index/), an OpenSearch feature for the Index Management plugin that stores documents with pre-aggregated data in rollup indexes.


Change "To get a better sense for" to "For information on" (or something similar). Change "Index Management" to "Index Management (IM)".
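
For reviewers checking the two bullet points in the diff hunk above (only positive integers are allowed; a document without `_doc_count` increases the count by 1), the rules can be sketched in Python as follows. This is an illustration only: `effective_doc_count` is a hypothetical name, the sample documents are invented, and raising `ValueError` is an assumption about how an invalid value might be rejected, not a description of OpenSearch's actual error handling.

```python
def effective_doc_count(doc):
    """Hypothetical helper: return how much one document adds to a
    bucket's doc_count under the rules quoted in the diff."""
    if "_doc_count" not in doc:
        # No _doc_count field: the document counts as a single document.
        return 1
    value = doc["_doc_count"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        # Assumption for illustration: reject anything but a positive integer.
        raise ValueError(f"_doc_count must be a positive integer, got {value!r}")
    return value


# Invented sample documents: one pre-aggregated, one ordinary.
sample = [
    {"response_code": 404, "_doc_count": 20},
    {"response_code": 404},
]
total_404 = sum(effective_doc_count(d) for d in sample)
print(total_404)  # 21
```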

Signed-off-by: cwillum <cwmmoore@amazon.com>

This reverts commit 07eda05.

This reverts commit 021999f.

“fix#829-PreAgg-field_doc_count”

f3541d3

Signed-off-by: cwillum <cwmmoore@amazon.com>

cwillum added the "3 - Tech review" (PR: Tech review in progress), "v2.2.0", and "4 - Doc review" (PR: Doc review in progress) labels Aug 2, 2022

cwillum added this to the v2.2 milestone Aug 2, 2022

cwillum self-assigned this Aug 2, 2022

cwillum requested a review from a team as a code owner August 2, 2022 18:56

cwillum requested a review from kolchfa-aws August 2, 2022 19:01

kolchfa-aws reviewed Aug 2, 2022

“fix#829-PreAgg-field_doc_count”

882ffb6

Signed-off-by: cwillum <cwmmoore@amazon.com>

cwillum added 2 commits August 5, 2022 09:31

“fix#829-PreAgg-field_doc_count”

9e6ea0b

Signed-off-by: cwillum <cwmmoore@amazon.com>

“fix#829-PreAgg-field_doc_count”

055af42

Signed-off-by: cwillum <cwmmoore@amazon.com>

kolchfa-aws approved these changes Aug 5, 2022

cwillum added the "5 - Editorial review" (PR: Editorial review in progress) label Aug 5, 2022

natebower reviewed Aug 8, 2022

natebower removed the "5 - Editorial review" (PR: Editorial review in progress) label Aug 8, 2022

“fix#829-PreAgg-field_doc_count”

9af08d2

Signed-off-by: cwillum <cwmmoore@amazon.com>

Naarcha-AWS approved these changes Aug 8, 2022

cwillum removed the "3 - Tech review" (PR: Tech review in progress) and "4 - Doc review" (PR: Doc review in progress) labels Aug 8, 2022

JeffHuss approved these changes Aug 8, 2022

cwillum merged commit 07eda05 into main Aug 8, 2022

cwillum added a commit that referenced this pull request Aug 8, 2022

Revert "“fix#829-PreAgg-field_doc_count” (#839)"

c10a911

This reverts commit 07eda05.

cwillum mentioned this pull request Aug 8, 2022

Revert "“fix#829-PreAgg-field_doc_count”" #860

Merged

Naarcha-AWS pushed a commit that referenced this pull request Aug 8, 2022

Revert "“fix#829-PreAgg-field_doc_count” (#839)" (#860)

021999f

This reverts commit 07eda05.

Naarcha-AWS added a commit that referenced this pull request Aug 10, 2022

Revert "Revert "“fix#829-PreAgg-field_doc_count” (#839)" (#860)"

ffc4480

This reverts commit 021999f.

Naarcha-AWS deleted the fix#829-PreAgg_doc_count-field branch September 14, 2022 17:51
