“fix#829-PreAgg-field_doc_count” #839

Merged: 5 commits merged on Aug 8, 2022

@cwillum cwillum commented Aug 2, 2022

Signed-off-by: cwillum cwmmoore@amazon.com

Fixes #829

Description

Add documentation to Bucket Aggregation describing the use of the _doc_count field for computing documents that store pre-aggregated data.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: cwillum <cwmmoore@amazon.com>
@cwillum cwillum added the "3 - Tech review" (PR: Tech review in progress), "v2.2.0", and "4 - Doc review" (PR: Doc review in progress) labels Aug 2, 2022
@cwillum cwillum added this to the v2.2 milestone Aug 2, 2022
@cwillum cwillum self-assigned this Aug 2, 2022
@cwillum cwillum requested a review from a team as a code owner August 2, 2022 18:56

cwillum commented Aug 2, 2022

@petardz Thanks for your input on documentation for this issue. Could you review this content for technical accuracy? Thanks.

@@ -74,6 +74,98 @@ The `terms` aggregation requests each shard for its top 3 unique terms. The coor

This is especially true if `size` is set to a low number. Because the default size is 10, an error is unlikely to happen. If you don’t need high accuracy and want to increase the performance, you can reduce the size.

### Account for pre-aggregated data

While the `doc_count` field provides a representation of the number of individual documents aggregated in a bucket, the field by itself does not have a way to account for documents that store pre-aggregated data, such as `histogram`. To account for pre-aggregated data and accurately calculate the number of documents in a bucket, you can use the `_doc_count` field to add the number of documents in a single summary field. When a document includes the `_doc_count` field, all bucket aggregations recognize its value and increase the bucket `doc_count` proportionately. Keep these considerations in mind when using the `_doc_count` field:

Can we rephrase this to simplify?
While the doc_count field represents the number of individual documents aggregated in a bucket, the field by itself does not account for documents that store pre-aggregated data, such as histogram.

Also,
When a document includes the _doc_count field, all bucket aggregations increase the bucket doc_count by the value of _doc_count.

@Naarcha-AWS

@petardz Thanks for your input on documentation for this issue. Could you review this content for technical accuracy? Will this be backported to v1.3, v2.0, and v2.1? Thanks.

@cwillum: Based on the conversation in opensearch-project/OpenSearch#3985, it seems that this code wasn't included in OpenSearch 1.0. Furthermore, that PR only has a "backport 2.x" label, meaning that the change will be backported to the latest minor version of OpenSearch. We shouldn't need to backport this one. Once the PR is approved and has gone through editorial review, add the "5- Done and waiting to merge" label to this PR.

Signed-off-by: cwillum <cwmmoore@amazon.com>

petardz commented Aug 3, 2022

@cwillum I'm sorry, the link with the example I provided earlier is not compatible with OpenSearch. Here is a synthetic example:

  1. Create an index:
PUT my_index
{
  "settings": {
    "number_of_replicas": 0
  }, 
  "mappings" : {
    "properties" : {
      "str" : {
        "type" : "keyword"
      },
      "number" : {
        "type" : "integer"
      }
    }
  }
}
  2. Add a few documents:
POST my_index/_doc
{
  "_doc_count": 10,
  "str": "abc",
  "number" : 500
}

POST my_index/_doc
{
  "_doc_count": 5,
  "str": "xyz",
  "number" : 100
}

POST my_index/_doc
{
  "_doc_count": 7,
  "str": "foo",
  "number" : 100
}
  3. Perform the aggregation:
POST my_index/_search
{ 
  "size" : 0,
  "aggs" : { 
    "num_terms" : { 
      "terms" : { 
        "field" : "number" 
      }
    }
  }
}

Response:


{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "num_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 100,
          "doc_count" : 12
        },
        {
          "key" : 500,
          "doc_count" : 10
        }
      ]
    }
  }
}

Notice how _doc_count was used when calculating the doc_count of each bucket.
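
The arithmetic behind this response can be sketched in a few lines of Python. This is only an illustrative model of the counting rule, not OpenSearch's implementation; the function name `terms_doc_counts` is hypothetical, and the documents are the three indexed in the example above.

```python
from collections import defaultdict

def terms_doc_counts(docs, field):
    """Model the bucket doc_count of a terms aggregation (illustrative only).

    Each document adds its `_doc_count` value to its bucket, or 1 when
    the field is absent.
    """
    counts = defaultdict(int)
    for doc in docs:
        counts[doc[field]] += doc.get("_doc_count", 1)
    return dict(counts)

# The three documents from the example above.
docs = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
]

buckets = terms_doc_counts(docs, "number")
print(buckets)  # {500: 10, 100: 12}
```

The bucket for key 100 gets 5 + 7 = 12 and the bucket for key 500 gets 10, matching the response.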


kolchfa-aws commented Aug 5, 2022

@cwillum: please use the following example:

PUT /my_index/_doc/1
{
  "response_code": 404,
  "date":"2022-08-05",
  "_doc_count": 20
}

PUT /my_index/_doc/2
{
  "response_code": 404,
  "date":"2022-08-06",
  "_doc_count": 10
}

PUT /my_index/_doc/3
{
  "response_code": 200,
  "date":"2022-08-06",
  "_doc_count": 300
}

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field" : "response_code"
      }
    }
  }
}

Response:

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "response_codes" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 200,
          "doc_count" : 300
        },
        {
          "key" : 404,
          "doc_count" : 30
        }
      ]
    }
  }
}
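
One detail worth checking in this response: `hits.total.value` still reports the 3 stored documents, while the bucket `doc_count` values add up to the 330 documents that those summaries represent. A minimal Python sketch of that check follows, using a trimmed copy of the response above (the variable names are illustrative).

```python
import json

# Trimmed copy of the response above; only the fields used here are kept.
response = json.loads("""
{
  "hits": {"total": {"value": 3, "relation": "eq"}},
  "aggregations": {
    "response_codes": {
      "buckets": [
        {"key": 200, "doc_count": 300},
        {"key": 404, "doc_count": 30}
      ]
    }
  }
}
""")

# Number of documents actually stored in the index.
stored_docs = response["hits"]["total"]["value"]

# Number of documents represented once _doc_count is taken into account.
buckets = response["aggregations"]["response_codes"]["buckets"]
represented_docs = sum(bucket["doc_count"] for bucket in buckets)

print(stored_docs)       # 3
print(represented_docs)  # 330
```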

Signed-off-by: cwillum <cwmmoore@amazon.com>
Signed-off-by: cwillum <cwmmoore@amazon.com>

@kolchfa-aws kolchfa-aws left a comment

LGTM

@cwillum cwillum added the 5 - Editorial review PR: Editorial review in progress label Aug 5, 2022

@natebower natebower left a comment

@cwillum Just one comment for you. Let me know if you have any questions. Thanks!

* The field does not support nested arrays; only positive integers can be used.
* If a document does not contain the `_doc_count` field, aggregation uses the document to increase the count by 1.
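
The two considerations above can be modeled as a small helper. This is a sketch under stated assumptions: the name `effective_doc_count` is hypothetical, and raising `ValueError` for an invalid value is a choice made for this illustration; how OpenSearch itself reports such values is not modeled here.

```python
def effective_doc_count(doc):
    """Return the count a document contributes to a bucket (illustrative).

    Assumption: invalid values raise ValueError in this sketch; the real
    OpenSearch behavior for invalid values is not modeled.
    """
    if "_doc_count" not in doc:
        return 1  # no _doc_count field: the document counts once
    value = doc["_doc_count"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"_doc_count must be a positive integer, got {value!r}")
    return value

print(effective_doc_count({"response_code": 200}))                    # 1
print(effective_doc_count({"response_code": 404, "_doc_count": 20}))  # 20
```

In this model, a list such as `[1, 2]`, a negative number, or zero is rejected, while a document without the field falls back to a count of 1.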

OpenSearch features that rely on an accurate document count illustrate the importance of using the `_doc_count` field. To get a better sense for how this field can support other search functionality, see [Index rollups](https://opensearch.org/docs/latest/im-plugin/index-rollups/index/), an OpenSearch feature for the Index Management plugin that stores documents with pre-aggregated data in rollup indexes.

Change "To get a better sense for" to "For information on" (or something similar). Change "Index Management" to "Index Management (IM)".

@natebower natebower removed the 5 - Editorial review PR: Editorial review in progress label Aug 8, 2022
Signed-off-by: cwillum <cwmmoore@amazon.com>
@cwillum cwillum removed the "3 - Tech review" (PR: Tech review in progress) and "4 - Doc review" (PR: Doc review in progress) labels Aug 8, 2022
@cwillum cwillum merged commit 07eda05 into main Aug 8, 2022
cwillum added a commit that referenced this pull request Aug 8, 2022
Naarcha-AWS pushed a commit that referenced this pull request Aug 8, 2022
Naarcha-AWS added a commit that referenced this pull request Aug 10, 2022
@Naarcha-AWS Naarcha-AWS deleted the fix#829-PreAgg_doc_count-field branch September 14, 2022 17:51
Successfully merging this pull request may close these issues.

[FEATURE] added _doc_count field to summary documents
6 participants