
Too many fields in template's default_fields #14262

Closed
andrewkroh opened this issue Oct 28, 2019 · 21 comments · Fixed by #14341
Labels
discuss Issue needs further discussion. Filebeat Filebeat needs_team Indicates that the issue/PR needs a Team:* label Stalled

Comments

@andrewkroh
Member

In Filebeat we are close to going over 1024 fields in the default_field setting in Elasticsearch index template. This issue could affect other Beats too in the future (most likely Metricbeat). This will cause certain queries to the index to fail with an exception like:

"caused_by": {
      "type": "illegal_argument_exception",
      "reason": "field expansion matches too many fields, limit: 1024, got: 1293"
}
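For context, this error typically surfaces when a query_string or simple_query_string query names no fields, forcing Elasticsearch to expand settings.index.query.default_field. A minimal illustrative request body (the field names here are hypothetical) that sidesteps the expansion by listing fields explicitly:

```json
{
  "query": {
    "query_string": {
      "query": "192.168.1.1",
      "fields": ["source.ip", "destination.ip"]
    }
  }
}
```

Omitting "fields" falls back to default_field, which is where the 1024-field expansion limit bites.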

In Beats, when the index template is generated, all text and keyword fields are automatically added to the default_field list.

addToDefaultFields(&field)

We need a plan to deal with the growing number of fields in default_field. This issue is causing problems for me because I'm adding fields from CEF to the fields.yml.
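The generation step referenced above can be sketched as follows. This is a simplified stand-in, not the actual libbeat code (the struct and function names are illustrative); it mirrors the described behavior of collecting text, keyword, and ip fields into the default_field list:

```go
package main

import "fmt"

// Field is a simplified stand-in for a fields.yml entry; the real struct
// lives in libbeat and carries many more attributes.
type Field struct {
	Name string
	Type string
}

// appendDefaultFields mirrors the described behavior: text, keyword, and ip
// fields are collected into the template's
// settings.index.query.default_field list, while numeric fields are skipped.
func appendDefaultFields(defaultFields []string, fields []Field) []string {
	for _, f := range fields {
		switch f.Type {
		case "text", "keyword", "ip":
			defaultFields = append(defaultFields, f.Name)
		}
	}
	return defaultFields
}

func main() {
	fields := []Field{
		{Name: "message", Type: "text"},
		{Name: "source.ip", Type: "ip"},
		{Name: "bytes", Type: "long"}, // excluded: numeric
	}
	fmt.Println(appendDefaultFields(nil, fields)) // [message source.ip]
}
```

Because every text/keyword field qualifies by default, the list grows with every new module's fields.yml.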

@andrewkroh andrewkroh added discuss Issue needs further discussion. Filebeat Filebeat labels Oct 28, 2019
@andrewkroh
Member Author

I propose adding a new optional setting to the fields.yml file that allows specifying that a field, or a group of fields, should not be included in default_field. This keeps the existing behavior for now (included by default) and lets developers exclude fields when needed.

        - name: extensions
          type: group
          include_in_default_field: false
          description: >
            Collection of key-value pairs carried in the CEF extension field.
          fields:
            - name: agentAddress
              type: ip
              description: The IP address of the ArcSight connector that processed the event.

Any suggestions on naming for the param? I'm not too keen on include_in_default_field, but it's what I thought of initially.

@ruflin @tsg Thoughts?
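A minimal sketch of how such a group-level opt-out could propagate to child fields during template generation. The flag name and structures below are hypothetical, not the final implementation:

```go
package main

import "fmt"

// FieldNode is a hypothetical, simplified fields.yml node: either a group
// (Type == "group", with children) or a leaf field.
type FieldNode struct {
	Name         string
	Type         string
	DefaultField *bool // nil means "inherit" (the default is true)
	Fields       []FieldNode
}

// collectDefaultFields walks the tree and returns the dotted names of the
// fields destined for default_field, honoring a group-level opt-out that
// propagates to all children.
func collectDefaultFields(prefix string, nodes []FieldNode, inherited bool) []string {
	var out []string
	for _, n := range nodes {
		include := inherited
		if n.DefaultField != nil {
			include = *n.DefaultField
		}
		name := n.Name
		if prefix != "" {
			name = prefix + "." + n.Name
		}
		if n.Type == "group" {
			out = append(out, collectDefaultFields(name, n.Fields, include)...)
			continue
		}
		if include && (n.Type == "text" || n.Type == "keyword") {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	off := false
	tree := []FieldNode{{Name: "cef", Type: "group", Fields: []FieldNode{
		{Name: "name", Type: "keyword"},
		{Name: "extensions", Type: "group", DefaultField: &off, Fields: []FieldNode{
			{Name: "agentAddress", Type: "keyword"},
		}},
	}}}
	fmt.Println(collectDefaultFields("", tree, true)) // [cef.name]
}
```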

@ruflin
Member

ruflin commented Oct 28, 2019

Good that you found this; I was not aware of the limit. And it seems it can only be changed at query time: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-top-level-params Does this mean that already today, if someone queries across metricbeat-*,filebeat-*, they would hit this limit? If yes, this would probably have wider implications. The good news is that with the new package manager we will have one index per input, so the overall number of fields will be much lower.

I like the proposed solution as a short-term fix, but I think we also need a more scalable long-term solution to follow up with.

Naming: What about just using default_field: true?

@skh This is also relevant for you as you move the logic to Kibana.

@andrewkroh
Member Author

Filebeat is actually broken now in master. I thought it was my additions causing the problem, but master already has 1175 default fields.

+1 on default_field as the name. I have added that logic and it's working for me. I now need to find the fields that were recently added in master and remove some of them from the default_field list to get the number back below 1024.

Longer term, if we move to an index-per-module indexing strategy, this should not be a problem.

@ph
Contributor

ph commented Oct 29, 2019

@andrewkroh Sounds like a good strategy / band-aid. We are on borrowed time here; maybe we should also go through the existing modules in Filebeat and see if any fields could be removed from the list.

Also +1 on naming it default_field.

@ph
Contributor

ph commented Oct 29, 2019

Just adding a note here so we do not lose track: it's possible that 7.x is broken too, see #14298.

Also, that error is probably testable by doing a query against Elasticsearch, or by checking the number of default fields in the generated template. So we need to add a guard for these cases.

@ruflin
Member

ruflin commented Oct 30, 2019

@ph We should definitely add something like this to our CI. It's kind of bad that we only realised it 151 fields too late.

@ph
Contributor

ph commented Oct 30, 2019

@ruflin Yes. What concerns me is that the error is only raised at query time. Maybe it should also be validated when we push the template? Perhaps it is because users can change that limit on the fly that Elasticsearch does not validate it at insert time.

@ruflin
Member

ruflin commented Oct 30, 2019

@ph I think the argument here is that it is a query-time parameter, so an index with 3000 default_fields is totally fine as long as the query param is adjusted. Also, 3000 fields could come from multiple indices with multiple templates. But I share your sentiment that perhaps there should be a more global option for how many default_fields are allowed in a template, with a warning / error in that case.

andrewkroh added a commit to andrewkroh/beats that referenced this issue Oct 30, 2019
The number of fields in the Elasticsearch index template's `settings.index.query.default_field` option has grown over time, and is now greater than 1024 in Filebeat (Elastic licensed version). This causes queries to Elasticsearch to fail when a list of fields is not specified because there is a default limit of 1024 in Elasticsearch.

This adds a new setting to fields.yml called `default_field` whose value can be true/false (defaults to true). When true the text/keyword fields are added to the `default_field` list (as was the behavior before this change). And when set to false the field is omitted from the default_field list.

This adds a test for every beat to check if the default_field list contains more than 1000 fields. The limit is a little less than 1024 because `fields.*` is in the default_field list already and at query time that wildcard will be expanded and count toward the limit.

Fixes elastic#14262
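The guard described in the commit message could look roughly like this: parse the exported template and count the default_field entries, failing when the list approaches the 1024 query-time limit. The JSON path follows the template layout quoted above; the helper itself is illustrative, not the actual test code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkDefaultFieldCount parses an exported index template (as produced by
// e.g. `filebeat export template`) and returns the number of entries in
// settings.index.query.default_field. A threshold of 1000 leaves headroom
// below 1024 for wildcard expansion of entries like fields.*.
func checkDefaultFieldCount(templateJSON []byte) (int, error) {
	var tmpl struct {
		Settings struct {
			Index struct {
				Query struct {
					DefaultField []string `json:"default_field"`
				} `json:"query"`
			} `json:"index"`
		} `json:"settings"`
	}
	if err := json.Unmarshal(templateJSON, &tmpl); err != nil {
		return 0, err
	}
	return len(tmpl.Settings.Index.Query.DefaultField), nil
}

func main() {
	sample := []byte(`{"settings":{"index":{"query":{"default_field":["message","fields.*"]}}}}`)
	n, err := checkDefaultFieldCount(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(n, n <= 1000) // 2 true
}
```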
andrewkroh added a commit that referenced this issue Oct 31, 2019
* Add default_field option to fields.yml
Fixes #14262

* Exclude new zeek datasets from default_field list
andrewkroh added a commit to andrewkroh/beats that referenced this issue Nov 22, 2019
(cherry picked from commit 9f21b96)
@andrewkroh andrewkroh reopened this Dec 19, 2019
@andrewkroh
Member Author

andrewkroh commented Dec 19, 2019

Re-opening this because we only solved it with a temporary fix that comes with quite a few maintainability problems. The current fix was simply to not add any new fields to the default_field list, so any new fields must be explicitly marked in fields.yml with default_field: false.

This also has implications for ECS: when we want to update to a new fields.ecs.yml, we have to make sure any new ECS fields are not added to default_field. This is rippling into the elastic/ecs repo as a result. See elastic/ecs#687.

Are there any proposals for a more permanent fix? Will we move to a different indexing strategy where the number of fields is less of an issue?

@webmat
Contributor

webmat commented Dec 20, 2019

elastic/ecs#687 now automatically sets default_field: false on any field not part of a whitelist file. The whitelist is currently populated only with ECS 1.2 fields, as was the original intention of the PR. But if the Beats team wants to strategically whitelist a few more fields, we only need to add them to that file.

This workaround in ECS should also be considered a temporary workaround, IMO. Otherwise that would mean any field added from here to the end of the 7.x line wouldn't be added to default_field.

I think one of the non-breaking ways we could address this (but only with respect to the ECS fields) is to consider culling some ECS fields that made it into default_field but aren't actually useful in some of the Beats. Right now all of ECS is imported into all of the Beats, but, for example, I would never expect dns.* or tls.* to show up in Metricbeat. So a careful look at each Beat could lead to a cleanup of many fields -- different per Beat -- that can be removed from the current default_field setting in their templates.

Note that there's a broader point that these unrelated ECS field definitions could also be removed entirely from each Beat's template, but that's another issue.

webmat pushed a commit to elastic/ecs that referenced this issue Dec 23, 2019
This is so that Beats' default_fields don't go above 1024 field limit. See also elastic/beats#14262
@ruflin
Member

ruflin commented Dec 23, 2019

@andrewkroh Indexing-wise, we will switch to one index per dataset, so the total number of fields will be heavily reduced (as long as ECS doesn't have too many fields ;-) ). But that is the long-term solution. As we keep adding modules on the Beats side at the moment, I think we also need a mid-term solution. Perhaps we should loop in someone from the Elasticsearch team, like @jpountz, to get some input?

@collinbachi

collinbachi commented Jan 16, 2020

I ran into this today, while trying to figure out why one of my fields wasn't queryable.

Eventually I resolved it by removing several hundred fields I wasn't using from the filebeat index template, then updating default_field to *.

I was initially surprised to see the 1024 limit come into play, since I was using ~50 fields from filebeat, and allowing ~100 or so of my own to be dynamically indexed.

(Sorry if I'm using the wrong terminology, elasticsearch is hard :) )

@jpountz
Contributor

jpountz commented Jan 20, 2020

@ruflin Sorry, I had missed your ping. Lucene and Elasticsearch are not designed for the case where many fields exist but contain no data, and improving that would be a lot of work. It doesn't feel like the right trade-off given that we're moving to one index per module.

I wonder whether improving the defaults could make maintenance easier. For instance, float/double/scaled_float fields are usually not useful in default_field, and it's also generally unnecessary to add both a text field and its keyword sub-fields to the list. Could we try to add only the text field in that case?
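The second suggestion — not listing both a field and its .text multi-field — can be sketched as a post-processing step over the candidate list. The helper below is hypothetical, not Beats code:

```go
package main

import "fmt"

// pruneMultiFields drops a field from the candidate default_field list when
// its ".text" multi-field variant is also present, so only one of the pair
// gets expanded at query time.
func pruneMultiFields(fields []string) []string {
	present := make(map[string]bool, len(fields))
	for _, f := range fields {
		present[f] = true
	}
	var out []string
	for _, f := range fields {
		if present[f+".text"] {
			continue // keep only the .text variant of this pair
		}
		out = append(out, f)
	}
	return out
}

func main() {
	in := []string{"process.name", "process.name.text", "host.name"}
	fmt.Println(pruneMultiFields(in)) // [process.name.text host.name]
}
```

Each pruned pair frees one slot against the 1024-field expansion limit.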

@ruflin
Member

ruflin commented Jan 21, 2020

@jpountz Having one index per dataset will solve the issue. Unfortunately we are not there yet, and we currently have the above issue with Filebeat and Metricbeat. We already have quite a bit of logic / magic around which fields get added to default_fields and which don't: we already exclude all the "number" fields, and we once reverted to using only text and keyword but then realised that ip fields are pretty important. If someone types 192.168.1.1, they are searching across all the ip fields.

@MorrieAtElastic

How difficult would it be to add logic so that templates are only enabled for the Beats modules a user has selected? And would that be a potential long-term solution to this issue?

@ruflin
Member

ruflin commented Feb 13, 2020

@MorrieAtElastic Unfortunately this is not an easy problem. But we are tackling exactly that with the new Elastic Package Manager, which means Beats won't have to do any setup anymore.

dcode pushed a commit to dcode/ecs that referenced this issue Apr 15, 2020
jorgemarey pushed a commit to jorgemarey/beats that referenced this issue Jun 8, 2020
@willemdh

willemdh commented Aug 5, 2020

We are currently unable to do wildcard and regex lucene searches in filebeat-* due to this problem.


Elastic support tells me: "the developers say that is too many fields".

But we are using the provided Elastic Filebeat modules and templates... What are you expecting from us? That we index Filebeat datasets to non-filebeat-* indices? The result would be that built-in dashboards etc. stop working.

See case 00571318

Increasing indices.query.bool.max_clause_count seems like the only decent solution... But this conflicts with "Higher values can lead to performance degradations and memory issues, especially in clusters with a high load or few resources.", as documented in Search Settings.

So just to be clear, is this issue only related to the default fields in the Filebeat template? If so, why not remove some from the template until a better solution can be implemented? Also, the list contains some wildcard (*) fields, which makes the total number of default fields even more unpredictable.
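For reference, the cluster-wide workaround mentioned above is a static setting in elasticsearch.yml; the value below is only an example, and the documented performance caveat applies:

```yaml
# elasticsearch.yml
# Raises the maximum number of clauses (and thus expanded fields) per query.
# Example value only; higher values can degrade performance and memory use.
indices.query.bool.max_clause_count: 2048
```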

@jsoriano
Member

We are adding many fields named text to the template's default_fields. This is probably unexpected; if we need a text default field, we should add it at most once:

From filebeat export template:

          "process.args",
          "text",
          "process.executable",
          "process.hash.md5",
          "process.hash.sha1",
          "process.hash.sha256",
          "process.hash.sha512",
          "process.name",
          "text",
          "text",
          "text",
          "text",
          "text",
          "process.thread.name",

@webmat
Contributor

webmat commented Aug 12, 2020

This may be introduced by the .text multi-fields that have been added in many places.

I think this is a bug in the script that takes the field definitions and generates the default_fields.
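The suspected bug and a possible fix can be sketched as follows: when emitting a multi-field into default_field, the generator has to prefix the parent path, otherwise a bare sub-field name like text leaks into the list. The structures below are simplified and hypothetical:

```go
package main

import "fmt"

// MultiField is a simplified, hypothetical view of a field that carries
// multi-fields (sub-fields), as declared in fields.yml.
type MultiField struct {
	Name       string
	MultiNames []string // e.g. ["text"] for a .text multi-field
}

// buggyNames reproduces the symptom quoted above: the multi-field is emitted
// under its bare sub-field name ("text") instead of its full path.
func buggyNames(f MultiField) []string {
	return append([]string{f.Name}, f.MultiNames...)
}

// fixedNames prefixes the parent path, so the template receives
// "process.name.text" rather than a bare "text".
func fixedNames(f MultiField) []string {
	out := []string{f.Name}
	for _, m := range f.MultiNames {
		out = append(out, f.Name+"."+m)
	}
	return out
}

func main() {
	f := MultiField{Name: "process.name", MultiNames: []string{"text"}}
	fmt.Println(buggyNames(f)) // [process.name text]
	fmt.Println(fixedNames(f)) // [process.name process.name.text]
}
```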

@botelastic

botelastic bot commented Jul 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added Stalled needs_team Indicates that the issue/PR needs a Team:* label labels Jul 13, 2021
@botelastic

botelastic bot commented Jul 13, 2021

This issue doesn't have a Team:<team> label.


9 participants