Skip to content

Add request parameter to block index auto creation #34737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

vladimirdolzhenko
Copy link
Contributor

@vladimirdolzhenko vladimirdolzhenko commented Oct 23, 2018

Node setting action.auto_create_index allows to create index implicitly on bulk/reindex/update requests. Some cases require that index to be created in advance while it is still handy to have it turned on for the rest cases.

For that purposes it is worth to have an option to disable automatic index creation on a request level.

This PR adds request parameter auto_create_index to index/update/delete requests, to disable automatic index creation it has to be set to false if it is enabled by action.auto_create_index.
Field auto_create_index is added to action and meta data of bulk request.

Note: parameter auto_create_index can never be set to true

@vladimirdolzhenko vladimirdolzhenko added >enhancement :Data Management/Indices APIs APIs to create and manage indices and templates labels Oct 23, 2018
@vladimirdolzhenko vladimirdolzhenko self-assigned this Oct 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@vladimirdolzhenko
Copy link
Contributor Author

Note: the name of parameter (disable_auto_create_index) is still under discussion.

@vladimirdolzhenko
Copy link
Contributor Author

Relates to #34649: for the new custom exception message I include the index name into exception message.

@vladimirdolzhenko vladimirdolzhenko force-pushed the allow_auto_create_index_param branch from ffc6f0b to ef7c00f Compare October 23, 2018 14:37
@jasontedor
Copy link
Member

@vladimirdolzhenko Would you explain the motivation for proposing this change? I think that we have seen enhancements requests for this functionality before and turned them down (I will search for it later).

@vladimirdolzhenko
Copy link
Contributor Author

@jasontedor It has been asked by @bleskes

@jasontedor
Copy link
Member

@vladimirdolzhenko Okay, can you articulate what the need is?

@bleskes
Copy link
Contributor

bleskes commented Oct 24, 2018

Current shippers like beats default to bootstrap their connections by adding an index templates and relying on automatic index creation to create indices based on those templates. This was introduced in order to address problems where people forgot to run the setup command and started indexing, only to later find out that their data is no good. This default behavior works well and have solved the problem for which it is created, but there still two issues on that front:

  1. If beats doesn't ship data to ES (i.e., it ships to Kafka or logstash), people still need to bootstrap ES manually and they forget.
  2. Some people rely on external provisioning systems to configure ES and disable beats' auto setup features. This sometimes goes wrong and they end up with data in ES they need to clean up.

The ability to make sure that a request never creates an index (i.e., I only want this data to be indexed, if possible) will help beats address these issues when people use ILM (where an index needs to be explicitly created with a write alias).

Next to the above, such a flag may simplify using index templates in use cases where a single index is used for storage (like kibana and security). We current add a template for the purpose of protecting against people deleting an existing index and an inflight indexing request from creating a new one without a mapping. This flag would allow to fail the request and fall back to explicit index creation.

Last, and this is still very much in the air, there is chance the configuring the templates/data environment will move away from the edge shippers to some other central point, which strengthens the problems mentioned above. This too far out to effect current decisions but just put this on the horizon.

@bleskes
Copy link
Contributor

bleskes commented Oct 24, 2018

/cc @tsg

@clintongormley
Copy link
Contributor

I think that we have seen enhancements requests for this functionality before and turned them down (I will search for it later).

We discussed it briefly in #23685. I have since come to the conclusion that this is the only clean way to avoid the race condition around putting the index template and indexing the first document. It also has the beneficial side effect of reducing the need for templates (in favour of direct index creation).

@tomcallahan
Copy link
Contributor

Where do stand on this one? @vladimirdolzhenko @bleskes @jasontedor

@jaymode
Copy link
Member

jaymode commented Nov 5, 2018

This change would also benefit a cleanup in security. Today we try to be smart and make sure the index exists with the correct mappings before issuing a write but this is not foolproof so we also have to handle some error cases such as an IndexNotFoundException. Ideally we would like to assume everything is fine, issue the write, and then handle failures. If it is an INFE then we could create the index with the correct mappings.

@vladimirdolzhenko
Copy link
Contributor Author

@bleskes could you please have a look into code if we can discuss and choose the name of parameter later ?

@jasontedor
Copy link
Member

Thank you for articulating the use-case @bleskes. I am leaning towards this but I have a clarifying question.

If beats doesn't ship data to ES (i.e., it ships to Kafka or logstash), people still need to bootstrap ES manually and they forget.

I think one thing missing for me in this case is how does the request flag to disable auto-index creation get set? It seems to me that it would require a new parameter on the Elasticsearch output plugin in Logstash that requires the user to manually set this?

Some people rely on external provisioning systems to configure ES and disable beats' auto setup features. This sometimes goes wrong and they end up with data in ES they need to clean up.

This case is clear. If Beats auto-setup is disabled (setup.template.enabled set to false), then Beats will send the request flag to disable auto-index creation on each request, and can log when it receives an index not-found exception caused by this flag.

@tsg
Copy link

tsg commented Nov 7, 2018

If beats doesn't ship data to ES (i.e., it ships to Kafka or logstash), people still need to bootstrap ES manually and they forget.

I think one thing missing for me in this case is how does the request flag to disable auto-index creation get set? It seems to me that it would require a new parameter on the Elasticsearch output plugin in Logstash that requires the user to manually set this?

Logstash has an internal mechanism (called @metadata) to pass information from the input to the output. So the beats-input can set a flag in @metadata for each event/bulk, and the Elasticsearch output can set the option automatically if it sees this flag in @metadata. So setting this flag could be the default when getting data from the beats-input, or the Beat can set it (it's possible to send @metadata from the Beat as well, over lumberjack) if there are any reasons to believe that that bootstrap didn't happen.

Pinging @jsvd and @urso because this is their domain now.

@vladimirdolzhenko
Copy link
Contributor Author

We've discussed in another channel and agreed to the name of parameter:
?auto_create_index=false, can never be set to true

@urso
Copy link

urso commented Nov 7, 2018

One can disable all bootstrapping in beats and users do so in favour of running <beat> setup separately. We can easily add a setting or use setup.template.enabled to send this flag when publishing to Logstash.

The flag is valid for all events in a bulk request, but we send @metadata per event to Logstash. A bulk request from Logstash can potentially contain multiple events from different beats or other inputs, some having the flag set and others not. I don't think we want to have a per-bulk-setting being based on potential conflicting per event settings in Logstash.

Actually when sending Beats events directly or indirectly via Logstash, it would make sense to always enable this flag. But again, this might surprise users mixing Beats + other input types.

@bleskes
Copy link
Contributor

bleskes commented Nov 7, 2018

A bulk request from Logstash can potentially contain multiple events from different beats or other inputs, some having the flag set and others not. I don't think we want to have a per-bulk-setting being based on potential conflicting per event settings in Logstash.

@urso thanks for this insight and I fully agree. This can be solved both on the ES side (making this a per index request item parameter - which is something we really want to avoid to reduce the scope of the issue) and on the Logstash side (by making sure request never mix values of the relevant meta fields). Are you the right person to see what it would mean on the logstash output side?

@tsg
Copy link

tsg commented Nov 7, 2018

If not @urso, then @jsvd is :)

@jsvd
Copy link
Member

jsvd commented Nov 7, 2018

In logstash, mixing data from multiple sources in the same bulk request is a typical optimization we suggest to users: if possible use a single ES output and rely as much as possible on metadata to place events into different indices, figure out the document_id, document type (in pre ES 6.x era).

on the Logstash side (by making sure request never mix values of the relevant meta fields).

What we advise users when you can't mix the events in the same bulk request is then to use multiple ES outputs, but here we want the multiplexing to be done seamlessly.
Right now the only auto slicing we perform on the batch of events received from inputs+filters is by size to avoid 413 from ES. To solve this issue we'd have to perform other slices before the size slice.
Do we foresee other bulk level criteria for which we'd like to pre-slice the bulk request or only this disable_auto_create_index parameter?

@bleskes
Copy link
Contributor

bleskes commented Nov 9, 2018

@jsvd Thanks for explaining. Understood. I think that's quite a change on the Logstash side so we explored options on the ES side. The suggestion it to add an auto_create_index flag to each DocWriteRequest (the interface Index, Delete and Update request shares). This will allow you to keep current semantics and specify the flag on each bulk line. There is a down side though - the size of the request will grow with 27 bytes per line (unless you use smile which I suspect will compress repetition, or use other compression). Is that OK?

@jsvd
Copy link
Member

jsvd commented Nov 12, 2018

Thanks @bleskes that's great. Compression for requests is available on the ES output plugin but disabled by default.

To support this we can add a new configuration option called auto_create_index that either takes a boolean or the name of a field. For each event we either retrieve the value of the field in the event or just set it to true/false.

The suggestion it to add an auto_create_index flag to each DocWriteRequest (the interface Index, Delete and Update request shares).

So the create action will raise an error if auto_create_index is set?

@vladimirdolzhenko
Copy link
Contributor Author

@jsvd create is IndexRequest on our side, so it has to handle auto_create_index in the same way

@vladimirdolzhenko
Copy link
Contributor Author

vladimirdolzhenko commented Nov 16, 2018

we've discussed with @jsvd in another channel for the case when bulk request have requests for the same index but with different auto_create_index values :

ES does check and create missing indices before doing any operations from bulk request and bulk requests are executing concurrently: it is reasonably to eliminate that kind of inaccuracy. The most straight-forward is to reject entire bulk request as malformed

updated: It falls into default behaviour if there is at least one request with no value for auto_create_index field for the same index. I.e. to avoid index creation all requests for the same index have to have "auto_create_index": false.

@vladimirdolzhenko vladimirdolzhenko requested review from bleskes and removed request for bleskes November 20, 2018 12:54
@rjernst rjernst added :Data Management/Indices APIs APIs to create and manage indices and templates and removed :Data Management/Indices APIs APIs to create and manage indices and templates labels Feb 26, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Indices APIs)

@martijnvg
Copy link
Member

Closing this PR, because we think that there is better approach to achieve the same behaviour and because the PR is stale (we should have dealt with this earlier, but it fell though many cracks).

We discussed in a different context about overriding the action.auto_create_index feature for specific namespaces by adding a new updatable cluster setting (action.manual_create_index), which would accept a list of patterns. If an index name matches with one of these patterns then it would not auto-create an index (even if action.auto_create_index would be setup to create the index). We think that this approach is less intrusive than what is build here and at the same time flexible enough.

@martijnvg martijnvg closed this Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.