Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: significant rework of Event<->IngestDocument marshalling #51

Merged
merged 4 commits into from
May 5, 2023

Conversation

yaauie
Copy link
Member

@yaauie yaauie commented Apr 24, 2023

This PR Fundamentally changes the model for marshalling Events to IngestDocument`s
and back again to solve three inter-related issues:

Resolves: #47

The IngestDocument metadata fields are no longer added to the top-level of
the Event, and are now separately routed to [@metadata][_ingest_document],
fixing an issue where the presence of Elasticsearch-reserved fields such as
the top-level _version would cause a downstream Elasticsearch output to be
unable to index the event

Resolves: #54

The top-level @timestamp and @version fields are no longer excluded from
the IngestDocument, as required by some existing integration pipelines

Partially-Resolves: #65

Field-type conversions have been improved by adding a two-way-mapping between
the Logstash-internal Timestamp-type object and the equivalent
ZonedDateTime-object used in several Ingest Common processors

Review Note: unfortunately, these three issues became pretty tangled, so this code is best reviewed by looking at the result of the changes to the IngestDuplexMarshaller and not the diff

Copy link
Contributor

@roaksoax roaksoax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yaauie
Copy link
Member Author

yaauie commented Apr 25, 2023

@roaksoax unfortunately this opens up a can-of-worms because ingest pipelines do intentionally set metadata fields that are used downstream for indexing (see failing integration specs), and we currently have no way to propagate them to our Elasticsearch output plugin, which must supply them separately from the event payload itself.

I have considered adding a [@metadata][_ingest_pipeline_metadata] to contain any ingest pipeline metadata (possibly _id, _index, _version, _routing, and _version_type, all of which have sprintf-powered options on the ES output), but (1) the ES output must be configured explicitly with a sprintf for each option and (2) does not behave well when a sprintf references fields that are unset.

Example usage (would work IFF all five magic fields are set):

output {
  elasticsearch {
    # ... connection details
    document_id => "%{[@metadata][_ingest_pipeline_metadata][_id]}"
    index => "%{[@metadata][_ingest_pipeline_metadata][_index]}"
    version => "%{[@metadata][_ingest_pipeline_metadata][_version]}"
    version_type => "%{[@metadata][_ingest_pipeline_metadata][_version_type]}"
    routing => "%{[@metadata][_ingest_pipeline_metadata][_routing]}"
  }
}

While we theoretically could pre-populate all five with values when they are not present, we have no way to tell Elasticsearch to do its "default" id generation or routing.

The ES-ingest reserved Metadata fields by definition _cannot_ be top-level
fields of the events we send to Elasticsearch and therefore should not be
included in their reserved place in the Event that we generate from an
IngestDocument.

We already safely extract the `_version` from the IngestDocument's metadata
and safely map it to our LS-reserved `@version`, but there is reason to
believe that _any_ of these fields could be directly useful to a Logstash
pipeline that will eventually persist the events to Elasticsearch.

Here, we make all of these fields, sans their underscore prefix, available
in the metadata-key `[@metadata][_ingest_document]`, which prevents them
from clashing or being included in output accidentally but makes them
available to the downstream LS pipeline if the author wishes to access them.
Copy link
Collaborator

@mashhurs mashhurs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except doc N/A to _index map change, rest of them LGTM!

@@ -178,9 +178,27 @@ Changes to the IngestDocument will be reflected in the resulting Logstash Event,

| `tags` | `tags` | when ingest processing produces a top-level `tags` field that is not a collection of strings, it will be made available via the event's `_tags` field.

| _N/A_ | `_index`, `
Copy link
Collaborator

@mashhurs mashhurs Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| _N/A_ | `_index`, `

I got confused with this change. Can you please explain what it is? Basically, how non-available field can be mapped to _index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was accidental and will be removed. Oops.

yaauie added 2 commits May 4, 2023 21:42
Fundamentally changes the model for marshalling `Event`s to IngestDocument`s
and back again to solve three inter-related issues:

Resolves: #47
-------------
The `IngestDocument` metadata fields are no longer added to the top-level of
the `Event`, and are now separately routed to `[@metadata][_ingest_document]`,
fixing an issue where the presence of Elasticsearch-reserved fields such as
the top-level `_version` would cause a downstream Elasticsearch output to be
unable to index the event

Resolves: #54
-------------
The top-level `@timestamp` and `@version` fields are no longer excluded from
the `IngestDocument`, as required by some existing integration pipelines

Partially-Resolves: #65
-------------
Field-type conversions have been improved by adding a two-way-mapping between
the Logstash-internal `Timestamp`-type object and the equivalent
`ZonedDateTime`-object used in several Ingest Common processors
@yaauie yaauie changed the title fix: omit ES-reserved fields from Logstash Event fix: significant rework of Event<->IngestDocument marshalling May 5, 2023
Copy link
Collaborator

@mashhurs mashhurs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I greatly appreciate this work @yaauie, thanks so much !
This changes resolve deep-core issues of plugin-event to IngestDocument transfer safety.

I have tested with major issues with some integrations such as Apache, Nginx and LGTM!
We have to land ASAP this change and release the new 0.0.2 version.

@yaauie
Copy link
Member Author

yaauie commented May 5, 2023

Merging now, will land a new release of the plugin on Monday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants