Skip to content

[Ingest Pipelines] Preserve unknown fields in processors #91146

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jloleysens
Copy link
Contributor

@jloleysens jloleysens commented Feb 11, 2021

Summary

After elastic/elasticsearch#65314 was merged we discovered that when editing a processor unknown fields are not returned from the forms in ingest pipelines editor.

This contribution adds logic to pick out the unknown fields and keep them, even after updating a processor.

Also fix #91135

How to test

  1. Start Elastic and Kibana on basic
  2. Send the following request to ES to create a pipeline called nginx
cURL request
curl --request PUT \
  --url 'http://localhost:9200/_ingest/pipeline/nginx?=' \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --data '{
  "description": "Pipeline for parsing Nginx ingress controller access logs. Requires the geoip and user_agent plugins.",
  "processors": [
    {
			"set": {
				"field": "test",
				"value": "ok",
				"media_type": "application/json"
			},
      "grok": {
				"description": "test",
        "field": "message",
        "patterns": [
          "(%{NGINX_HOST} )?\"?(?:%{NGINX_ADDRESS_LIST:nginx.ingress_controller.remote_ip_list}|%{NOTSPACE:source.address}) - (-|%{DATA:user.name}) \\[%{HTTPDATE:nginx.ingress_controller.time}\\] \"%{DATA:nginx.ingress_controller.info}\" %{NUMBER:http.response.status_code:long} %{NUMBER:http.response.body.bytes:long} \"(-|%{DATA:http.request.referrer})\" \"(-|%{DATA:user_agent.original})\" %{NUMBER:nginx.ingress_controller.http.request.length:long} %{NUMBER:nginx.ingress_controller.http.request.time:double} \\[%{DATA:nginx.ingress_controller.upstream.name}\\] \\[%{DATA:nginx.ingress_controller.upstream.alternative_name}\\] (%{UPSTREAM_ADDRESS}|-) (%{NUMBER:nginx.ingress_controller.upstream.response.length:long}|-) (%{NUMBER:nginx.ingress_controller.upstream.response.time:double}|-) (%{NUMBER:nginx.ingress_controller.upstream.response.status_code:long}|-) %{GREEDYDATA:nginx.ingress_controller.http.request.id}"
        ],
        "pattern_definitions": {
          "NGINX_HOST": "(?:%{IP:destination.ip}|%{NGINX_NOTSEPARATOR:destination.domain})(:%{NUMBER:destination.port})?",
          "NGINX_NOTSEPARATOR": "[^\t ,:]+",
          "NGINX_ADDRESS_LIST": "(?:%{IP}|%{WORD})(\"?,?\\s*(?:%{IP}|%{WORD}))*",
          "UPSTREAM_ADDRESS": "%{IP:nginx.ingress_controller.upstream.ip}(:%{NUMBER:nginx.ingress_controller.upstream.port})?"
        },
        "ignore_missing": true
      }
    },
    {
      "grok": {
        "field": "nginx.ingress_controller.info",
        "patterns": [
          "%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}",
          ""
        ],
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "nginx.ingress_controller.info"
      }
    },
    {
      "split": {
        "field": "nginx.ingress_controller.remote_ip_list",
        "separator": "\"?,?\\s+",
        "ignore_missing": true
      }
    },
    {
      "split": {
        "field": "nginx.ingress_controller.origin",
        "separator": "\"?,?\\s+",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "source.address",
        "if": "ctx.source?.address == null",
        "value": ""
      }
    },
    {
      "script": {
        "if": "ctx.nginx?.access?.remote_ip_list != null && ctx.nginx.ingress_controller.remote_ip_list.length > 0",
        "lang": "painless",
        "source": "boolean isPrivate(def dot, def ip) {\n  try {\n    StringTokenizer tok = new StringTokenizer(ip, dot);\n    int firstByte = Integer.parseInt(tok.nextToken());\n    int secondByte = Integer.parseInt(tok.nextToken());\n    if (firstByte == 10) {\n      return true;\n    }\n    if (firstByte == 192 && secondByte == 168) {\n      return true;\n    }\n    if (firstByte == 172 && secondByte >= 16 && secondByte <= 31) {\n      return true;\n    }\n    if (firstByte == 127) {\n      return true;\n    }\n    return false;\n  }\n  catch (Exception e) {\n    return false;\n  }\n} try {\n  ctx.source.address = null;\n  if (ctx.nginx.ingress_controller.remote_ip_list == null) {\n    return;\n  }\n  def found = false;\n  for (def item : ctx.nginx.ingress_controller.remote_ip_list) {\n    if (!isPrivate(params.dot, item)) {\n      ctx.source.address = item;\n      found = true;\n      break;\n    }\n  }\n  if (!found) {\n    ctx.source.address = ctx.nginx.ingress_controller.remote_ip_list[0];\n  }\n} catch (Exception e) {\n  ctx.source.address = null;\n}",
        "params": {
          "dot": "."
        }
      }
    },
    {
      "remove": {
        "field": "source.address",
        "if": "ctx.source.address == null"
      }
    },
    {
      "grok": {
        "field": "source.address",
        "patterns": [
          "^%{IP:source.ip}$"
        ],
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": "message"
      }
    },
    {
      "rename": {
        "field": "@timestamp",
        "target_field": "event.created"
      }
    },
    {
      "date": {
        "field": "nginx.ingress_controller.time",
        "target_field": "@timestamp",
        "formats": [
          "dd/MMM/yyyy:H:m:s Z"
        ],
        "on_failure": [
          {
            "append": {
              "field": "error.message",
              "value": "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    },
    {
      "remove": {
        "field": "nginx.ingress_controller.time"
      }
    },
    {
      "user_agent": {
        "field": "user_agent.original",
        "ignore_missing": true
      }
    },
    {
      "geoip": {
        "field": "source.ip",
        "target_field": "source.geo",
        "ignore_missing": true
      }
    },
    {
      "geoip": {
        "database_file": "GeoLite2-ASN.mmdb",
        "field": "source.ip",
        "target_field": "source.as",
        "properties": [
          "asn",
          "organization_name"
        ],
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "source.as.asn",
        "target_field": "source.as.number",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "source.as.organization_name",
        "target_field": "source.as.organization.name",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "event.kind",
        "value": "event"
      }
    },
    {
      "append": {
        "field": "event.category",
        "value": "web"
      }
    },
    {
      "append": {
        "field": "event.type",
        "value": "info"
      }
    },
    {
      "set": {
        "field": "event.outcome",
        "value": "success",
        "if": "ctx?.http?.response?.status_code != null && ctx.http.response.status_code < 400"
      }
    },
    {
      "set": {
        "field": "event.outcome",
        "value": "failure",
        "if": "ctx?.http?.response?.status_code != null && ctx.http.response.status_code >= 400"
      }
    },
    {
      "lowercase": {
        "field": "http.request.method",
        "ignore_missing": true
      }
    },
    {
      "append": {
        "field": "related.ip",
        "value": "{{source.ip}}",
        "if": "ctx?.source?.ip != null"
      }
    },
    {
      "append": {
        "field": "related.ip",
        "value": "{{destination.ip}}",
        "if": "ctx?.destination?.ip != null"
      }
    },
    {
      "append": {
        "field": "related.user",
        "value": "{{user.name}}",
        "if": "ctx?.user?.name != null"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "error.message",
        "value": "{{ _ingest.on_failure_message }}"
      }
    }
  ]
}
'
  1. Open the ingest pipelines UI in the management section and manage the nginx pipeline
  2. Edit the first processor (which should be a set processor). Change the field value to anything, e.g., "foobar"
  3. Save the processor
  4. Click on the "Show request" link at the bottom of the UI, the first processor should still have the "media_type" field
  5. Save the pipeline
  6. Manage it again and see that the media_type value is still present on the set processor

Notes

These changes ensure that unknown values in processor options are preserved. We only check for unknown entries at the top-level of the options object; the assumption is that options configuration with nested values do not exist.

That is:

{
  processors: [
    { processor_1: { checked: 'test' } },
    { processor_2: { checked: 'test', checked_2: { not_checked: 'test' } } },
  ]
}

Release note

Fixes a small form validation bug that required an input for the set processor value field and gsub processor replacement field. Both fields now accept an empty value to either set a value to an empty string in the case of the set processor and remove any matched text in the case of the gsub processor.

Checklist

Delete any items that are not applicable to this PR.

@jloleysens jloleysens added v8.0.0 Team:Kibana Management Dev Tools, Index Management, Upgrade Assistant, ILM, Ingest Node Pipelines, and more release_note:skip Skip the PR/issue when compiling release notes Feature:Ingest Node Pipelines Ingest node pipelines management v7.12.0 labels Feb 11, 2021
@jloleysens jloleysens requested a review from a team as a code owner February 11, 2021 15:20
@elasticmachine
Copy link
Contributor

Pinging @elastic/es-ui (Team:Elasticsearch UI)

@jloleysens jloleysens changed the title [Ingest Pipeliens] Preserve unknown fields in processors [Ingest Pipelines] Preserve unknown fields in processors Feb 11, 2021
Copy link
Contributor

@alisonelizabeth alisonelizabeth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. Verified locally. Thanks for the quick turnaround on this!

@alisonelizabeth
Copy link
Contributor

@elasticmachine merge upstream

@jloleysens
Copy link
Contributor Author

@elasticmachine merge upstream

@jloleysens
Copy link
Contributor Author

jloleysens commented Feb 15, 2021

@alisonelizabeth , I've realised that this implementation is based on the incorrect premise that fields without values willl still be included in the processor form output - by default the form lib will strip them. The misunderstanding snuck in for me with the undefinedIfValue serializer that exists on many forms outputting field_name: undefined.

I will update this implementation to be more robust.

* This is important for allowing configuration of empty text fields in certain processors that
* remove values from their inputs.
*/
stripEmptyFields: false,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite an important change for this fix, it also means we need to add the appropriate serializer for all fields. I tested all forms locally and the LGTM. But it would be good to check them for any unexpected resulting "foo": "" entries.

Copy link
Contributor

@alisonelizabeth alisonelizabeth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jloleysens I went through and retested all the processors. I found two potential issues -

I think the serializer needs to be fixed for the regex_file option in the user agent processor. If I do not provide a value for the field, the request looks like:

    {
      "user_agent": {
        "field": "sdf",
        "regex_file": ""
      }
    }

I also noticed for the foreach processor, I am able to save the form without specifying a processor and the request retains the {} as the value, which results in an error from ES.

     {
      "foreach": {
        "field": "asdf",
        "processor": {}
      }
    }

@@ -50,15 +51,6 @@ const fieldsConfig: FieldsConfig = {
'xpack.ingestPipelines.pipelineEditor.gsubForm.replacementFieldHelpText',
{ defaultMessage: 'Replacement text for matches.' }
),
validations: [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add some help text indicating that leaving the input blank will default to an empty string value.

@@ -81,7 +81,7 @@ const fieldsConfig: FieldsConfig = {
lang: {
type: FIELD_TYPES.TEXT,
deserializer: String,
serializer: from.undefinedIfValue('painless'),
serializer: (v: unknown) => (v === 'painless' || v === '' ? undefined : v),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] I know this wasn't introduced in this PR - but I noticed that if I set the lang to "painless" via the ES API, it does retain the value. I think it could be potentially confusing to a user if they add painless to the text field in the UI and then it is stripped from the request.

@@ -35,15 +28,6 @@ const fieldsConfig: FieldsConfig = {
helpText: i18n.translate('xpack.ingestPipelines.pipelineEditor.setForm.valueFieldHelpText', {
defaultMessage: 'Value for the field.',
}),
validations: [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about possibly adding some help text

@jloleysens
Copy link
Contributor Author

@elasticmachine merge upstream

@jloleysens
Copy link
Contributor Author

Thanks for the review @alisonelizabeth ! I think I have addressed all of your feedback, would you mind taking another look?

I also noticed for the foreach processor, I am able to save the form without specifying a processor and the request retains the {} as the value, which results in an error from ES.

I'm going to hold off on this change for now since it is pre-exsting behaviour, I think the error response from ES is also very helpful in guiding users. Let me know what you think!

Copy link
Contributor

@alisonelizabeth alisonelizabeth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest code LGTM. Did not test again. I'm fine with holding off on fixing the foreach behavior.

@jloleysens
Copy link
Contributor Author

@elasticmachine merge upstream

@jloleysens jloleysens merged commit edc11d9 into elastic:master Feb 17, 2021
@jloleysens jloleysens added release_note:fix and removed release_note:skip Skip the PR/issue when compiling release notes labels Feb 17, 2021
@jloleysens jloleysens deleted the ingest-pipelines/fix-preserve-unknown-fields-in-processors branch February 17, 2021 11:59
jloleysens added a commit to jloleysens/kibana that referenced this pull request Feb 17, 2021
* keep known and unknown options in processor config

* added test for preserving unknown fields

* refactor form for NOT stripping empty field values, also allow empty "value" for set and gsub

* remove unused i18n

* fix user agent form serialization, update field help text

* remove out of date translation

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
ingestPipelines 722.7KB 723.5KB +788.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

jloleysens added a commit that referenced this pull request Feb 17, 2021
…1630)

* keep known and unknown options in processor config

* added test for preserving unknown fields

* refactor form for NOT stripping empty field values, also allow empty "value" for set and gsub

* remove unused i18n

* fix user agent form serialization, update field help text

* remove out of date translation

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Ingest Node Pipelines Ingest node pipelines management release_note:fix Team:Kibana Management Dev Tools, Index Management, Upgrade Assistant, ILM, Ingest Node Pipelines, and more v7.12.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[IngestPipelines] Allow empty replacement value for Gsub processor
4 participants