Merge custom and core multi_fields array #982
Thanks @jonathan-buttner! Sorry for taking a bit for an initial review.
The use-case makes good sense, and I think this will be a good addition to the tooling. After testing out the changes, I did have a couple of notes.
scripts/schema/loader.py
Outdated
def dedup_and_merge_lists(list_a, list_b):
    list_a_set = array_of_dicts_to_set(list_a)
    list_b_set = array_of_dicts_to_set(list_b)
    return set_of_sets_to_array(list_a_set | list_b_set)
Minor issue I stumbled across while testing this out. Not sure it would be a blocker to merging, but worth noting the behavior.
The union will remove exact duplicate items:
> list_a_set
{frozenset({('name', 'text'), ('type', 'text')})}
> list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}
> list_a_set | list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}
But if the sets are not exact duplicates, it could lead to duplicate field names:
> list_a_set
{frozenset({('type', 'text'), ('name', 'text')})}
> list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'keyword'), ('name', 'text')})}
> list_a_set | list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'text'), ('name', 'text')}), frozenset({('type', 'keyword'), ('name', 'text')})}
schema include file:
---
- name: file
  title: File
  group: 2
  short: Fields describing files.
  description: >
    Custom file
  fields:
    - name: path
      multi_fields:
        - name: caseless
          type: keyword
          normalizer: lowercase
        - name: text
          type: keyword <= I imagine this would only happen by accident 😃
Resulting intermediate state:
multi_fields:
  - flat_name: file.path.caseless
    ignore_above: 1024
    name: caseless
    normalizer: lowercase
    type: keyword
  - flat_name: file.path.text
    ignore_above: 1024
    name: text
    type: keyword
  - flat_name: file.path.text
    name: text
    norms: false
    type: text
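The behavior above can be reproduced with a small standalone sketch. The bodies of array_of_dicts_to_set and set_of_sets_to_array are not shown in this thread, so the versions below are assumed reconstructions for illustration only; the point is that a set union only collapses entries whose every key and value match.

```python
def array_of_dicts_to_set(array_vals):
    # Assumed reconstruction: each dict becomes a hashable frozenset of (key, value) pairs.
    return {frozenset(d.items()) for d in array_vals}


def set_of_sets_to_array(set_vals):
    # Assumed reconstruction: turn each frozenset of pairs back into a dict.
    return [dict(pairs) for pairs in set_vals]


def dedup_and_merge_lists(list_a, list_b):
    return set_of_sets_to_array(
        array_of_dicts_to_set(list_a) | array_of_dicts_to_set(list_b))


core = [{'name': 'text', 'type': 'text'}]
custom = [
    {'name': 'caseless', 'type': 'keyword', 'normalizer': 'lowercase'},
    {'name': 'text', 'type': 'keyword'},  # same name, different definition
]

merged = dedup_and_merge_lists(core, custom)
names = sorted(mf['name'] for mf in merged)
print(names)  # 'text' appears twice because the two entries are not identical
```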
Oh good catch, what do we think the expected behavior should be in this scenario? I could put in a check to ensure that two fields with the same name don't exist in the resulting set and throw an error if they do? Or maybe just have core override?
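For reference, the "check and throw" option could look roughly like the sketch below. The function names find_duplicate_names and ensure_unique_names are hypothetical and do not exist in the ECS tooling; this only illustrates the idea.

```python
from collections import Counter


def find_duplicate_names(multi_fields):
    # Count how often each multi-field name occurs and report the repeats.
    counts = Counter(mf['name'] for mf in multi_fields)
    return sorted(name for name, n in counts.items() if n > 1)


def ensure_unique_names(multi_fields):
    # Hypothetical check: fail loudly instead of emitting duplicate names.
    dupes = find_duplicate_names(multi_fields)
    if dupes:
        raise ValueError('duplicate multi_field names: {}'.format(', '.join(dupes)))
    return multi_fields


intermediate = [
    {'name': 'caseless', 'type': 'keyword', 'normalizer': 'lowercase'},
    {'name': 'text', 'type': 'keyword'},
    {'name': 'text', 'type': 'text', 'norms': False},
]

try:
    ensure_unique_names(intermediate)
except ValueError as err:
    print(err)
```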
IMO we should dedupe on name and take the most recent definition in the case of dupes (this would allow for overrides).
@webmat do you have any thoughts? I recall back in #864, logic was removed from the tooling to allow --include supplied custom fields to be more permissive:
This means the tooling must now accept included files as they are, with all of the power this entails.
Perhaps we simply make sure to note that users need to be aware of introducing such duplicate fields?
I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.
The --include option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.
But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings. I want them to be able to replace the default user_agent.original.text multi-field with such a custom definition:
multi_fields:
  - name: text
    norms: false
    type: text
    normalizer: ua_normalizer
I think I have a preference for merging the pre-existing multi-field definitions of the same name, as this is more in line with how everything else is handled with custom fields. And it has the bonus of allowing a more terse custom definition:
- name: text
  normalizer: ua_normalizer
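To make the "merge" option concrete, here is a minimal sketch of key-level merging by name. The function name merge_multi_fields_by_name is hypothetical and this is not the tooling's implementation; it assumes every entry has a name key and that custom keys win over core keys.

```python
def merge_multi_fields_by_name(core, custom):
    # Index the core entries by name, copying so the inputs are not mutated.
    merged = {mf['name']: dict(mf) for mf in core}
    for mf in custom:
        # Same name: custom keys are layered on top of the core entry.
        # New name: the custom entry is simply added.
        merged.setdefault(mf['name'], {}).update(mf)
    return list(merged.values())


core = [{'name': 'text', 'type': 'text', 'norms': False}]
custom = [{'name': 'text', 'normalizer': 'ua_normalizer'}]  # the terse definition

result = merge_multi_fields_by_name(core, custom)
print(result)
```

With a "replace" strategy instead, the terse custom entry would drop type and norms, which is why merging permits the shorter definition.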
Thanks for doing this!
Thanks for submitting this, that's a good addition!
Side note: you're using this to add a .caseless multi-field, but with the coming of query param case_sensitive in 7.10, are you sure you need this multi-field?
In any case, this is a good addition, this will make adjustments to multi-fields much smoother.
@jonathan-buttner Is this still a need, or are you pursuing using the new
Sorry completely dropped the ball on this one. I've been trying to get some features done for 7.11. I think it'd be nice to still have this. I probably won't get to it until after feature freeze for 7.11 though, if that's ok. I don't think it's super important but would be nice to have it.
@ebeahan I think this PR is in a better spot now haha. I updated the description as well but I took the approach deduping based on the
Thanks for adjusting and overriding based on the multi-field name 👍
I have comments on how the tests are put together.
LGTM
Other than the one nit for the changelog, looks good! 👍
Thanks Mat and Eric!
* Merge custom and core multi_fields arrays (elastic#982) (elastic#1213) Co-authored-by: Jonathan Buttner <56361221+jonathan-buttner@users.noreply.github.com>
We'd like to introduce custom multi_fields definitions in the endpoint package's custom schema. An example of this is here: https://github.com/elastic/endpoint-package/pull/79/files#diff-7f0ee89a2e91f4b29aa03f75b80a16acR22-R26
Currently, the ECS scripts do not merge the multi_fields array but instead use the custom schema's definition after merging the included files. Since the custom schema's definition overwrites the core schema's definition, the custom schema must include any core multi_fields elements in its definition; otherwise they'll inadvertently be removed. The above example will result in the path.text field being removed: https://github.com/elastic/ecs/blob/master/schemas/file.yml#L62-L64
This PR adds functionality to merge the custom multi_fields array with the core one. The approach I took was to convert the list into a map so we can perform deduplication. The keys in the map come from the name field of the list entries (each of which is a map). The included custom schema will override the core schema if it defines a multi_field entry with the same name field.
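The list-to-map approach described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the function name dedup_multi_fields is hypothetical, and a same-named custom entry is assumed to replace the core entry wholesale.

```python
def dedup_multi_fields(core, custom):
    # Key each entry by its name; custom entries are visited last,
    # so a custom entry replaces a core entry with the same name.
    by_name = {}
    for mf in core + custom:
        by_name[mf['name']] = mf
    # dicts keep insertion order (Python 3.7+), so core entries keep their position.
    return list(by_name.values())


core = [{'name': 'text', 'type': 'text', 'norms': False}]
custom = [
    {'name': 'caseless', 'type': 'keyword', 'normalizer': 'lowercase'},
    {'name': 'text', 'type': 'keyword'},
]

merged = dedup_multi_fields(core, custom)
print([mf['name'] for mf in merged])
```

A custom schema that only adds caseless would leave the core text entry intact, which is the case the endpoint package needs.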