datadog_logs sink incorrectly moves the host attribute into the hostname attribute #17147

Open
mrzor opened this issue Apr 13, 2023 · 10 comments
Labels
  • meta: regression (This issue represents a regression)
  • sink: datadog_logs (Anything `datadog_logs` sink related)
  • type: bug (A code related bug.)

Comments

@mrzor

mrzor commented Apr 13, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

After upgrading to 0.28.2, we noticed our EC2 metadata was no longer reconciled after Datadog intake: the hostname attribute contained the host IP (which should live in the host attribute) and the host attribute was missing. By adding a mitmproxy between Vector and the Datadog intake, we confirmed that the logs leaving Vector already exhibited this defect, so it was unrelated to the JSON attribute preprocessing that happens after intake.

Configuration

In our setup, events have both host and hostname attributes set before reaching Vector: the host attribute holds the host IP address and the hostname attribute holds the EC2 instance ID. On Datadog's side, the JSON attribute preprocessing for host is set to hostname. Having the instance ID recognized as host by Datadog is central to having EC2 metadata reconciled past the Datadog intake, and given our setup this requires it to be in the hostname attribute - a bit convoluted, but that's how things were set up a while back.
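
For illustration (all values hypothetical), an event entering Vector in our setup looks roughly like:

{
  "host": "10.1.2.3",
  "hostname": "i-0123456789abcdef0",
  "message": "...",
  "service": "our-service"
}

where host carries the IP address and hostname carries the EC2 instance ID that Datadog's JSON preprocessing maps to its host reserved attribute.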

Version

0.28.2

Possible solutions

We believe the change from host to hostname in faa96f8 had this unforeseen consequence. Whereas Vector previously "renamed" the host attribute to host, leaving the hostname attribute intact, the change now causes our hostname attribute to be overwritten by the value of the host attribute, and the host attribute itself to be dropped. We believe the best course of action is for this part of the change to be reverted.
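
To make this concrete, as we understand the change (values hypothetical), for an event carrying both attributes:

  before faa96f8, the payload left Vector with both attributes intact:
    { "host": "10.1.2.3", "hostname": "i-0123456789abcdef0", ... }

  after faa96f8, the host value is moved onto hostname and the original hostname value is lost:
    { "hostname": "10.1.2.3", ... }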

We could alter our configuration in a number of ways; notably, we could remap the hostname attribute to hostname_ and use Datadog's JSON preprocessing facilities to achieve the EC2 metadata reconciliation (see the sketch below). We would also need to set the host_key to something other than host to dodge the unwanted renaming.
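
A rough sketch of that alternative as a VRL remap (the hostname_ name is just an example, not a recommendation):

# Move hostname out of the way so the sink's renaming cannot clobber it;
# Datadog's JSON preprocessing would then be pointed at hostname_ instead.
.hostname_ = del(.hostname)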

@mrzor mrzor added the type: bug A code related bug. label Apr 13, 2023
@mrzor mrzor changed the title datadog_logs issue with host/hostname attributes datadog_logs sink incorrectly moves the host attribute into the hostname attribute Apr 13, 2023
@neuronull
Contributor

Hello,

Indeed, it looks like we overlooked the impact this change would have on users: we failed to mention this change in behavior in the v0.28.0 upgrade guide (where this change was first introduced). I will open a PR to add this note to the upgrade guide.

However, as to correctness, the change was made in order to adhere to the Datadog Agent spec (https://docs.datadoghq.com/api/latest/logs/), making it in line with the requirements for this sink.

@mrzor
Author

mrzor commented Apr 15, 2023

Hello!

I believe the linked API documentation is by no means prescriptive as to how events should be submitted to Datadog.
The canonical attribute is elsewhere defined as host, notably in the reserved attributes list. It resolves (by default) to either the host, hostname, or syslog.hostname attribute, according to the log pipelines documentation. It can also resolve to any other attribute configured through the "JSON Preprocessing for logs" step of the log pipelines.

The schema you linked seems incomplete. I see it as a valid example, but not as a rigid prescription. The 6 reserved attributes tell a different story, where host is the expected value. The trace_id attribute is part of the reserved attributes but absent from the API example, highlighting that the API example is not exhaustive, and perhaps making the point that it is not prescriptive either.

I did run a test submitting an event to the intake with the host attribute but without the hostname attribute:

$ curl -vvv -X POST "https://http-intake.logs.datadoghq.eu/api/v2/logs" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: ${DD_API_KEY}" \
-d @- << EOF
{
  "ddsource": "zor",
  "ddtags": "env:ist,owner:zor",
  "host": "i-012345678",
  "message": "2019-11-19T14:37:58,995 INFO [process.name][20081] Hello World",
  "service": "zorism"
}
EOF

The HTTP response was 202, and I found the event in the log explorer shortly thereafter:

[Screenshot: the event appears in the log explorer without issues.]


The forced attribute remapping introduced in 0.28 is a real regression in our environment because we lose EC2 metadata in addition to losing the information previously contained in the hostname attribute.

Should you maintain the stance that the behavior introduced in 0.28 is desirable, would you agree to provide an opt-out from this forced attribute renaming?

@spencergilbert
Contributor

Sorry for the delay here; the issue got dropped when we were rotating our GitHub responder.

I spoke with @neuronull today and we're in agreement about reverting this change. I'm not sure if we'll cut a patch release or backport this to 0.28.0; I'll update you once that's decided.

Additionally, I'll open an issue to track collaborating with the logs intake team at Datadog to ensure we have a better contract and don't cause these sorts of regressions going forward.

@jszwedko
Member

jszwedko commented May 1, 2023

Hi @mrzor !

We had some discussion today about this issue and decided that we feel the new behavior is more consistent with other sinks (and internally consistent within the datadog_logs sink), and thus more likely to be expected by Vector users, despite unfortunately causing a regression in your use case. As an example, if you had log_schema.timestamp_key set to something other than its default of timestamp, say time, the same overwriting of any timestamp field that happens to be in the event would happen in the datadog_logs sink.
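
To make that analogy concrete (hypothetical values): with log_schema.timestamp_key = "time", an event such as

  { "time": "2023-05-01T12:00:00Z", "timestamp": "application-provided value", "message": "..." }

would have the time value written into the payload's timestamp field, overwriting whatever timestamp field was already present in the event.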

I realize it causes suboptimal behavior in your case, though, where you don't want to overwrite the hostname already present in the log event; I just think any updates to that encoding behavior should come through a new "guiding principle" for sink encoding, to maintain consistency within Vector, rather than having the datadog_logs sink implement special behavior for the host field. An example of a new principle might be: map fields in sinks to their expected locations, but don't overwrite fields that are already present in the log event. Or perhaps we could allow configuration of the host_key at the sink level so that you could set it to hostname?

@mrzor
Author

mrzor commented May 3, 2023

Hi Jesse!

Thanks for the update and the reignited activity on this topic. I'll do my best to highlight the reasons why I think a rollback is the way to go, alongside some suggestions.

The references to the new log schemas in the documentation remain quite unclear to me, so I find it hard to articulate something useful in that context. I can, however, attempt to answer the direct questions you put forward.

if you had log_schema.timestamp_key set to something other than it's default of timestamp, say time, the same overwriting of any timestamp field that happens to be in the event would happen in the datadog_logs sink.

This is half of the issue. The other half is that the original key is moved and not duplicated. So not only is something overwritten, but a key-value pair also gets mauled (in my case, host).

map fields in sinks to their expected locations, but don't overwrite fields if they are already present in the log event.

If my understanding of the guideline is correct, having expected locations for anything in the first place seems in contradiction with "Vector does not want to be in the business of maintaining log schemas. An exercise that has proven precarious with the proliferation of log schema standards.".

An exception can be made for the timestamp field, but this is also a can of worms: timezone issues and clock drifts are the issues we run into most often.

While most (all?) log systems require timestamps to be meaningful, should Vector be opinionated in that area? Why shouldn't it be left to configuration rather than default behavior? This is the position the guideline states, if I understand it correctly.

Anecdotally, I believe it is valid to send non-timestamped logs to Datadog, in which case they'll be timestamped at intake time. This also happens when the timestamp is outright invalid. A few classes of pathological timestamps (like way in the future or way in the past) result in rejection.

Or perhaps we could allow configuration of the host_key at the sink level so that you could set it to hostname?

When a specific sink adds requirements on the log payload, such as a mandatory host_key, then perhaps the existing schema infrastructure can be leveraged to produce warnings or even configuration errors, forcing the user to produce a configuration that respects those requirements. Warnings feel better because they empower the user to have the last word, which feels alright to me. For instance, let's say you have a UnicornLogs sink that requires the rainbow attribute to be set. Maybe I'm piping it to something that implements UnicornLogs without such a requirement, in which case normalizing it for me, or forcing me to add it, is actively adverse to the intent.


In the context of this specific issue, it feels like the unexpected behavior change stems from a good intention, namely to conform to Datadog's log intake specifications. However, the API docs have been taken as normative where they are merely advisory - I'm confident you will find someone internally who will confirm this, possibly @gus on the public Slack.

If they are indeed advisory, then maybe the Vector Datadog sink should limit its actions to advice as well? That would keep it in tune with the service it represents.

@spencergilbert
Contributor

spencergilbert commented May 4, 2023

Just wanted to drop a quick update, without any actual decision from our end, @mrzor: according to the team that owns the API, the documented body (with hostname) is the canonical use of the API. That said, given how the API works (and, as you pointed out, timestamp not being required...), it accepts quite a range of inputs.

I'm writing up some thoughts and proposals to discuss with the team in the near future.

@spencergilbert
Contributor

spencergilbert commented May 8, 2023

Dropping some additional notes as I do research here: the datadogexporter for OpenTelemetry is in a similar position of converting arbitrary logs into a payload for the Datadog Logs API.

They're following a similar pattern of looking for the usual attributes and renaming them to what the API expects (they also rename to hostname).

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/datadogexporter/internal/logs/translator.go#L68
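
A rough VRL sketch of that fallback-style rename, purely illustrative (neither implementation is written this way):

# Look for the usual attribute and rename it to what the API expects,
# without clobbering a hostname that is already present.
if !exists(.hostname) && exists(.host) {
  .hostname = del(.host)
}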

@mrzor
Author

mrzor commented Jun 7, 2023

Hello!

We are biting the bullet and going along with the change - being stuck on 0.27 is painful. We like our software up to date :)

Here's what we're rolling with - it keeps the AWS integration working, and the data we used to store in .host is now in .host_:

# Move our host attribute out of the way
# See below.
.host_ = del(.host)

# Copy the hostname attribute into the host attribute
# This works around https://github.com/vectordotdev/vector/issues/17147
# Vector is going to erase the hostname attribute with the value of the host
# attribute. To avoid this in our case, we set them to the same value.
# But to keep our instance IP or hostname, we saved our incoming host attribute
# to host_.
# All monitors and dashboards relying on the host attribute had to be updated as
# a result.
.host = .hostname

@jszwedko
Member

Thanks for the update @mrzor ! We are still going back-and-forth on the expected behavior here 😓

@spencergilbert
Contributor

Recapping some info for future readers. Today we do something similar to the OpenTelemetry datadogexporter, in which we look for "expected" fields to build the payload (Transform). We could follow this pattern more closely and use additional fallbacks to stay more aligned with the datadogexporter.

The logs API itself will generally accept any object and render it best effort for the frontend, and users can configure their JSON Preprocessing to target specific fields for Datadog attributes.

Longer term we could force the handling of this translation via semantic meaning.
