Journal: A "bad message" error results in an infinite hot-loop

**Description**
The systemd journal library may at times follow logs that it cannot process. When that happens, `promtail` enters an infinite loop that will produce the following error messages as fast as it can (with newlines for claritry):

    Jul 23 08:24:35 host promtail[425]:
    level=error
    ts=2021-07-23T06:24:35.747004119Z
    caller=journaltarget.go:185
    msg="received error during sdjournal follow"
    err="failed to iterate journal: bad message"

This is likely related to #2928, which introduced a loop over the journal follow logic.

**Environment**
Systemd 241 and 245

    promtail, version 2.2.1 (branch: HEAD, revision: babea82e)
    build user:       root@e2d295b84e26
    build date:       2021-04-06T00:52:41Z
    go version:       go1.15.3
    platform:         linux/amd64

**Theory**
The intention of #2928 is to skip log entries when "bad message" occurs, but from my study of the source code involved, this is likely not what's happening:

The error clearly bubbles out of `Follow` in the following snippet:
https://github.com/grafana/loki/blob/b4086dfe3350aecf907a5fb6e9f1ce2c0a638d29/clients/pkg/promtail/targets/journal/journaltarget.go#L181-L187

The error is logged and `Follow` is called again. For it to produce a differing result, the current line would have to be skipped at some point, or we are likely reading the same result over and over again (which is what we are observing).

Inside `Follow` the `Read` function is called, which is the function that advances the current cursor here:
https://github.com/grafana/loki/blob/babea82ef558c4b4dc685fd4ca677de1f0a7ee8f/vendor/github.com/coreos/go-systemd/sdjournal/read.go#L138-L147

Internally, this calls `sd_journal_next`, which in turn calls `real_journal_next`:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L816

Towards the end of that function, `journal_file_move_to_object` is called. This is where `EBADMSG` comes from! Unfortunately, reading a bad message does not advance the cursor, as an error forces to function to exit before that:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L861-L867

**Expected Behavior**
Ideally we would like to skip bad messages, but I'm not sure this is possible, as the skip functions systemd offers are really just "next" calls with a counter. It might make sense to treat these errors like end-of-file errors, since this seems to be what `journald` is doing internally:

https://unix.stackexchange.com/a/87862/413524

	err := t.r.Follow(until, ioutil.Discard)
	if err != nil {
	if err == sdjournal.ErrExpired \|\| err == io.EOF {
	return
	}
	level.Error(t.logger).Log("msg", "received error during sdjournal follow", "err", err.Error())
	}

	func (r *JournalReader) Read(b []byte) (int, error) {
	if r.msgReader == nil {
	// Advance the journal cursor. It has to be called at least one time
	// before reading
	c, err := r.journal.Next()

	// An unexpected error
	if err != nil {
	return 0, err
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Journal: A "bad message" error results in an infinite hot-loop #4053

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Journal: A "bad message" error results in an infinite hot-loop #4053

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions