Skip to content

Journal: A "bad message" error results in an infinite hot-loop #4053

Closed
@href

Description

@href

Description
The systemd journal library may at times follow logs that it cannot process. When that happens, promtail enters an infinite loop that will produce the following error messages as fast as it can (with newlines for claritry):

Jul 23 08:24:35 host promtail[425]:
level=error
ts=2021-07-23T06:24:35.747004119Z
caller=journaltarget.go:185
msg="received error during sdjournal follow"
err="failed to iterate journal: bad message"

This is likely related to #2928, which introduced a loop over the journal follow logic.

Environment
Systemd 241 and 245

promtail, version 2.2.1 (branch: HEAD, revision: babea82e)
build user:       root@e2d295b84e26
build date:       2021-04-06T00:52:41Z
go version:       go1.15.3
platform:         linux/amd64

Theory
The intention of #2928 is to skip log entries when "bad message" occurs, but from my study of the source code involved, this is likely not what's happening:

The error clearly bubbles out of Follow in the following snippet:

err := t.r.Follow(until, ioutil.Discard)
if err != nil {
if err == sdjournal.ErrExpired || err == io.EOF {
return
}
level.Error(t.logger).Log("msg", "received error during sdjournal follow", "err", err.Error())
}

The error is logged and Follow is called again. For it to produce a differing result, the current line would have to be skipped at some point, or we are likely reading the same result over and over again (which is what we are observing).

Inside Follow the Read function is called, which is the function that advances the current cursor here:

func (r *JournalReader) Read(b []byte) (int, error) {
if r.msgReader == nil {
// Advance the journal cursor. It has to be called at least one time
// before reading
c, err := r.journal.Next()
// An unexpected error
if err != nil {
return 0, err
}

Internally, this calls sd_journal_next, which in turn calls real_journal_next:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L816

Towards the end of that function, journal_file_move_to_object is called. This is where EBADMSG comes from! Unfortunately, reading a bad message does not advance the cursor, as an error forces to function to exit before that:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L861-L867

Expected Behavior
Ideally we would like to skip bad messages, but I'm not sure this is possible, as the skip functions systemd offers are really just "next" calls with a counter. It might make sense to treat these errors like end-of-file errors, since this seems to be what journald is doing internally:

https://unix.stackexchange.com/a/87862/413524

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions