Working fluentd stopped following tail for one file #3800

Closed
bjg2 opened this issue Jun 27, 2022 · 2 comments


bjg2 commented Jun 27, 2022

Describe the bug

We're using fluentd in Kubernetes to tail all container logs and send them to a MongoDB database. Generally, it works as expected.

After running for a long time, fluentd randomly stops tailing one file, and because of that we lost important log data for more than a day. Fluentd instances on other nodes worked as expected, and the affected node worked as expected for everything except that one tailed file. This is not the first time it has happened (we last noticed it on May 6th).

The tailed file is the log file of a pod that produces a huge amount of logs. Prior to the issue, the pod had been running for 3-4 days without interruption and without any logging problems. The last log that came through from that file was at "2022-06-25T19:17:35.751Z". Before that moment, every rotation of the file produced a pair of fluentd log lines: detected rotation + following tail. After it, every rotation produced only: detected rotation.

Before (the last occurrence, and every one before it):

2022-06-25T19:13:05.741980613Z stdout F 2022-06-25 19:13:05 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log; waiting 5 seconds
2022-06-25T19:13:05.743587912Z stdout F 2022-06-25 19:13:05 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log

After (the first occurrence, and every one after it):

2022-06-25T19:17:36.744586203Z stdout F 2022-06-25 19:17:36 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log; waiting 5 seconds

Note that the file at /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log looked healthy and kept receiving logs (it was rotated roughly once every ~12 minutes). We mitigated by restarting that api pod - that changed the pod name and therefore the file name, and fluentd followed the new file fine afterwards. The problem persisted from that point in time until the mitigation, and only for that file. There was no need to restart fluentd.
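
For reference, one check I could have done while the issue was live (hypothetical here, since we had already mitigated): look at the entry for that path in /var/log/fluentd.log.pos. As far as I understand fluentd's position file format, each line is path, byte offset, and inode as tab-separated hex fields, and an entry that in_tail has given up on is marked with the "unwatched" position ffffffffffffffff, e.g. (values illustrative, not taken from the node):

/var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log	ffffffffffffffff	0000000000000000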

To Reproduce

Can't repro.

Expected behavior

Fluentd should not stop tailing the file.

Your Environment

- Fluentd version: fluentd 1.13.3
- TD Agent version: /
- Operating system: 
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.13.5
PRETTY_NAME="Alpine Linux v3.13"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"

- Kernel version: 4.19.0-17-cloud-amd64

(Using docker image fluent/fluentd:v1.13-1)

Your Configuration

Since we cannot reproduce the issue, I'm including the whole config (the db connection string is substituted at deploy time).

# Inputs from container logs
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>

# Merge logs split into multiple lines
<filter kubernetes.**>
  @type concat
  key message
  use_partial_cri_logtag true
  partial_cri_logtag_key logtag
  partial_cri_stream_key stream
  separator ""
</filter>

# Enriches records with Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Prettify kubernetes metadata
<filter kubernetes.**>
  @type record_transformer
  enable_ruby
  <record>
    nodeName ${record.dig("kubernetes", "host")}
    namespaceName ${record.dig("kubernetes", "namespace_name")}
    podName ${record.dig("kubernetes", "pod_name")}
    containerName ${record.dig("kubernetes", "container_name")}
    containerImage ${record.dig("kubernetes", "container_image")}
  </record>
  remove_keys docker,kubernetes
</filter>
 
# Expands inner json
<filter kubernetes.**>
  @type parser
  format json
  key_name message
  reserve_data true
  remove_key_name_field true
  emit_invalid_record_to_error false
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  time_key time
  keep_time_key
</filter>

# Mongodb keys should not have dollar or a dot inside
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \$ [dollar]
</filter>

# Mongodb keys should not have dollar or a dot inside
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \. [dot]
</filter>

# Outputs to log db
<match kubernetes.**>
  @type mongo

  connection_string "#{ENV['MONGO_ANALYTICS_DB_HOST']}"
  collection logs

  <buffer>
    flush_thread_count 8
    flush_interval 1s
  </buffer>
</match>
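
For completeness, and not part of the configuration we actually run: a sketch of the same in_tail source with the rotation-related options available in this fluentd version (follow_inodes, rotate_wait, enable_stat_watcher) made explicit. Whether any of these would prevent the stall is an assumption on my part, not something we verified.

<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head true
  # Track files by inode across rotation (available since fluentd 1.12).
  follow_inodes true
  # Seconds to keep reading a rotated file; the default is 5, which matches
  # the "waiting 5 seconds" log line quoted above.
  rotate_wait 5
  # Rely on the timer-based watcher instead of inotify.
  enable_stat_watcher false
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>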

Your Error Log

I'm not sure I can provide all of it - we have a ton of logs, and most of them were lost. The best log lines I could find are quoted in the 'Describe the bug' section above; ask if you need specifics.

Additional context

Running the docker image fluent/fluentd:v1.13-1 as a DaemonSet on Kubernetes 1.21 on DigitalOcean.


bjg2 commented Jul 8, 2022

Ping - any advice on this one? It happens unexpectedly from time to time and is very painful: it loses a lot of our logs and messes up our analytics...


ashie commented Jul 8, 2022

It seems to be the same issue as #3661, #3614, or #3586; let's continue the discussion there.
