ReadonlyFilesystem Condition with Auto-Recovery Detection#1234

Open
CharudathGopal wants to merge 1 commit into kubernetes:master from CharudathGopal:readonly-filesystem-revert

Conversation

@CharudathGopal

@CharudathGopal CharudathGopal commented Feb 12, 2026

ISSUE: #474

Problem Statement
The existing NPD SystemLogMonitor for ReadonlyFilesystem is a one-way trigger: it sets the condition to True when it detects Remounting filesystem read-only in kernel logs, but never clears it. The condition stays stuck until the NPD pod restarts or the node reboots, causing some Kubernetes platforms to delete the node.

Solution: CustomPluginMonitor with Active Recovery Detection

This PR introduces a CustomPluginMonitor (check_ro_filesystem.sh) that actively monitors both detection and recovery of read-only filesystems.

Key Features

  1. NPD Design Adherence & Detection (5-minute lookback)

    • Mimics NPD's SystemLogMonitor by scanning /dev/kmsg for "Remounting filesystem read-only" messages
    • Uses 5-minute lookback window (300 seconds) calculated from /proc/uptime
    • Filters kernel messages by timestamp to match NPD's behavior
    • Extracts device names from kernel messages (e.g., pxd!pxd208624217308167071, dm-3, sda1)
    • Checks if extracted devices are currently mounted read-only in /host/proc/1/mounts
    • If any device is RO: Exit 1 → Condition = True
  2. Recovery Check (all-time lookback)

    • If no recent errors found, performs second scan of /dev/kmsg with no time limit
    • Retrieves ALL historical "remounting filesystem read-only" messages since boot
    • Checks if ANY historically-problematic devices are still mounted read-only
    • If all devices recovered or unmounted: Exit 0 → Condition = False
    • Enables automatic recovery detection - the key differentiator from SystemLogMonitor
  3. Targeted Approach (Avoids False Positives)

    • Only monitors devices that kernel explicitly reported as having errors
    • Does NOT scan all mounts in /proc/mounts (which would flag legitimate read-only mounts like /boot, CD-ROMs, ConfigMaps)
    • More accurate and simpler than filesystem-wide scanning
    • Prevents false positives from intentionally read-only filesystems
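The core checks described above can be sketched as a few small shell helpers. This is a minimal illustration, not the actual check_ro_filesystem.sh from this PR; the function names, the remount-message format, and the mounts-table path are assumptions:

```shell
#!/usr/bin/env bash
# Illustrative helpers only; names and formats are assumed, not taken
# from the PR's actual check_ro_filesystem.sh.

# /dev/kmsg records carry a timestamp in microseconds since boot, while
# /proc/uptime's first field is seconds since boot. Cutoff for an
# N-second lookback window:
lookback_cutoff_us() {
  uptime_s="${1%%.*}"   # first field of /proc/uptime, fraction stripped
  window_s="$2"
  echo $(( (uptime_s - window_s) * 1000000 ))
}

# Pull the device name out of a kernel remount message, e.g.
# "EXT4-fs (dm-3): Remounting filesystem read-only" -> "dm-3".
extract_device() {
  printf '%s\n' "$1" |
    sed -n 's/.*(\([^)]*\)): Remounting filesystem read-only.*/\1/p'
}

# Exit 0 if the device appears mounted read-only in a /proc-style
# mounts table (e.g. /host/proc/1/mounts).
is_mounted_ro() {
  dev="$1" mounts="$2"
  awk -v d="$dev" '$1 ~ d && $4 ~ /(^|,)ro(,|$)/ { found=1 }
                   END { exit !found }' "$mounts"
}
```

A plugin built from pieces like these would exit 1 (condition True) when any device named in a recent remount message is still mounted read-only, and exit 0 (condition False) once every historically flagged device has recovered or been unmounted.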

Deployment Changes

ReadonlyFilesystem CustomPluginMonitor is not enabled by default; users can enable this feature by following the steps documented in docs/readonly-recovery-plugin-monitor.md

  • Adds /host/proc volume mount to access host mount table
  • Replaces --config.system-log-monitor=/config/readonly-monitor.json with --config.custom-plugin-monitor=/config/readonly-recovery-plugin-monitor.json
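For reference, the flag swap amounts to the following (illustrative invocation; only the flag values come from this PR's description):

```shell
# Before: one-way SystemLogMonitor
#   node-problem-detector --config.system-log-monitor=/config/readonly-monitor.json
# After: CustomPluginMonitor with recovery detection
#   node-problem-detector --config.custom-plugin-monitor=/config/readonly-recovery-plugin-monitor.json
```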

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 12, 2026
@k8s-ci-robot
Contributor

Welcome @CharudathGopal!

It looks like this is your first PR to kubernetes/node-problem-detector 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/node-problem-detector has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 12, 2026
@k8s-ci-robot
Contributor

Hi @CharudathGopal. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CharudathGopal
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 12, 2026
@CharudathGopal
Author

@Random-Liu @sjenning Can you please take a look at this PR?

@hakman
Member

hakman commented Feb 16, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 16, 2026
@CharudathGopal
Author

/cc @Random-Liu @wangzhen127 @hakman @andyxning Tests have passed; appreciate any help in reviewing this change.

@CharudathGopal
Author

Hello @hakman,

This feature is important for the Portworx Enterprise product, a comprehensive, software-defined, persistent container storage and data management platform designed for Kubernetes-based applications across on-premises, hybrid, and multi-cloud environments.

The changes in this PR can help multiple platforms, like Anthos and Gardener, operate seamlessly with Portworx.

Appreciate any help with reviewing this change.

@hakman
Member

hakman commented Feb 28, 2026

Hi @CharudathGopal — thanks for putting this together, and for the detailed work on it. I went through the PR, and I had to spend some time understanding the problem statement and the operational impact of the change.

Here’s my feedback:

  1. Split ownership of a single condition across monitors
    Having one monitor set a condition and another clear it can be race-prone in NPD (for a given condition type, the latest update wins). I’d strongly prefer that a single monitor owns both the assert and clear transitions for ReadonlyFilesystem, so the behavior is deterministic.

  2. Unreliable signal from /dev/kmsg
    The script treats “no remount messages found” as recovery. Since /dev/kmsg is a rotating ring buffer, older remount events can disappear even while the filesystem is still mounted read-only. That could lead to clearing ReadonlyFilesystem incorrectly.
    More generally, plugins ideally shouldn’t tail /dev/kmsg, and should rely on stronger signals (e.g., explicitly tracking and checking mount state / filesystem status).

  3. Unreadable /dev/kmsg reported as healthy
    If /dev/kmsg is unreadable, the script currently returns OK, which could clear an active ReadonlyFilesystem=True condition without evidence of recovery. I think this should be reported as Unknown / non-OK (or otherwise avoid clearing) rather than “healthy”.

  4. Complexity / maintainability (Bash + awk)
    The plugin logic is fairly large and not trivial to maintain/debug. It might be possible to simplify the logic and reduce the amount of parsing/state handling.

That said, NPD is modular. If this approach is working well for you and your users/customers as-is, I don't see any reason to not use it. I do agree it would be valuable to have this capability upstream, but I think the implementation details need a bit more discussion before we merge something like this.

@wangzhen127 @SergeyKanzhelev — curious if you have thoughts on the right approach here (especially around the signal/source of truth and condition ownership).

@CharudathGopal
Author

Hello @hakman,

Thanks for the detailed review!

Before answering your concerns, I'd like to mention that I have tried to keep the design as close as possible to the existing SystemLogMonitor (readonly-monitor.json).

1. Condition Ownership
We will explicitly ask customers to disable the existing SystemLogMonitor (readonly-monitor.json) when deploying our CustomPluginMonitor, so only one monitor manages the condition at a time and no race occurs.

2. /dev/kmsg Ring Buffer Limitation
Acknowledged. However, we're following the same principle as the existing readonly-monitor.json; both rely on /dev/kmsg as the source of truth. The difference is that we add recovery detection on top of the same signal.
The ring buffer issue exists in the current implementation too. Our approach: if the kernel message has disappeared AND the device is no longer mounted RO (or has been unmounted), it's reasonable to clear the condition. This matches real-world recovery scenarios (Portworx unmounts failed volumes).

Even with the existing SystemLogMonitor (readonly-monitor.json), when the pod is restarted and the ring buffer no longer contains the error message, the node condition is reset to false.

3. Unreadable /dev/kmsg = OK
This matches existing NPD behavior. When /dev/kmsg is unreadable:
• Existing SystemLogMonitor: kmsgparser.NewParser() fails → Watch() returns error → monitor is skipped → no condition reported
• Our plugin: /dev/kmsg unreadable → exit OK → no condition reported
Both approaches avoid false alarms. This is critical because some Kubernetes platforms (e.g., certain cloud providers) delete nodes when ReadonlyFilesystem=True is reported. Better to not report anything than risk node deletion on a monitoring failure.

4. Complexity
The complexity stems from handling diverse device naming schemes in production environments:
• Portworx uses pxd!pxd123, dm-X, /dev/mapper/pxd!pxd123, /dev/pxd/pxd123
• Device-mapper symlinks require udevadm resolution
• WWID-based devices, multipath devices, etc.

Initial approach: Scan all /proc/mounts for read-only filesystems.
Problem: Required extensive device name normalization and still produced false positives (legitimate RO mounts like /boot, ConfigMaps, CD-ROMs).

Current approach: Only check devices that kernel explicitly flagged in error messages. More targeted, fewer false positives, but requires robust path resolution.
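As an illustration of that path resolution (a hypothetical helper, not the PR's actual code): the kernel encodes '/' as '!' in the device names it logs, so a token like pxd!pxd123 must be translated back to a path before it can be matched against mount tables.

```shell
#!/usr/bin/env bash
# Hypothetical normalizer; assumes GNU readlink and Linux '!' encoding.
canonical_dev() {
  dev=$(printf '%s' "$1" | tr '!' '/')   # kernel '!' -> path '/'
  # -m resolves symlinks (e.g. /dev/mapper/*) while tolerating missing
  # components, since a failed device may already have been removed.
  readlink -m "/dev/$dev"
}
```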

I fully agree that this needs an upstream solution, but getting this change in helps many customers and multiple Kubernetes platform providers use it.

@CharudathGopal
Author

@hakman FYI, forgot to mention earlier.

The doc I added already describes disabling readonly-monitor.json before using the CustomPluginMonitor in this PR.

@SergeyKanzhelev @wangzhen127 Appreciate if you review the change.

