When a filesystem is temporarily unavailable, or during failover to another node, system calls may return errors such as ESTALE or EIO until it becomes available again - see for example GPFS.
This can happen when the FS daemon is down, or some other internal component of the FS or device is not responding.
It is not always possible to determine accurately whether this state is transient or long term, but the assumption is that it will be resolved sooner or later, either by automated failover to another node, automated recovery, or manual intervention.
NSFS does not identify these errors specifically; it treats them as general internal errors and therefore immediately returns InternalError to the S3 client request.
S3 clients vary in their retry options and configuration, but most clients will make a few retry attempts (3-5) with some exponential backoff.
For some S3 clients the retry attempts and strategy are easily configurable (e.g. the aws cli), but for other clients they are not, and on top of that it becomes the application's responsibility to adapt to the storage failure modes, which is error prone.
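For instance, the AWS CLI reads its retry behavior from the shared config file (or the AWS_RETRY_MODE / AWS_MAX_ATTEMPTS environment variables); the values below are only illustrative:

```ini
# ~/.aws/config
[default]
retry_mode = standard
max_attempts = 5
```

Other clients and SDKs may not expose equivalent knobs, which is exactly why pushing the burden onto each application does not scale.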
Even if the filesystem is able to recover or fail over, S3 clients might give up once their retries are exhausted.
Expected behavior
S3 endpoints should be able to hold the client request (as long as it has not timed out or been aborted by the client).
By holding the request, the client has a better chance of riding out the temporary unavailability.
NSFS filesystem calls should be wrapped with a retry_temporary_fs_errors function that detects temporary errors, keeps the request context alive, and retries the FS call with some backoff (see the sketch at the end of this section).
An important point is that this retry also has to detect if the client request was aborted.
We would need config options to control the maximum retry time/count, and to enable/disable this behavior to fit different systems.
We also need to be able to differentiate the temporary-unavailability errors from others; for example, ESTALE does not always represent a recoverable error, so this detection may differ between FS backends.
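Putting the requirements above together, here is a minimal sketch of what such a wrapper could look like. The function name retry_temporary_fs_errors follows the wording of this issue; the config keys, defaults, and the is_request_aborted callback are hypothetical and not an existing NooBaa API:

```ts
// Hypothetical retry config - the keys and defaults are illustrative only.
interface RetryConfig {
    enabled: boolean;          // allow disabling the behavior per system
    max_retries: number;       // maximum number of retry attempts
    max_total_ms: number;      // overall time budget for retries
    initial_delay_ms: number;  // first backoff delay
    backoff_factor: number;    // exponential backoff multiplier
}

const DEFAULT_RETRY_CONFIG: RetryConfig = {
    enabled: true,
    max_retries: 10,
    max_total_ms: 60_000,
    initial_delay_ms: 200,
    backoff_factor: 2,
};

// Which errno values count as "temporary" may differ per FS backend
// (e.g. ESTALE is not always recoverable), so keep this pluggable.
function is_temporary_fs_error(err: any): boolean {
    return Boolean(err) && (err.code === 'ESTALE' || err.code === 'EIO');
}

function sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function retry_temporary_fs_errors<T>(
    fs_call: () => Promise<T>,
    is_request_aborted: () => boolean,   // e.g. check the S3 request's abort state
    config: RetryConfig = DEFAULT_RETRY_CONFIG,
): Promise<T> {
    const start = Date.now();
    let delay = config.initial_delay_ms;
    for (let attempt = 0; ; attempt += 1) {
        try {
            return await fs_call();
        } catch (err) {
            const give_up =
                !config.enabled ||
                !is_temporary_fs_error(err) ||
                attempt >= config.max_retries ||
                Date.now() - start >= config.max_total_ms ||
                is_request_aborted();    // stop retrying if the client went away
            if (give_up) throw err;
            await sleep(delay);
            delay *= config.backoff_factor;
        }
    }
}
```

A call site would then look roughly like `retry_temporary_fs_errors(() => fs.promises.open(file_path, 'r'), () => s3_request_aborted)`. Keeping the error classifier and the config separate from the retry loop makes it easy to plug in backend-specific detection (e.g. different handling of ESTALE for GPFS) without touching the retry logic itself.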