Skip to content

glusterd snapshot (ZFS): snapshot delete reports success when the backend zfs destroy fails, leaving an untracked ZFS snapshot #4689

Description

@ThalesBarretto

Environment / provenance

  • Component: glusterd — snapshot ZFS provider.
  • Source state: present on devel at commit ae1d69672f (the line numbers below are at that commit),
    unchanged since introduction: the provider's zfs clone/zfs destroy lines come from 8a17e531c05 (the
    ZFS provider, PR snapshots: Support for ZFS Snapshots #2855, 2021-12-25); the generic swallow in glusterd_snapshot_remove from the snapshot
    pluginization commit 1787def0ab1 (2021-12-17).
  • Reproduced on: a shipping GlusterFS 11.2 + ZFS 2.4.2 system (single-brick volume). The defect
    is per-brick (the provider acts per brick; the swallow is in the generic per-snapshot remove path), so a
    multi-brick volume would orphan one backend snapshot per affected brick — not separately reproduced here.

Summary

snapshot clone (and snapshot restore) create the backend snapshot's clone with zfs clone and never
zfs promote it, so the clone keeps origin == <snapshot> — a normal ZFS dependency. A later
snapshot delete of that source snapshot runs a plain zfs destroy <snapshot> (no -R), which ZFS
correctly refuses while a dependent clone exists, returning a nonzero status. glusterd logs that failure at
ERROR but then resets the transaction return code to 0 during directory cleanup, so the delete reports
success: the CLI prints snap removed successfully (exit 0) and glusterd drops the snapshot object, while
the backend ZFS snapshot remains — now untracked by glusterd.

The defect is the silent-to-the-operator success on a real backend failure. (It is logged at ERROR in
glusterd.log, but the CLI/RPC result is success.) The dependent clone itself is a normal, glusterd-tracked
clone volume; the leaked object is the backend ZFS snapshot it holds.

Affected code

Both clone and restore build the clone with zfs clone, never zfs promote
(xlators/mgmt/glusterd/src/snapshot/glusterd-zfs-snapshot.c):

/* snapshot clone   -> .clone   -> glusterd_zfs_snapshot_create_or_clone(): */
    runner_add_args(&runner, ZFS_COMMAND, "clone", snap_device, clone_device, NULL);  /* line 179 */
/* snapshot restore -> .restore -> glusterd_zfs_snapshot_restore():          */
    runner_add_args(&runner, ZFS_COMMAND, "clone", snap_device, clone_device, NULL);  /* line 430 */

Remove is a plain zfs destroy, no -R (its only removal primitive — there is no zfs promote/-R
anywhere in the provider), glusterd_zfs_snapshot_remove:

    runner_add_args(&runner, ZFS_COMMAND, "destroy", snap_device, NULL);  /* line 332 */
    ret = runner_run(&runner);
    if (ret) { gf_msg(... GD_MSG_SNAP_REMOVE_FAIL ...); goto out; }       /* returns nonzero */

The nonzero return is dropped by the caller, glusterd_snapshot_remove (glusterd-snapshot.c):

    glusterd_snapshot_plugin_by_name(snap_vol->snap_plugin, &snap_ops);
    ret = snap_ops->remove(brickinfo, ...);      /* ~line 2600 */
    if (ret) {
        gf_msg(... GD_MSG_SNAP_REMOVE_FAIL ...);  /* ~line 2603: logged at ERROR — but no `goto out` */
    }
    ret = sys_rmdir(snap_path);                   /* ~2612: overwrites the failure with the cleanup result … */
    if (ret) { ret = 0; ... }                     /* ~2616/2637: … and forces ret = 0 even if rmdir failed */
out:
    return ret;                                   /* 0 — the backend failure has been overwritten */

So the failure is logged but execution falls through; ret is then reassigned by sys_rmdir (and forced to
0 if that also fails), so the original backend-removal failure is unconditionally overwritten and the
transaction returns success. (snap_ops->remove dispatches through the plugin vtable — selected in
glusterd-snapshot-utils.c — so this caller is shared by the LVM backend too.)

Root cause

A ZFS snapshot with a dependent clone genuinely cannot be removed by a plain zfs destroy — that is
correct ZFS behaviour, because the clone shares the snapshot's blocks. The bug is not that the destroy
fails; it is that glusterd_snapshot_remove does not propagate that failure (it logs it, falls through, and
a later cleanup step resets the return code to 0), so a recoverable backend failure is reported to the
operator as success and the snapshot object is dropped.

Reproduction

Single-brick volume on a ZFS dataset (so the provider selects zfs_snap_ops):

gluster volume create gv0 <host>:/<pool-mount>/gv0/brick force && gluster volume start gv0
gluster snapshot create snap1 gv0 no-timestamp     # -> zfs snapshot pool/gv0@<snapid>
gluster snapshot activate snap1
gluster snapshot clone clonevol snap1              # -> zfs clone (line 179): clonevol's brick.origin = snap1's snapshot
gluster snapshot delete snap1                      # reports success (see below)

Observed output (GlusterFS 11.2, ZFS 2.4.2; identifiers shortened):

$ gluster snapshot delete snap1 ; echo "exit=$?"
snapshot delete: snap1: snap removed successfully
exit=0

$ gluster snapshot list
No snapshots present                                 # glusterd no longer tracks snap1

$ zfs list -t snapshot
pool/gv0@<snapid>                                    # the snapshot is still here

$ zfs get -H -o value origin pool/gv0/<cloneid>      # clonevol's brick still depends on it:
pool/gv0@<snapid>
$ zfs get -H -o value clones pool/gv0@<snapid>
pool/gv0/<cloneid>

$ zfs destroy pool/gv0@<snapid>                      # the provider's plain destroy still can't remove it:
cannot destroy 'pool/gv0@<snapid>': snapshot has dependent clones
use '-R' to destroy the following datasets:
pool/gv0/<cloneid>

The matching glusterd.log lines for that delete (logged at ERROR while the CLI returned success):

E [MSGID: 106043] [glusterd-zfs-snapshot.c:337:glusterd_zfs_snapshot_remove] 0-management: removing snapshot of the brick (<brick-path>) of snapshot snap1 failed
E [MSGID: 106043] [glusterd-snapshot.c:2603:glusterd_snapshot_remove] 0-management: Failed to remove the snapshot <brick-path>/.zfs/snapshot/<snapid>/brick (snap1)

snapshot restore is also implemented via an un-promoted zfs clone (line 430), creating the same kind of
snapshot↔clone dependency; only the snapshot clone path above was reproduced end-to-end (it is the
cleanest trigger).

Impact

A snapshot delete that did not remove the backend snapshot is reported to the operator as success, and
glusterd forgets the snapshot. The ZFS snapshot is left behind — no longer removable by the provider's plain
zfs destroy while the clone depends on it, and no longer referenced by glusterd, so nothing will ever
retry it. It accumulates whenever a snapshot is cloned-from (or restored-from) and that source snapshot is
later deleted while the clone still exists. This is not data loss (no -R is issued), but it is a correctness
bug (success reported for a failed operation) and a steady consumer of pool space that is invisible to
gluster snapshot list.

Suggested fix

The delete should fail with a clear, actionable error instead of reporting success — e.g. "cannot delete
snapshot snap1: it is in use by clone volume clonevol; delete the clone first."
The dependent clone is
discoverable via zfs get -H -o value clones <snap_device>.

The clean place to reject this is the prevalidate phase. glusterd_snapshot_remove_prevalidate()
(dispatched for delete by glusterd_snapshot_prevalidate) already runs before the op commits; having the
ZFS provider expose a "has dependent clones" check there lets the delete be refused before any state
changes.

This phase choice matters: today the failure surfaces only at commit time, after
glusterd_snapshot_remove_commit has already logged "marked snap … for decommission" (see the log above).
Merely propagating the error from glusterd_snapshot_remove (which currently logs it, then has ret
overwritten by the sys_rmdir cleanup) would fail the op but could leave a partially-decommissioned
snapshot. Rejecting in prevalidate avoids that half-state; a commit-phase guard, if also wanted, needs
proper rollback. Exact placement is the implementer's call.

Either way, the existing tolerant behaviour for a backend snapshot that is already gone (where continuing
is correct) should be preserved — only the has-dependent-clone case should be refused. A blanket change to
stop the caller overwriting ret is not proposed: that path is shared with the LVM backend and the
auto-delete/GC callers and would need a wider audit.

Why not zfs promote?

The standard ZFS way to drop a snapshot that has dependents is zfs promote, but it is not a safe fix here
(observed on ZFS 2.4.2): zfs promote <clone> does not free the snapshot — it relocates it into the clone's
namespace and swaps roles, so the original snapshot name no longer exists (a follow-up
zfs destroy <snap_device> then exits 1, "could not find any snapshots to destroy"), and it makes the
live brick a dependent clone of the snapshot-clone volume. After that, a zfs destroy -R of the clone
volume would also destroy the live brick. Failing the delete (above) avoids creating that hidden
dependency.

Notes

  • A PR for the provider-scoped fix can follow; happy to align on the preferred shape first.
  • A formal model of the snapshot lifecycle and the full annotated testbed transcript (including the
    zfs promote relocation/cascade observations) are available on request.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions