You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Source state: present on devel at commit ae1d69672f (the line numbers below are at that commit),
unchanged since introduction: the provider's zfs clone/zfs destroy lines come from 8a17e531c05 (the
ZFS provider, PR snapshots: Support for ZFS Snapshots #2855, 2021-12-25); the generic swallow in glusterd_snapshot_remove from the snapshot
pluginization commit 1787def0ab1 (2021-12-17).
Reproduced on: a shipping GlusterFS 11.2 + ZFS 2.4.2 system (single-brick volume). The defect
is per-brick (the provider acts per brick; the swallow is in the generic per-snapshot remove path), so a
multi-brick volume would orphan one backend snapshot per affected brick — not separately reproduced here.
Summary
snapshot clone (and snapshot restore) create the backend snapshot's clone with zfs clone and never zfs promote it, so the clone keeps origin == <snapshot> — a normal ZFS dependency. A later snapshot delete of that source snapshot runs a plain zfs destroy <snapshot> (no -R), which ZFS
correctly refuses while a dependent clone exists, returning a nonzero status. glusterd logs that failure at
ERROR but then resets the transaction return code to 0 during directory cleanup, so the delete reports
success: the CLI prints snap removed successfully (exit 0) and glusterd drops the snapshot object, while
the backend ZFS snapshot remains — now untracked by glusterd.
The defect is the silent-to-the-operator success on a real backend failure. (It is logged at ERROR in glusterd.log, but the CLI/RPC result is success.) The dependent clone itself is a normal, glusterd-tracked
clone volume; the leaked object is the backend ZFS snapshot it holds.
Affected code
Both clone and restore build the clone with zfs clone, never zfs promote
(xlators/mgmt/glusterd/src/snapshot/glusterd-zfs-snapshot.c):
The nonzero return is dropped by the caller, glusterd_snapshot_remove (glusterd-snapshot.c):
glusterd_snapshot_plugin_by_name(snap_vol->snap_plugin, &snap_ops);
ret=snap_ops->remove(brickinfo, ...); /* ~line 2600 */if (ret) {
gf_msg(... GD_MSG_SNAP_REMOVE_FAIL ...); /* ~line 2603: logged at ERROR — but no `goto out` */
}
ret=sys_rmdir(snap_path); /* ~2612: overwrites the failure with the cleanup result … */if (ret) { ret=0; ... } /* ~2616/2637: … and forces ret = 0 even if rmdir failed */out:
returnret; /* 0 — the backend failure has been overwritten */
So the failure is logged but execution falls through; ret is then reassigned by sys_rmdir (and forced to
0 if that also fails), so the original backend-removal failure is unconditionally overwritten and the
transaction returns success. (snap_ops->remove dispatches through the plugin vtable — selected in glusterd-snapshot-utils.c — so this caller is shared by the LVM backend too.)
Root cause
A ZFS snapshot with a dependent clone genuinely cannot be removed by a plain zfs destroy — that is
correct ZFS behaviour, because the clone shares the snapshot's blocks. The bug is not that the destroy
fails; it is that glusterd_snapshot_remove does not propagate that failure (it logs it, falls through, and
a later cleanup step resets the return code to 0), so a recoverable backend failure is reported to the
operator as success and the snapshot object is dropped.
Reproduction
Single-brick volume on a ZFS dataset (so the provider selects zfs_snap_ops):
$ gluster snapshot delete snap1 ; echo "exit=$?"
snapshot delete: snap1: snap removed successfully
exit=0
$ gluster snapshot list
No snapshots present # glusterd no longer tracks snap1
$ zfs list -t snapshot
pool/gv0@<snapid> # the snapshot is still here
$ zfs get -H -o value origin pool/gv0/<cloneid> # clonevol's brick still depends on it:
pool/gv0@<snapid>
$ zfs get -H -o value clones pool/gv0@<snapid>
pool/gv0/<cloneid>
$ zfs destroy pool/gv0@<snapid> # the provider's plain destroy still can't remove it:
cannot destroy 'pool/gv0@<snapid>': snapshot has dependent clones
use '-R' to destroy the following datasets:
pool/gv0/<cloneid>
The matching glusterd.log lines for that delete (logged at ERROR while the CLI returned success):
E [MSGID: 106043] [glusterd-zfs-snapshot.c:337:glusterd_zfs_snapshot_remove] 0-management: removing snapshot of the brick (<brick-path>) of snapshot snap1 failed
E [MSGID: 106043] [glusterd-snapshot.c:2603:glusterd_snapshot_remove] 0-management: Failed to remove the snapshot <brick-path>/.zfs/snapshot/<snapid>/brick (snap1)
snapshot restore is also implemented via an un-promoted zfs clone (line 430), creating the same kind of
snapshot↔clone dependency; only the snapshot clone path above was reproduced end-to-end (it is the
cleanest trigger).
Impact
A snapshot delete that did not remove the backend snapshot is reported to the operator as success, and
glusterd forgets the snapshot. The ZFS snapshot is left behind — no longer removable by the provider's plain zfs destroy while the clone depends on it, and no longer referenced by glusterd, so nothing will ever
retry it. It accumulates whenever a snapshot is cloned-from (or restored-from) and that source snapshot is
later deleted while the clone still exists. This is not data loss (no -R is issued), but it is a correctness
bug (success reported for a failed operation) and a steady consumer of pool space that is invisible to gluster snapshot list.
Suggested fix
The delete should fail with a clear, actionable error instead of reporting success — e.g. "cannot delete
snapshot snap1: it is in use by clone volume clonevol; delete the clone first." The dependent clone is
discoverable via zfs get -H -o value clones <snap_device>.
The clean place to reject this is the prevalidate phase. glusterd_snapshot_remove_prevalidate()
(dispatched for delete by glusterd_snapshot_prevalidate) already runs before the op commits; having the
ZFS provider expose a "has dependent clones" check there lets the delete be refused before any state
changes.
This phase choice matters: today the failure surfaces only at commit time, after glusterd_snapshot_remove_commit has already logged "marked snap … for decommission" (see the log above).
Merely propagating the error from glusterd_snapshot_remove (which currently logs it, then has ret
overwritten by the sys_rmdir cleanup) would fail the op but could leave a partially-decommissioned
snapshot. Rejecting in prevalidate avoids that half-state; a commit-phase guard, if also wanted, needs
proper rollback. Exact placement is the implementer's call.
Either way, the existing tolerant behaviour for a backend snapshot that is already gone (where continuing
is correct) should be preserved — only the has-dependent-clone case should be refused. A blanket change to
stop the caller overwriting ret is not proposed: that path is shared with the LVM backend and the
auto-delete/GC callers and would need a wider audit.
Why not zfs promote?
The standard ZFS way to drop a snapshot that has dependents is zfs promote, but it is not a safe fix here
(observed on ZFS 2.4.2): zfs promote <clone> does not free the snapshot — it relocates it into the clone's
namespace and swaps roles, so the original snapshot name no longer exists (a follow-up zfs destroy <snap_device> then exits 1, "could not find any snapshots to destroy"), and it makes the live brick a dependent clone of the snapshot-clone volume. After that, a zfs destroy -R of the clone
volume would also destroy the live brick. Failing the delete (above) avoids creating that hidden
dependency.
Notes
A PR for the provider-scoped fix can follow; happy to align on the preferred shape first.
A formal model of the snapshot lifecycle and the full annotated testbed transcript (including the zfs promote relocation/cascade observations) are available on request.
Environment / provenance
develat commitae1d69672f(the line numbers below are at that commit),unchanged since introduction: the provider's
zfs clone/zfs destroylines come from8a17e531c05(theZFS provider, PR snapshots: Support for ZFS Snapshots #2855, 2021-12-25); the generic swallow in
glusterd_snapshot_removefrom the snapshotpluginization commit
1787def0ab1(2021-12-17).is per-brick (the provider acts per brick; the swallow is in the generic per-snapshot remove path), so a
multi-brick volume would orphan one backend snapshot per affected brick — not separately reproduced here.
Summary
snapshot clone(andsnapshot restore) create the backend snapshot's clone withzfs cloneand neverzfs promoteit, so the clone keepsorigin == <snapshot>— a normal ZFS dependency. A latersnapshot deleteof that source snapshot runs a plainzfs destroy <snapshot>(no-R), which ZFScorrectly refuses while a dependent clone exists, returning a nonzero status. glusterd logs that failure at
ERROR but then resets the transaction return code to 0 during directory cleanup, so the delete reports
success: the CLI prints
snap removed successfully(exit 0) and glusterd drops the snapshot object, whilethe backend ZFS snapshot remains — now untracked by glusterd.
The defect is the silent-to-the-operator success on a real backend failure. (It is logged at ERROR in
glusterd.log, but the CLI/RPC result is success.) The dependent clone itself is a normal, glusterd-trackedclone volume; the leaked object is the backend ZFS snapshot it holds.
Affected code
Both clone and restore build the clone with
zfs clone, neverzfs promote(
xlators/mgmt/glusterd/src/snapshot/glusterd-zfs-snapshot.c):Remove is a plain
zfs destroy, no-R(its only removal primitive — there is nozfs promote/-Ranywhere in the provider),
glusterd_zfs_snapshot_remove:The nonzero return is dropped by the caller,
glusterd_snapshot_remove(glusterd-snapshot.c):So the failure is logged but execution falls through;
retis then reassigned bysys_rmdir(and forced to0 if that also fails), so the original backend-removal failure is unconditionally overwritten and the
transaction returns success. (
snap_ops->removedispatches through the plugin vtable — selected inglusterd-snapshot-utils.c— so this caller is shared by the LVM backend too.)Root cause
A ZFS snapshot with a dependent clone genuinely cannot be removed by a plain
zfs destroy— that iscorrect ZFS behaviour, because the clone shares the snapshot's blocks. The bug is not that the destroy
fails; it is that
glusterd_snapshot_removedoes not propagate that failure (it logs it, falls through, anda later cleanup step resets the return code to 0), so a recoverable backend failure is reported to the
operator as success and the snapshot object is dropped.
Reproduction
Single-brick volume on a ZFS dataset (so the provider selects
zfs_snap_ops):Observed output (GlusterFS 11.2, ZFS 2.4.2; identifiers shortened):
The matching glusterd.log lines for that delete (logged at ERROR while the CLI returned success):
snapshot restoreis also implemented via an un-promotedzfs clone(line 430), creating the same kind ofsnapshot↔clone dependency; only the
snapshot clonepath above was reproduced end-to-end (it is thecleanest trigger).
Impact
A
snapshot deletethat did not remove the backend snapshot is reported to the operator as success, andglusterd forgets the snapshot. The ZFS snapshot is left behind — no longer removable by the provider's plain
zfs destroywhile the clone depends on it, and no longer referenced by glusterd, so nothing will everretry it. It accumulates whenever a snapshot is cloned-from (or restored-from) and that source snapshot is
later deleted while the clone still exists. This is not data loss (no
-Ris issued), but it is a correctnessbug (success reported for a failed operation) and a steady consumer of pool space that is invisible to
gluster snapshot list.Suggested fix
The delete should fail with a clear, actionable error instead of reporting success — e.g. "cannot delete
snapshot snap1: it is in use by clone volume
clonevol; delete the clone first." The dependent clone isdiscoverable via
zfs get -H -o value clones <snap_device>.The clean place to reject this is the prevalidate phase.
glusterd_snapshot_remove_prevalidate()(dispatched for delete by
glusterd_snapshot_prevalidate) already runs before the op commits; having theZFS provider expose a "has dependent clones" check there lets the delete be refused before any state
changes.
This phase choice matters: today the failure surfaces only at commit time, after
glusterd_snapshot_remove_commithas already logged "marked snap … for decommission" (see the log above).Merely propagating the error from
glusterd_snapshot_remove(which currently logs it, then hasretoverwritten by the
sys_rmdircleanup) would fail the op but could leave a partially-decommissionedsnapshot. Rejecting in prevalidate avoids that half-state; a commit-phase guard, if also wanted, needs
proper rollback. Exact placement is the implementer's call.
Either way, the existing tolerant behaviour for a backend snapshot that is already gone (where continuing
is correct) should be preserved — only the has-dependent-clone case should be refused. A blanket change to
stop the caller overwriting
retis not proposed: that path is shared with the LVM backend and theauto-delete/GC callers and would need a wider audit.
Why not
zfs promote?The standard ZFS way to drop a snapshot that has dependents is
zfs promote, but it is not a safe fix here(observed on ZFS 2.4.2):
zfs promote <clone>does not free the snapshot — it relocates it into the clone'snamespace and swaps roles, so the original snapshot name no longer exists (a follow-up
zfs destroy <snap_device>then exits 1, "could not find any snapshots to destroy"), and it makes thelive brick a dependent clone of the snapshot-clone volume. After that, a
zfs destroy -Rof the clonevolume would also destroy the live brick. Failing the delete (above) avoids creating that hidden
dependency.
Notes
zfs promoterelocation/cascade observations) are available on request.