Commit 601ef0d

gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by I/O errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal and tell dlm it's done with the file system, and none of the other nodes would know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources.

This patch makes file system withdraws visible to the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery.

The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay.

Until the journal replay is complete, no I/O related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held.

The "live" glock is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only used to indicate that a withdraw has occurred.
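The journal check performed by the remaining nodes can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct journal_model`, its fields, and `replay_idle_dirty_journals` are hypothetical names invented for illustration and do not exist in gfs2. The `busy` flag stands in for "the owning node still holds its journal glock", which the real code discovers by trying to lock the journal.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in gfs2. */
struct journal_model {
	int jid;     /* journal id, one journal per cluster node */
	bool dirty;  /* true if the journal still needs replay */
	bool busy;   /* true while the owning node still holds its journal glock */
};

/*
 * Model of one surviving node's pass over all journals: skip our own
 * journal, skip clean journals, and replay only dirty journals that are
 * "not busy", i.e. whose owner has withdrawn and dropped its journal glock.
 * Returns the number of journals replayed.
 */
static size_t replay_idle_dirty_journals(struct journal_model *j, size_t n,
					 int my_jid)
{
	size_t replayed = 0;

	for (size_t i = 0; i < n; i++) {
		if (j[i].jid == my_jid || !j[i].dirty)
			continue;
		if (j[i].busy)       /* owner is still live; leave it alone */
			continue;
		j[i].dirty = false;  /* stands in for the actual replay */
		replayed++;
	}
	return replayed;
}
```

In this model, a cluster with one withdrawn node and several live nodes with in-flight transactions replays exactly one journal, because every live node's journal is dirty but busy.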
Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop.

Also note that the withdraw function may be called from a wide variety of situations, and therefore, we need to take extra precautions to make sure pointers are valid before using them in many circumstances.

We also need to take care when glocks decide to withdraw, since the withdraw code now uses glocks.

Also, before this patch, if a process encountered an error and decided to withdraw while another process was already withdrawing, the second withdraw would be silently ignored, leaving it free to unlock its glocks. That's correct behavior if the original withdrawer encounters further errors down the road. But if secondary waiters don't wait for the journal replay, unlocking glocks will allow other nodes to use them, despite the fact that the journal containing those blocks is being replayed. The replay needs to finish before our glocks are released to other nodes. In other words, secondary withdraws need to wait for the first withdraw to finish.

For example, if an rgrp glock is unlocked by a process that didn't wait for the first withdraw, a journal replay could introduce file system corruption by replaying an rgrp block that has already been granted to a different cluster node.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
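The rule that secondary withdraws wait for the first one can be sketched as a small state model. This is a minimal illustration, not the kernel implementation: `struct withdraw_model` and the three helpers are hypothetical names, and C11 atomics stand in for the kernel's bit operations on `sd_flags`. A real waiter would sleep until the replay finishes; here the condition is only exposed as a predicate.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the withdraw state; not gfs2 code. */
struct withdraw_model {
	atomic_bool in_progress;  /* some process has started a withdraw */
	atomic_bool replay_done;  /* our journal has been replayed by a peer */
};

/* Returns true only for the first caller, who then performs the withdraw. */
static bool withdraw_claim(struct withdraw_model *w)
{
	return !atomic_exchange(&w->in_progress, true);
}

/* Called once another node has finished replaying our journal. */
static void withdraw_replay_finished(struct withdraw_model *w)
{
	atomic_store(&w->replay_done, true);
}

/*
 * A process may release its glocks only if no withdraw is under way, or
 * if the journal replay for the withdraw has completed. Secondary
 * withdrawers must wait until this becomes true.
 */
static bool may_release_glocks(struct withdraw_model *w)
{
	return !atomic_load(&w->in_progress) || atomic_load(&w->replay_done);
}
```

The second caller of `withdraw_claim` gets `false`, and `may_release_glocks` stays `false` for every process until `withdraw_replay_finished` runs, which is the ordering the message above asks for.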
1 parent a72d240 commit 601ef0d

11 files changed, +390 -31 lines changed

fs/gfs2/glock.c

Lines changed: 20 additions & 3 deletions
@@ -271,7 +271,7 @@ static void __gfs2_glock_put(struct gfs2_glock *gl)
 	gfs2_glock_remove_from_lru(gl);
 	spin_unlock(&gl->gl_lockref.lock);
 	GLOCK_BUG_ON(gl, !list_empty(&gl->gl_holders));
-	GLOCK_BUG_ON(gl, mapping && mapping->nrpages);
+	GLOCK_BUG_ON(gl, mapping && mapping->nrpages && !gfs2_withdrawn(sdp));
 	trace_gfs2_glock_put(gl);
 	sdp->sd_lockstruct.ls_ops->lm_put_lock(gl);
 }
@@ -576,7 +576,8 @@ __acquires(&gl->gl_lockref.lock)
 	unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
 	int ret;
 
-	if (target != LM_ST_UNLOCKED && glock_blocked_by_withdraw(gl))
+	if (target != LM_ST_UNLOCKED && glock_blocked_by_withdraw(gl) &&
+	    gh && !(gh->gh_flags & LM_FLAG_NOEXP))
 		return;
 	lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
 		      LM_FLAG_PRIORITY);
@@ -1222,7 +1223,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 	struct gfs2_glock *gl = gh->gh_gl;
 	int error = 0;
 
-	if (glock_blocked_by_withdraw(gl))
+	if (glock_blocked_by_withdraw(gl) && !(gh->gh_flags & LM_FLAG_NOEXP))
 		return -EIO;
 
 	if (test_bit(GLF_LRU, &gl->gl_flags))
@@ -1266,10 +1267,26 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
 void gfs2_glock_dq(struct gfs2_holder *gh)
 {
 	struct gfs2_glock *gl = gh->gh_gl;
+	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	unsigned delay = 0;
 	int fast_path = 0;
 
 	spin_lock(&gl->gl_lockref.lock);
+	/*
+	 * If we're in the process of file system withdraw, we cannot just
+	 * dequeue any glocks until our journal is recovered, lest we
+	 * introduce file system corruption. We need two exceptions to this
+	 * rule: We need to allow unlocking of nondisk glocks and the glock
+	 * for our own journal that needs recovery.
+	 */
+	if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
+	    glock_blocked_by_withdraw(gl) &&
+	    gh->gh_gl != sdp->sd_jinode_gl) {
+		sdp->sd_glock_dqs_held++;
+		might_sleep();
+		wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
+			    TASK_UNINTERRUPTIBLE);
+	}
 	if (gh->gh_flags & GL_NOCACHE)
 		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
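
The enqueue check changed in gfs2_glock_nq above can be read as a simple decision rule. The following is a userspace sketch under assumptions: `struct glock_model`, `MODEL_FLAG_NOEXP`, and both helpers are hypothetical stand-ins, and the body of `model_blocked_by_withdraw` is inferred from the comment in gfs2_glock_dq about nondisk glocks being exempt, since glock_blocked_by_withdraw itself is not part of the hunks shown here.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for LM_FLAG_NOEXP; the value here is arbitrary. */
#define MODEL_FLAG_NOEXP 0x1u

/* Hypothetical model of the state the check looks at; not gfs2 code. */
struct glock_model {
	bool fs_withdrawn;  /* the file system has been withdrawn */
	bool nondisk;       /* glock type has no on-disk blocks behind it */
};

/* Assumed behavior: only disk-backed glocks are blocked by a withdraw. */
static bool model_blocked_by_withdraw(const struct glock_model *gl)
{
	return gl->fs_withdrawn && !gl->nondisk;
}

/*
 * Mirrors the shape of the new test: refuse the enqueue with -EIO unless
 * the holder carries the NOEXP flag, which recovery-related requests use.
 */
static int model_glock_nq(const struct glock_model *gl,
			  unsigned int holder_flags)
{
	if (model_blocked_by_withdraw(gl) &&
	    !(holder_flags & MODEL_FLAG_NOEXP))
		return -EIO;
	return 0;
}
```

Under this model an ordinary request on a disk glock fails after a withdraw, while a NOEXP request on the same glock, or any request on a nondisk glock, still goes through.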

fs/gfs2/glops.c

Lines changed: 76 additions & 1 deletion
@@ -29,6 +29,8 @@
 
 struct workqueue_struct *gfs2_freeze_wq;
 
+extern struct workqueue_struct *gfs2_control_wq;
+
 static void gfs2_ail_error(struct gfs2_glock *gl, const struct buffer_head *bh)
 {
 	fs_err(gl->gl_name.ln_sbd,
@@ -496,13 +498,17 @@ static void freeze_go_sync(struct gfs2_glock *gl)
 	int error = 0;
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 
-	if (gl->gl_state == LM_ST_SHARED &&
+	if (gl->gl_state == LM_ST_SHARED && !gfs2_withdrawn(sdp) &&
 	    test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags)) {
 		atomic_set(&sdp->sd_freeze_state, SFS_STARTING_FREEZE);
 		error = freeze_super(sdp->sd_vfs);
 		if (error) {
 			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
 				error);
+			if (gfs2_withdrawn(sdp)) {
+				atomic_set(&sdp->sd_freeze_state, SFS_UNFROZEN);
+				return;
+			}
 			gfs2_assert_withdraw(sdp, 0);
 		}
 		queue_work(gfs2_freeze_wq, &sdp->sd_freeze_work);
@@ -577,6 +583,73 @@ static void iopen_go_callback(struct gfs2_glock *gl, bool remote)
 	}
 }
 
+/**
+ * inode_go_free - wake up anyone waiting for dlm's unlock ast to free it
+ * @gl: glock being freed
+ *
+ * For now, this is only used for the journal inode glock. In withdraw
+ * situations, we need to wait for the glock to be freed so that we know
+ * other nodes may proceed with recovery / journal replay.
+ */
+static void inode_go_free(struct gfs2_glock *gl)
+{
+	/* Note that we cannot reference gl_object because it's already set
+	 * to NULL by this point in its lifecycle. */
+	if (!test_bit(GLF_FREEING, &gl->gl_flags))
+		return;
+	clear_bit_unlock(GLF_FREEING, &gl->gl_flags);
+	wake_up_bit(&gl->gl_flags, GLF_FREEING);
+}
+
+/**
+ * nondisk_go_callback - used to signal when a node did a withdraw
+ * @gl: the nondisk glock
+ * @remote: true if this came from a different cluster node
+ *
+ */
+static void nondisk_go_callback(struct gfs2_glock *gl, bool remote)
+{
+	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+
+	/* Ignore the callback unless it's from another node, and it's the
+	   live lock. */
+	if (!remote || gl->gl_name.ln_number != GFS2_LIVE_LOCK)
+		return;
+
+	/* First order of business is to cancel the demote request. We don't
+	 * really want to demote a nondisk glock. At best it's just to inform
+	 * us of another node's withdraw. We'll keep it in SH mode. */
+	clear_bit(GLF_DEMOTE, &gl->gl_flags);
+	clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
+
+	/* Ignore the unlock if we're withdrawn, unmounting, or in recovery. */
+	if (test_bit(SDF_NORECOVERY, &sdp->sd_flags) ||
+	    test_bit(SDF_WITHDRAWN, &sdp->sd_flags) ||
+	    test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags))
+		return;
+
+	/* We only care when a node wants us to unlock, because that means
+	 * they want a journal recovered. */
+	if (gl->gl_demote_state != LM_ST_UNLOCKED)
+		return;
+
+	if (sdp->sd_args.ar_spectator) {
+		fs_warn(sdp, "Spectator node cannot recover journals.\n");
+		return;
+	}
+
+	fs_warn(sdp, "Some node has withdrawn; checking for recovery.\n");
+	set_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
+	/*
+	 * We can't call remote_withdraw directly here or gfs2_recover_journal
+	 * because this is called from the glock unlock function and the
+	 * remote_withdraw needs to enqueue and dequeue the same "live" glock
+	 * we were called from. So we queue it to the control work queue in
+	 * lock_dlm.
+	 */
+	queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
+}
+
 const struct gfs2_glock_operations gfs2_meta_glops = {
 	.go_type = LM_TYPE_META,
 	.go_flags = GLOF_NONDISK,
@@ -590,6 +663,7 @@ const struct gfs2_glock_operations gfs2_inode_glops = {
 	.go_dump = inode_go_dump,
 	.go_type = LM_TYPE_INODE,
 	.go_flags = GLOF_ASPACE | GLOF_LRU,
+	.go_free = inode_go_free,
 };
 
 const struct gfs2_glock_operations gfs2_rgrp_glops = {
@@ -623,6 +697,7 @@ const struct gfs2_glock_operations gfs2_flock_glops = {
 const struct gfs2_glock_operations gfs2_nondisk_glops = {
 	.go_type = LM_TYPE_NONDISK,
 	.go_flags = GLOF_NONDISK,
+	.go_callback = nondisk_go_callback,
 };
 
 const struct gfs2_glock_operations gfs2_quota_glops = {

fs/gfs2/incore.h

Lines changed: 9 additions & 0 deletions
@@ -242,6 +242,7 @@ struct gfs2_glock_operations {
 	void (*go_dump)(struct seq_file *seq, struct gfs2_glock *gl,
 			const char *fs_id_buf);
 	void (*go_callback)(struct gfs2_glock *gl, bool remote);
+	void (*go_free)(struct gfs2_glock *gl);
 	const int go_type;
 	const unsigned long go_flags;
 #define GLOF_ASPACE 1 /* address space attached */
@@ -343,6 +344,7 @@ enum {
 	GLF_OBJECT = 14, /* Used only for tracing */
 	GLF_BLOCKING = 15,
 	GLF_INODE_CREATING = 16, /* Inode creation occurring */
+	GLF_FREEING = 18, /* Wait for glock to be freed */
 };
 
 struct gfs2_glock {
@@ -619,6 +621,10 @@ enum {
 	SDF_FORCE_AIL_FLUSH = 9,
 	SDF_FS_FROZEN = 10,
 	SDF_WITHDRAWING = 11, /* Will withdraw eventually */
+	SDF_WITHDRAW_IN_PROG = 12, /* Withdraw is in progress */
+	SDF_REMOTE_WITHDRAW = 13, /* Performing remote recovery */
+	SDF_WITHDRAW_RECOVERY = 14, /* Wait for journal recovery when we are
+				       withdrawing */
 };
 
 enum gfs2_freeze_state {
@@ -769,6 +775,7 @@ struct gfs2_sbd {
 	struct gfs2_jdesc *sd_jdesc;
 	struct gfs2_holder sd_journal_gh;
 	struct gfs2_holder sd_jinode_gh;
+	struct gfs2_glock *sd_jinode_gl;
 
 	struct gfs2_holder sd_sc_gh;
 	struct gfs2_holder sd_qc_gh;
@@ -830,6 +837,7 @@ struct gfs2_sbd {
 	struct bio *sd_log_bio;
 	wait_queue_head_t sd_log_flush_wait;
 	int sd_log_error; /* First log error */
+	wait_queue_head_t sd_withdraw_wait;
 
 	atomic_t sd_reserving_log;
 	wait_queue_head_t sd_reserving_log_wait;
@@ -853,6 +861,7 @@ struct gfs2_sbd {
 
 	unsigned long sd_last_warning;
 	struct dentry *debugfs_dir; /* debugfs directory */
+	unsigned long sd_glock_dqs_held;
 };
 
 static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int which)

fs/gfs2/lock_dlm.c

Lines changed: 34 additions & 0 deletions
@@ -16,6 +16,8 @@
 
 #include "incore.h"
 #include "glock.h"
+#include "glops.h"
+#include "recovery.h"
 #include "util.h"
 #include "sys.h"
 #include "trace_gfs2.h"
@@ -124,6 +126,8 @@ static void gdlm_ast(void *arg)
 
 	switch (gl->gl_lksb.sb_status) {
 	case -DLM_EUNLOCK: /* Unlocked, so glock can be freed */
+		if (gl->gl_ops->go_free)
+			gl->gl_ops->go_free(gl);
 		gfs2_glock_free(gl);
 		return;
 	case -DLM_ECANCEL: /* Cancel while getting lock */
@@ -323,6 +327,7 @@ static void gdlm_cancel(struct gfs2_glock *gl)
 /*
  * dlm/gfs2 recovery coordination using dlm_recover callbacks
  *
+ * 0. gfs2 checks for another cluster node withdraw, needing journal replay
  * 1. dlm_controld sees lockspace members change
  * 2. dlm_controld blocks dlm-kernel locking activity
  * 3. dlm_controld within dlm-kernel notifies gfs2 (recover_prep)
@@ -571,6 +576,28 @@ static int control_lock(struct gfs2_sbd *sdp, int mode, uint32_t flags)
 			 &ls->ls_control_lksb, "control_lock");
 }
 
+/**
+ * remote_withdraw - react to a node withdrawing from the file system
+ * @sdp: The superblock
+ */
+static void remote_withdraw(struct gfs2_sbd *sdp)
+{
+	struct gfs2_jdesc *jd;
+	int ret = 0, count = 0;
+
+	list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
+		if (jd->jd_jid == sdp->sd_lockstruct.ls_jid)
+			continue;
+		ret = gfs2_recover_journal(jd, true);
+		if (ret)
+			break;
+		count++;
+	}
+
+	/* Now drop the additional reference we acquired */
+	fs_err(sdp, "Journals checked: %d, ret = %d.\n", count, ret);
+}
+
 static void gfs2_control_func(struct work_struct *work)
 {
 	struct gfs2_sbd *sdp = container_of(work, struct gfs2_sbd, sd_control_work.work);
@@ -581,6 +608,13 @@ static void gfs2_control_func(struct work_struct *work)
 	int recover_size;
 	int i, error;
 
+	/* First check for other nodes that may have done a withdraw. */
+	if (test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags)) {
+		remote_withdraw(sdp);
+		clear_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
+		return;
+	}
+
 	spin_lock(&ls->ls_recover_spin);
 	/*
 	 * No MOUNT_DONE means we're still mounting; control_mount()

fs/gfs2/meta_io.c

Lines changed: 2 additions & 1 deletion
@@ -251,7 +251,8 @@ int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
 	struct buffer_head *bh, *bhs[2];
 	int num = 0;
 
-	if (unlikely(gfs2_withdrawn(sdp))) {
+	if (unlikely(gfs2_withdrawn(sdp)) &&
+	    (!sdp->sd_jdesc || (blkno != sdp->sd_jdesc->jd_no_addr))) {
 		*bhp = NULL;
 		return -EIO;
 	}

fs/gfs2/ops_fstype.c

Lines changed: 8 additions & 3 deletions
@@ -656,14 +656,16 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 
 		error = gfs2_glock_nq_num(sdp, sdp->sd_lockstruct.ls_jid,
 					  &gfs2_journal_glops,
-					  LM_ST_EXCLUSIVE, LM_FLAG_NOEXP,
+					  LM_ST_EXCLUSIVE,
+					  LM_FLAG_NOEXP | GL_NOCACHE,
 					  &sdp->sd_journal_gh);
 		if (error) {
 			fs_err(sdp, "can't acquire journal glock: %d\n", error);
 			goto fail_jindex;
 		}
 
 		ip = GFS2_I(sdp->sd_jdesc->jd_inode);
+		sdp->sd_jinode_gl = ip->i_gl;
 		error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED,
 					   LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE,
 					   &sdp->sd_jinode_gh);
@@ -724,10 +726,13 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
 	return 0;
 
 fail_jinode_gh:
-	if (!sdp->sd_args.ar_spectator)
+	/* A withdraw may have done dq/uninit so now we need to check it */
+	if (!sdp->sd_args.ar_spectator &&
+	    gfs2_holder_initialized(&sdp->sd_jinode_gh))
 		gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
 fail_journal_gh:
-	if (!sdp->sd_args.ar_spectator)
+	if (!sdp->sd_args.ar_spectator &&
+	    gfs2_holder_initialized(&sdp->sd_journal_gh))
 		gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
 fail_jindex:
 	gfs2_jindex_free(sdp);

fs/gfs2/quota.c

Lines changed: 3 additions & 0 deletions
@@ -1541,6 +1541,8 @@ int gfs2_quotad(void *data)
 
 	while (!kthread_should_stop()) {
 
+		if (gfs2_withdrawn(sdp))
+			goto bypass;
 		/* Update the master statfs file */
 		if (sdp->sd_statfs_force_sync) {
 			int error = gfs2_statfs_sync(sdp->sd_vfs, 0);
@@ -1561,6 +1563,7 @@ int gfs2_quotad(void *data)
 
 		try_to_freeze();
 
+bypass:
 		t = min(quotad_timeo, statfs_timeo);
 
 		prepare_to_wait(&sdp->sd_quota_wait, &wait, TASK_INTERRUPTIBLE);
