Skip to content

During pool export flush the ARC asynchronously #16215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 5, 2024

Conversation

don-brady
Copy link
Contributor

@don-brady don-brady commented May 21, 2024

Motivation and Context

When a pool is exported, the ARC buffers in use by that pool are flushed (evicted) as part of the export. In addition, any L2 VDEVs are removed from the L2 ARC. Both of these operations are performed sequentially and inline to the export. For larger ARC footprints, this can represent a significant amount of time. In an HA scenario, this can cause a planned failover to take longer than needed and risk timeouts on the services backed by the pool data.

Description

The teardown of the ARC data used by the pool can be done asynchronously during a pool export. For the main ARC data, the spa load GUID is used to associate a buffer with the spa so we can safely free the spa_t while the teardown proceeds in the background. For the L2 ARC VDEV, the device l2arc_dev_t has a copy of the vdev_t pointer which was being used to calculate the asize of the buffer from the psize when updating the L2 arc stats during the teardown. This asize value can be captured when the buffer is created, thereby eliminating the need for a late binding asize calculation using the VDEV during teardown.

Added an arc_flush_taskq for these background teardown tasks. In arc_fini() (e.g., during module unload) it now waits for any teardown tasks to complete.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

  1. Manual testing with ARC and multiple L2 arc devices up to about 24G of arc data and ~100GB of L2 capacity. The pool export went from about 45 seconds down to 5 seconds with the asynchronous teardown in place.
  2. Manually tested exporting while a L2 rebuild was still in progress. The L2 vdev waits for the rebuild to be canceled before proceeding with the teardown.
  3. Ran various ZTS test suites, like l2arc, zpool_import, zpool_export, to exercise the changed code paths.
  4. Ran ztest

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@don-brady don-brady added the Status: Code Review Needed Ready for review and testing label May 22, 2024
@behlendorf behlendorf self-assigned this May 25, 2024
@don-brady
Copy link
Contributor Author

@gamanakis -- if you have time could you look at the L2 part of this change? Thanks

@gamanakis
Copy link
Contributor

Thanks for including me on this, on a first pass it looks good.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks interesting, but what happen if the pool (obviously with the same GUID) is re-imported while async flush is still running?

@don-brady
Copy link
Contributor Author

Looks interesting, but what happen if the pool (obviously with the same GUID) is re-imported while async flush is still running?

Good question and I tested this case. The async flush will continue it's best-effort pass to flush buffers associated with the exported spa's guid. Any ARC buffers that the import uses will have a positive ref count and be skipped by the async flush task. You can think of it as an alternate arc evict thread that is targeting a specific guid with a zero ref count rather than age.

I suppose we could have the task periodically check if there is an active spa with that guid and exit the task. I'm not sure how common it is to export and then re-import right away on the same host. Before this change, you would have to wait until the export finished flushing the ARC buffers.

The ARC teardown for a spa can take multiple minutes, so you could even have a second pool export+import while the first arc_flush_task() is still running and end up with two arc_flush_task() both looking to evict candidates for the same guid. This is not fatal but a little weird.

@amotin
Copy link
Member

amotin commented Jun 12, 2024

Aside of it being weird, I worry that even if unlikely it is not impossible for the pool to be changed while being exported, while ARC still holds the old data.

@allanjude
Copy link
Contributor

Aside of it being weird, I worry that even if unlikely it is not impossible for the pool to be changed while being exported, while ARC still holds the old data.

I share this concern. Would it make sense to block importing the same pool again until eviction is complete?

@don-brady
Copy link
Contributor Author

Per @amotin -- add a zpool wait for ARC teardown

@don-brady
Copy link
Contributor Author

To address concerns of re-importing after the pool was changed on another host:

Save txg at export and compare to imported txg

  • if same, then cancel the teardown (ARC data is still valid)
  • If different, force import to wait for teardown to complete (ARC data can be stale)

@amotin
Copy link
Member

amotin commented Jun 18, 2024

Thinking more about it, since ARC is indexed on DVA+birth, in case of pool export/import if some blocks appear stale, they should just never be used, unless we actually import some other pool somehow having the same GUID, or import the pool at earlier TXG. We already have somewhat similar problem with persistent L2ARC, when we load into ARC headers blocks that could be long freed from the pool, and they stay in ARC until L2ARC rotate and evict them. But in case of L2ARC we at least know that those stale blocks are from this pool and just won't be used again. I am not sure whether multiple pools with the same GUID is a realistic scenario, but importing the pool at earlier TXG I think may be more realistic, and dangerous same time, since the same TXG numbers may be reused.

@don-brady
Copy link
Contributor Author

Update on re-import while ARC tear-down in progress

Ever since the commit that added zpool reguid feature, the ARC uses the spa_load_guid, not the spa's actual guid for identification. This load guid is transient, not persistent and will change at each import. So after the import, any blocks left in the ARC with this load guid are now orphaned and not associated with any spa.

So we don't need to worry about ARC blocks that are still around when the pool is re-imported since it will identify its blocks using a different spa_load_guid.

@don-brady don-brady requested a review from behlendorf June 27, 2024 03:08
@don-brady don-brady requested a review from grwilson July 17, 2024 19:13
@richardelling
Copy link
Contributor

@don-brady I was wondering when you were going to remember the behaviour of spa_load_guid

Copy link
Member

@grwilson grwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at having arc_evict prioritize evicting from flushed pools vs trying to evict buffers from active pools?

@don-brady
Copy link
Contributor Author

Have you looked at having arc_evict prioritize evicting from flushed pools vs trying to evict buffers from active pools?
@grwilson

I hadn't considered it other than if it was safe. Both arc_evict() and arc_flush() sit on top of arc_evict_state(). In the flush case it is targeting a specific spa (guid) and in the evict case it ignores targeting (i.e. when guid == 0). So arc_evict_state() is either targeting a specific guid or none at all.

We can have multiple threads (arc_evict() and multiple arc_flush_async()) all running at the same time. And in the underlying arc_evict_state() it's using a randomly selected sublist. So they will likely be working on different buffers.

(a) One option would be to have arc_evict() thread back off when there are any active arc flushes so as to give those flush-initiated evictions priority. And maybe have the last arc flush wake up the arc evict thread.

(b) Another option would be to keep a list of active flush spa guids and have the arc_evict() thread only match on guids from the list if it is not empty -- so it ends up only targeting buffers that need to be flushed.

@don-brady
Copy link
Contributor Author

Rebased to fix merge conflicts.

Eviction thread nows considers active spa flushes and targets those spas (if any).

@don-brady
Copy link
Contributor Author

Removed last commit and rebased to latest master branch.

@don-brady
Copy link
Contributor Author

Rebased to latest master and switched to use taskq_dispatch_ent().

@don-brady
Copy link
Contributor Author

fixed a leak in arc_async_flush_list() found during module unload by ztest

@gamanakis
Copy link
Contributor

Thank you for this PR, the L2ARC part looks good to me in a first pass.

@amotin amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Nov 6, 2024
@github-actions github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Nov 19, 2024
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Don Brady <don.brady@klarasystems.com>
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 5, 2024
@behlendorf behlendorf merged commit 44446dc into openzfs:master Dec 5, 2024
23 of 24 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Jan 26, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#16215
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request Feb 19, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#16215
(cherry picked from commit 44446dc)
pcd1193182 pushed a commit to KlaraSystems/zfs that referenced this pull request Mar 24, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#16215
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 14, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#16215
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request May 29, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#16215
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#16215
behlendorf pushed a commit that referenced this pull request Jun 17, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16215
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 17, 2025
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#16215
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants