
Commit 2d948d6

More adaptive ARC eviction.
Traditionally ARC adaptation was limited to MRU/MFU distribution. But for years people with metadata-centric workload demanded mechanisms to also manage data/metadata distribution, that in original ZFS was just a FIFO. As result ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits etc, but it all required manual tuning, was not adaptive and in its heart remained a bad FIFO.

This change removes most of existing eviction logic, rewriting it from scratch. This makes MRU/MFU adaptation individual for data and metadata, same as the distribution between data and metadata themselves. Since most of required states separation was already done, it only required to make arcs_size state field specific per data/metadata. The adaptation logic is still based on previous concept of ghost hits, just now it balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata.

To simplify arc_c changes instead of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd and arc_pm, representing ARC balance between metadata and data, MRU and MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed point fractions. Since we care about the math result only when need to evict, this moves all the logic from arc_adapt() to arc_evict(), that reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag from many places.

This change also removes number of metadata specific tunables, part of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduced single opaque knob zfs_arc_meta_balance, tuning ARC's reaction on ghost hits, allowing administrator give more or less preference to metadata without setting strict limits.

Some of old code parts like arc_evict_meta() are just removed, because since introduction of ABD ARC they really make no sense: only headers referenced by small number of buffers are not evictable, and they are really not evictable no matter what this code do. Instead just call arc_prune_async() if too much metadata appear not evictable.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
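
The commit message describes arc_meta, arc_pd and arc_pm as 32-bit fixed point fractions that together split the ARC between four states. The sketch below shows one way to read that split in Python, using the same formulas that the updated cmd/arc_summary applies in this commit. The function name `arc_targets` and the sample values are hypothetical and only illustrate the arithmetic; this is not the kernel code.

```python
FRAC_ONE = 1 << 32  # 1.0 in 32-bit fixed point (arc_summary calls this `s`)


def arc_targets(meta, pd, pm):
    """Split the cache into four target fractions (still fixed point).

    meta: fraction of the cache targeted at metadata
    pd:   fraction of the data share targeted at MRU
    pm:   fraction of the metadata share targeted at MRU
    """
    data = FRAC_ONE - meta
    return {
        "mru_data": pd * data // FRAC_ONE,
        "mfu_data": (FRAC_ONE - pd) * data // FRAC_ONE,
        "mru_metadata": pm * meta // FRAC_ONE,
        "mfu_metadata": (FRAC_ONE - pm) * meta // FRAC_ONE,
    }


# Hypothetical balance: 25% metadata, data split evenly between MRU and MFU,
# metadata split 75% MRU and 25% MFU.
targets = arc_targets(meta=FRAC_ONE // 4, pd=FRAC_ONE // 2, pm=3 * FRAC_ONE // 4)
for name, frac in targets.items():
    print(f"{name}: {100 * frac / FRAC_ONE:.2f}%")
```

Because the three values are fractions rather than byte counts, a change of arc_c does not require rescaling them, which appears to be the simplification the message refers to.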
1 parent 5f42d1d commit 2d948d6

10 files changed: +478 -785 lines changed


cmd/arc_summary

Lines changed: 71 additions & 30 deletions
@@ -270,16 +270,14 @@ def draw_graph(kstats_dict):
     arc_perc = f_perc(arc_stats['size'], arc_stats['c_max'])
     mfu_size = f_bytes(arc_stats['mfu_size'])
     mru_size = f_bytes(arc_stats['mru_size'])
-    meta_limit = f_bytes(arc_stats['arc_meta_limit'])
     meta_size = f_bytes(arc_stats['arc_meta_used'])
     dnode_limit = f_bytes(arc_stats['arc_dnode_limit'])
     dnode_size = f_bytes(arc_stats['dnode_size'])

-    info_form = ('ARC: {0} ({1}) MFU: {2} MRU: {3} META: {4} ({5}) '
-                 'DNODE {6} ({7})')
+    info_form = ('ARC: {0} ({1}) MFU: {2} MRU: {3} META: {4} '
+                 'DNODE {5} ({6})')
     info_line = info_form.format(arc_size, arc_perc, mfu_size, mru_size,
-                                 meta_size, meta_limit, dnode_size,
-                                 dnode_limit)
+                                 meta_size, dnode_size, dnode_limit)
     info_spc = ' '*int((GRAPH_WIDTH-len(info_line))/2)
     info_line = GRAPH_INDENT+info_spc+info_line

@@ -558,16 +556,28 @@ def section_arc(kstats_dict):
     arc_target_size = arc_stats['c']
     arc_max = arc_stats['c_max']
     arc_min = arc_stats['c_min']
-    anon_size = arc_stats['anon_size']
-    mfu_size = arc_stats['mfu_size']
-    mru_size = arc_stats['mru_size']
-    mfug_size = arc_stats['mfu_ghost_size']
-    mrug_size = arc_stats['mru_ghost_size']
-    unc_size = arc_stats['uncached_size']
-    meta_limit = arc_stats['arc_meta_limit']
-    meta_size = arc_stats['arc_meta_used']
+    meta = arc_stats['meta']
+    pd = arc_stats['pd']
+    pm = arc_stats['pm']
+    anon_data = arc_stats['anon_data']
+    anon_metadata = arc_stats['anon_metadata']
+    mfu_data = arc_stats['mfu_data']
+    mfu_metadata = arc_stats['mfu_metadata']
+    mru_data = arc_stats['mru_data']
+    mru_metadata = arc_stats['mru_metadata']
+    mfug_data = arc_stats['mfu_ghost_data']
+    mfug_metadata = arc_stats['mfu_ghost_metadata']
+    mrug_data = arc_stats['mru_ghost_data']
+    mrug_metadata = arc_stats['mru_ghost_metadata']
+    unc_data = arc_stats['uncached_data']
+    unc_metadata = arc_stats['uncached_metadata']
+    bonus_size = arc_stats['bonus_size']
     dnode_limit = arc_stats['arc_dnode_limit']
     dnode_size = arc_stats['dnode_size']
+    dbuf_size = arc_stats['dbuf_size']
+    hdr_size = arc_stats['hdr_size']
+    l2_hdr_size = arc_stats['l2_hdr_size']
+    abd_chunk_waste_size = arc_stats['abd_chunk_waste_size']
     target_size_ratio = '{0}:1'.format(int(arc_max) // int(arc_min))

     prt_2('ARC size (current):',
@@ -578,25 +588,56 @@ def section_arc(kstats_dict):
            f_perc(arc_min, arc_max), f_bytes(arc_min))
     prt_i2('Max size (high water):',
            target_size_ratio, f_bytes(arc_max))
-    caches_size = int(anon_size)+int(mfu_size)+int(mru_size)+int(unc_size)
-    prt_i2('Anonymouns data size:',
-           f_perc(anon_size, caches_size), f_bytes(anon_size))
-    prt_i2('Most Frequently Used (MFU) cache size:',
-           f_perc(mfu_size, caches_size), f_bytes(mfu_size))
-    prt_i2('Most Recently Used (MRU) cache size:',
-           f_perc(mru_size, caches_size), f_bytes(mru_size))
-    prt_i1('Most Frequently Used (MFU) ghost size:', f_bytes(mfug_size))
-    prt_i1('Most Recently Used (MRU) ghost size:', f_bytes(mrug_size))
+    caches_size = int(anon_data)+int(anon_metadata)+\
+        int(mfu_data)+int(mfu_metadata)+int(mru_data)+int(mru_metadata)+\
+        int(unc_data)+int(unc_metadata)
+    prt_i2('Anonymous data size:',
+           f_perc(anon_data, caches_size), f_bytes(anon_data))
+    prt_i2('Anonymous metadata size:',
+           f_perc(anon_metadata, caches_size), f_bytes(anon_metadata))
+    s = 4294967296
+    v = (s-int(pd))*(s-int(meta))/s
+    prt_i2('MFU data target:', f_perc(v, s),
+           f_bytes(v / 65536 * caches_size / 65536))
+    prt_i2('MFU data size:',
+           f_perc(mfu_data, caches_size), f_bytes(mfu_data))
+    prt_i1('MFU ghost data size:', f_bytes(mfug_data))
+    v = (s-int(pm))*int(meta)/s
+    prt_i2('MFU metadata target:', f_perc(v, s),
+           f_bytes(v / 65536 * caches_size / 65536))
+    prt_i2('MFU metadata size:',
+           f_perc(mfu_metadata, caches_size), f_bytes(mfu_metadata))
+    prt_i1('MFU ghost metadata size:', f_bytes(mfug_metadata))
+    v = int(pd)*(s-int(meta))/s
+    prt_i2('MRU data target:', f_perc(v, s),
+           f_bytes(v / 65536 * caches_size / 65536))
+    prt_i2('MRU data size:',
+           f_perc(mru_data, caches_size), f_bytes(mru_data))
+    prt_i1('MRU ghost data size:', f_bytes(mrug_data))
+    v = int(pm)*int(meta)/s
+    prt_i2('MRU metadata target:', f_perc(v, s),
+           f_bytes(v / 65536 * caches_size / 65536))
+    prt_i2('MRU metadata size:',
+           f_perc(mru_metadata, caches_size), f_bytes(mru_metadata))
+    prt_i1('MRU ghost metadata size:', f_bytes(mrug_metadata))
     prt_i2('Uncached data size:',
-           f_perc(unc_size, caches_size), f_bytes(unc_size))
-    prt_i2('Metadata cache size (hard limit):',
-           f_perc(meta_limit, arc_max), f_bytes(meta_limit))
-    prt_i2('Metadata cache size (current):',
-           f_perc(meta_size, meta_limit), f_bytes(meta_size))
-    prt_i2('Dnode cache size (hard limit):',
-           f_perc(dnode_limit, meta_limit), f_bytes(dnode_limit))
-    prt_i2('Dnode cache size (current):',
+           f_perc(unc_data, caches_size), f_bytes(unc_data))
+    prt_i2('Uncached metadata size:',
+           f_perc(unc_metadata, caches_size), f_bytes(unc_metadata))
+    prt_i2('Bonus size:',
+           f_perc(bonus_size, arc_size), f_bytes(bonus_size))
+    prt_i2('Dnode cache target:',
+           f_perc(dnode_limit, arc_max), f_bytes(dnode_limit))
+    prt_i2('Dnode cache size:',
            f_perc(dnode_size, dnode_limit), f_bytes(dnode_size))
+    prt_i2('Dbuf size:',
+           f_perc(dbuf_size, arc_size), f_bytes(dbuf_size))
+    prt_i2('Header size:',
+           f_perc(hdr_size, arc_size), f_bytes(hdr_size))
+    prt_i2('L2 header size:',
+           f_perc(l2_hdr_size, arc_size), f_bytes(l2_hdr_size))
+    prt_i2('ABD chunk waste size:',
+           f_perc(abd_chunk_waste_size, arc_size), f_bytes(abd_chunk_waste_size))
     print()

     print('ARC hash breakdown:')
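
The target lines added to section_arc() convert a fixed-point fraction into bytes with the expression `v / 65536 * caches_size / 65536`. A minimal sketch of that conversion follows; dividing by 65536 twice is the same as dividing once by 2**32, the fixed-point unit. The helper name `target_bytes` and the sample sizes are hypothetical.

```python
FRAC_ONE = 1 << 32  # fixed-point 1.0, the value arc_summary stores in `s`


def target_bytes(frac, caches_size):
    """Convert a 32-bit fixed-point share of the cache into a byte count."""
    # Same arithmetic as the diff: two divisions by 2**16 equal one by 2**32.
    return frac / 65536 * caches_size / 65536


# Hypothetical numbers: a 25% target applied to 8 GiB of cached buffers.
quarter = FRAC_ONE // 4
cached = 8 * 1024**3
print(target_bytes(quarter, cached) / 1024**3, "GiB")
```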

cmd/zdb/zdb.c

Lines changed: 2 additions & 3 deletions
@@ -116,7 +116,6 @@ zdb_ot_name(dmu_object_type_t type)

 extern int reference_tracking_enable;
 extern int zfs_recover;
-extern unsigned long zfs_arc_meta_min, zfs_arc_meta_limit;
 extern uint_t zfs_vdev_async_read_max_active;
 extern boolean_t spa_load_verify_dryrun;
 extern boolean_t spa_mode_readable_spacemaps;
@@ -8656,8 +8655,8 @@ main(int argc, char **argv)
 	 * ZDB does not typically re-read blocks; therefore limit the ARC
 	 * to 256 MB, which can be used entirely for metadata.
 	 */
-	zfs_arc_min = zfs_arc_meta_min = 2ULL << SPA_MAXBLOCKSHIFT;
-	zfs_arc_max = zfs_arc_meta_limit = 256 * 1024 * 1024;
+	zfs_arc_min = 2ULL << SPA_MAXBLOCKSHIFT;
+	zfs_arc_max = 256 * 1024 * 1024;
 #endif

 	/*
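
As a quick check of the two limits that zdb keeps after this change, the arithmetic can be worked through in Python. The value of SPA_MAXBLOCKSHIFT is not shown in this diff; 24 (a 16 MiB maximum block size) is an assumption made for this sketch.

```python
SPA_MAXBLOCKSHIFT = 24  # assumption: not shown in the diff

MIB = 1024 * 1024

zfs_arc_min = 2 << SPA_MAXBLOCKSHIFT   # two maximum-size blocks
zfs_arc_max = 256 * 1024 * 1024        # the 256 MB cap named in the comment

print(zfs_arc_min // MIB, "MiB minimum")
print(zfs_arc_max // MIB, "MiB maximum")
```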

include/sys/arc.h

Lines changed: 0 additions & 1 deletion
@@ -200,7 +200,6 @@ struct arc_buf {
 };

 typedef enum arc_buf_contents {
-	ARC_BUFC_INVALID,			/* invalid type */
 	ARC_BUFC_DATA,				/* buffer contains data */
 	ARC_BUFC_METADATA,			/* buffer contains metadata */
 	ARC_BUFC_NUMTYPES

include/sys/arc_impl.h

Lines changed: 29 additions & 12 deletions
@@ -82,15 +82,18 @@ typedef struct arc_state {
 	 * supports the "dbufs" kstat
 	 */
 	arc_state_type_t arcs_state;
+	/*
+	 * total amount of data in this state.
+	 */
+	zfs_refcount_t arcs_size[ARC_BUFC_NUMTYPES] ____cacheline_aligned;
 	/*
 	 * total amount of evictable data in this state
 	 */
-	zfs_refcount_t arcs_esize[ARC_BUFC_NUMTYPES] ____cacheline_aligned;
+	zfs_refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
 	/*
-	 * total amount of data in this state; this includes: evictable,
-	 * non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
+	 * amount of hit bytes for this state (counted only for ghost states)
 	 */
-	zfs_refcount_t arcs_size;
+	wmsum_t arcs_hits[ARC_BUFC_NUMTYPES];
 } arc_state_t;

 typedef struct arc_callback arc_callback_t;
@@ -358,8 +361,9 @@ typedef struct l2arc_lb_ptr_buf {
 #define	L2BLK_SET_PREFETCH(field, x)	BF64_SET((field), 39, 1, x)
 #define	L2BLK_GET_CHECKSUM(field)	BF64_GET((field), 40, 8)
 #define	L2BLK_SET_CHECKSUM(field, x)	BF64_SET((field), 40, 8, x)
-#define	L2BLK_GET_TYPE(field)		BF64_GET((field), 48, 8)
-#define	L2BLK_SET_TYPE(field, x)	BF64_SET((field), 48, 8, x)
+/* +/- 1 here are to keep compatibility after ARC_BUFC_INVALID removal. */
+#define	L2BLK_GET_TYPE(field)		(BF64_GET((field), 48, 8) - 1)
+#define	L2BLK_SET_TYPE(field, x)	BF64_SET((field), 48, 8, (x) + 1)
 #define	L2BLK_GET_PROTECTED(field)	BF64_GET((field), 56, 1)
 #define	L2BLK_SET_PROTECTED(field, x)	BF64_SET((field), 56, 1, x)
 #define	L2BLK_GET_STATE(field)		BF64_GET((field), 57, 4)
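
The +/- 1 in the L2BLK type macros keeps the stored 8-bit type value unchanged even though removing ARC_BUFC_INVALID shifts the in-memory enum down by one. The sketch below models this in Python. `bf64_get` and `bf64_set` are simplified stand-ins written for this example, not the real BF64_GET/BF64_SET macros, and the enum values are inferred from the enum order shown in the include/sys/arc.h hunk.

```python
def bf64_get(field, low, length):
    """Read `length` bits starting at bit `low` of a 64-bit word."""
    return (field >> low) & ((1 << length) - 1)


def bf64_set(field, low, length, value):
    """Return `field` with `length` bits at bit `low` replaced by `value`."""
    mask = ((1 << length) - 1) << low
    return (field & ~mask) | ((value << low) & mask)


# Enum values after ARC_BUFC_INVALID was removed (previously 1 and 2).
ARC_BUFC_DATA, ARC_BUFC_METADATA = 0, 1


def l2blk_set_type(field, x):
    return bf64_set(field, 48, 8, x + 1)   # store the old numbering


def l2blk_get_type(field):
    return bf64_get(field, 48, 8) - 1      # convert back to the new numbering


word = l2blk_set_type(0, ARC_BUFC_METADATA)
print(bf64_get(word, 48, 8), l2blk_get_type(word))
```

In this model a log block written before the change, which stores 2 for metadata, still decodes to ARC_BUFC_METADATA under the new enum.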
@@ -582,7 +586,9 @@ typedef struct arc_stats {
 	kstat_named_t arcstat_hash_collisions;
 	kstat_named_t arcstat_hash_chains;
 	kstat_named_t arcstat_hash_chain_max;
-	kstat_named_t arcstat_p;
+	kstat_named_t arcstat_meta;
+	kstat_named_t arcstat_pd;
+	kstat_named_t arcstat_pm;
 	kstat_named_t arcstat_c;
 	kstat_named_t arcstat_c_min;
 	kstat_named_t arcstat_c_max;
@@ -655,6 +661,8 @@ typedef struct arc_stats {
 	 * are all included in this value.
 	 */
 	kstat_named_t arcstat_anon_size;
+	kstat_named_t arcstat_anon_data;
+	kstat_named_t arcstat_anon_metadata;
 	/*
 	 * Number of bytes consumed by ARC buffers that meet the
 	 * following criteria: backing buffers of type ARC_BUFC_DATA,
@@ -676,6 +684,8 @@ typedef struct arc_stats {
 	 * are all included in this value.
 	 */
 	kstat_named_t arcstat_mru_size;
+	kstat_named_t arcstat_mru_data;
+	kstat_named_t arcstat_mru_metadata;
 	/*
 	 * Number of bytes consumed by ARC buffers that meet the
 	 * following criteria: backing buffers of type ARC_BUFC_DATA,
@@ -700,6 +710,8 @@ typedef struct arc_stats {
 	 * buffers *would have* consumed this number of bytes.
 	 */
 	kstat_named_t arcstat_mru_ghost_size;
+	kstat_named_t arcstat_mru_ghost_data;
+	kstat_named_t arcstat_mru_ghost_metadata;
 	/*
 	 * Number of bytes that *would have been* consumed by ARC
 	 * buffers that are eligible for eviction, of type
@@ -719,6 +731,8 @@ typedef struct arc_stats {
 	 * are all included in this value.
 	 */
 	kstat_named_t arcstat_mfu_size;
+	kstat_named_t arcstat_mfu_data;
+	kstat_named_t arcstat_mfu_metadata;
 	/*
 	 * Number of bytes consumed by ARC buffers that are eligible for
 	 * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
@@ -737,6 +751,8 @@ typedef struct arc_stats {
 	 * arcstat_mru_ghost_size for more details.
 	 */
 	kstat_named_t arcstat_mfu_ghost_size;
+	kstat_named_t arcstat_mfu_ghost_data;
+	kstat_named_t arcstat_mfu_ghost_metadata;
 	/*
 	 * Number of bytes that *would have been* consumed by ARC
 	 * buffers that are eligible for eviction, of type
@@ -754,6 +770,8 @@ typedef struct arc_stats {
 	 * ARC_FLAG_UNCACHED being set.
 	 */
 	kstat_named_t arcstat_uncached_size;
+	kstat_named_t arcstat_uncached_data;
+	kstat_named_t arcstat_uncached_metadata;
 	/*
 	 * Number of data bytes that are going to be evicted from ARC due to
 	 * ARC_FLAG_UNCACHED being set.
@@ -876,10 +894,7 @@ typedef struct arc_stats {
 	kstat_named_t arcstat_loaned_bytes;
 	kstat_named_t arcstat_prune;
 	kstat_named_t arcstat_meta_used;
-	kstat_named_t arcstat_meta_limit;
 	kstat_named_t arcstat_dnode_limit;
-	kstat_named_t arcstat_meta_max;
-	kstat_named_t arcstat_meta_min;
 	kstat_named_t arcstat_async_upgrade_sync;
 	/* Number of predictive prefetch requests. */
 	kstat_named_t arcstat_predictive_prefetch;
@@ -987,7 +1002,7 @@ typedef struct arc_sums {
 	wmsum_t arcstat_memory_direct_count;
 	wmsum_t arcstat_memory_indirect_count;
 	wmsum_t arcstat_prune;
-	aggsum_t arcstat_meta_used;
+	wmsum_t arcstat_meta_used;
 	wmsum_t arcstat_async_upgrade_sync;
 	wmsum_t arcstat_predictive_prefetch;
 	wmsum_t arcstat_demand_hit_predictive_prefetch;
@@ -1015,7 +1030,9 @@ typedef struct arc_evict_waiter {
 #define	ARCSTAT_BUMPDOWN(stat)	ARCSTAT_INCR(stat, -1)

 #define	arc_no_grow	ARCSTAT(arcstat_no_grow) /* do not grow cache size */
-#define	arc_p		ARCSTAT(arcstat_p)	/* target size of MRU */
+#define	arc_meta	ARCSTAT(arcstat_meta)	/* target frac of metadata */
+#define	arc_pd		ARCSTAT(arcstat_pd)	/* target frac of data MRU */
+#define	arc_pm		ARCSTAT(arcstat_pm)	/* target frac of meta MRU */
 #define	arc_c		ARCSTAT(arcstat_c)	/* target size of cache */
 #define	arc_c_min	ARCSTAT(arcstat_c_min)	/* min target cache size */
 #define	arc_c_max	ARCSTAT(arcstat_c_max)	/* max target cache size */

man/man4/zfs.4

Lines changed: 4 additions & 78 deletions
@@ -558,14 +558,6 @@ This value acts as a ceiling to the amount of dnode metadata, and defaults to
 which indicates that a percent which is based on
 .Sy zfs_arc_dnode_limit_percent
 of the ARC meta buffers that may be used for dnodes.
-.Pp
-Also see
-.Sy zfs_arc_meta_prune
-which serves a similar purpose but is used
-when the amount of metadata in the ARC exceeds
-.Sy zfs_arc_meta_limit
-rather than in response to overall demand for non-metadata.
-.
 .It Sy zfs_arc_dnode_limit_percent Ns = Ns Sy 10 Ns % Pq u64
 Percentage that can be consumed by dnodes of ARC meta buffers.
 .Pp
@@ -648,62 +640,10 @@ It cannot be set back to
 while running, and reducing it below the current ARC size will not cause
 the ARC to shrink without memory pressure to induce shrinking.
 .
-.It Sy zfs_arc_meta_adjust_restarts Ns = Ns Sy 4096 Pq uint
-The number of restart passes to make while scanning the ARC attempting
-the free buffers in order to stay below the
-.Sy fs_arc_meta_limit .
-This value should not need to be tuned but is available to facilitate
-performance analysis.
-.
-.It Sy zfs_arc_meta_limit Ns = Ns Sy 0 Ns B Pq u64
-The maximum allowed size in bytes that metadata buffers are allowed to
-consume in the ARC.
-When this limit is reached, metadata buffers will be reclaimed,
-even if the overall
-.Sy arc_c_max
-has not been reached.
-It defaults to
-.Sy 0 ,
-which indicates that a percentage based on
-.Sy zfs_arc_meta_limit_percent
-of the ARC may be used for metadata.
-.Pp
-This value my be changed dynamically, except that must be set to an explicit
-value
-.Pq cannot be set back to Sy 0 .
-.
-.It Sy zfs_arc_meta_limit_percent Ns = Ns Sy 75 Ns % Pq u64
-Percentage of ARC buffers that can be used for metadata.
-.Pp
-See also
-.Sy zfs_arc_meta_limit ,
-which serves a similar purpose but has a higher priority if nonzero.
-.
-.It Sy zfs_arc_meta_min Ns = Ns Sy 0 Ns B Pq u64
-The minimum allowed size in bytes that metadata buffers may consume in
-the ARC.
-.
-.It Sy zfs_arc_meta_prune Ns = Ns Sy 10000 Pq int
-The number of dentries and inodes to be scanned looking for entries
-which can be dropped.
-This may be required when the ARC reaches the
-.Sy zfs_arc_meta_limit
-because dentries and inodes can pin buffers in the ARC.
-Increasing this value will cause to dentry and inode caches
-to be pruned more aggressively.
-Setting this value to
-.Sy 0
-will disable pruning the inode and dentry caches.
-.
-.It Sy zfs_arc_meta_strategy Ns = Ns Sy 1 Ns | Ns 0 Pq uint
-Define the strategy for ARC metadata buffer eviction (meta reclaim strategy):
-.Bl -tag -compact -offset 4n -width "0 (META_ONLY)"
-.It Sy 0 Pq META_ONLY
-evict only the ARC metadata buffers
-.It Sy 1 Pq BALANCED
-additional data buffers may be evicted if required
-to evict the required number of metadata buffers.
-.El
+.It Sy zfs_arc_meta_balance Ns = Ns Sy 500 Pq uint
+Balance between metadata and data on ghost hits.
+Values above 100 increase metadata caching by proportionally reducing effect
+of ghost data hits on target data/metadata rate.
 .
 .It Sy zfs_arc_min Ns = Ns Sy 0 Ns B Pq u64
 Min size of ARC in bytes.
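
The new zfs_arc_meta_balance text says that values above 100 proportionally reduce the effect of ghost data hits. One way to read that is as a weight of 100/balance applied to data ghost hits, sketched below as a toy model. The function `weighted_ghost_hits` is hypothetical and is not the arithmetic used in arc_evict(); only the default of 500 and the threshold of 100 come from the man page.

```python
def weighted_ghost_hits(meta_hits, data_hits, balance=500):
    """Toy model: scale data ghost hits by 100/balance, leave metadata as is."""
    if balance <= 0:
        raise ValueError("balance must be positive")
    return meta_hits, data_hits * 100 // balance


# With the default of 500, equal ghost hit counts favour metadata 5:1.
print(weighted_ghost_hits(1000, 1000))
# With balance=100 the two kinds of ghost hits carry equal weight.
print(weighted_ghost_hits(1000, 1000, balance=100))
```

Under this reading the knob shifts preference toward or away from metadata without imposing the hard limits that the removed tunables set.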
@@ -786,20 +726,6 @@ causes the ARC to start reclamation if it exceeds the target size by
 of the target size, and block allocations by
 .Em 0.6% .
 .
-.It Sy zfs_arc_p_min_shift Ns = Ns Sy 0 Pq uint
-If nonzero, this will update
-.Sy arc_p_min_shift Pq default Sy 4
-with the new value.
-.Sy arc_p_min_shift No is used as a shift of Sy arc_c
-when calculating the minumum
-.Sy arc_p No size .
-.
-.It Sy zfs_arc_p_dampener_disable Ns = Ns Sy 1 Ns | Ns 0 Pq int
-Disable
-.Sy arc_p
-adapt dampener, which reduces the maximum single adjustment to
-.Sy arc_p .
-.
 .It Sy zfs_arc_shrink_shift Ns = Ns Sy 0 Pq uint
 If nonzero, this will update
 .Sy arc_shrink_shift Pq default Sy 7
