Force niofs for fdt tmp file read access when flushing stored fields #129538

Merged
merged 14 commits into from
Jun 23, 2025

martijnvg
Member

@martijnvg martijnvg commented Jun 17, 2025

Because of how stored fields are flushed when index sorting is active, we can encounter significant page cache faults when memory is scarce. To mitigate some of the resulting slowness, we plan to no longer mmap the fdt temp file. This is initially behind a feature flag, so that we can check for unforeseen side effects.

Typically, always using the mmap directory is better than using the niofs directory, provided there is sufficient memory available to the OS for filesystem caching. When that isn't the case, indexing performance can vary a lot (and is often very slow). This is even more true for the tmp files that stored fields create during flushing. These files exist only briefly, to sort stored fields in the order of the configured index sorting, and are then removed. If these tmp files are mmapped, there is a risk of thrashing the file system cache.
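
To make the two access styles concrete, here is a minimal, self-contained sketch using only the JDK. It is not the Elasticsearch or Lucene implementation; the class and method names are hypothetical. It checksums the same short-lived file once through a memory mapping and once through explicit positional NIO reads into a small heap buffer. Both produce the same checksum; the difference is only in how the bytes are brought into the process.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Hypothetical illustration only; not code from this PR.
public final class TmpFileReadSketch {

    /** Checksums the file through a read-only memory mapping (files up to 2 GiB only). */
    static long checksumWithMmap(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Pages of the mapping are faulted in as the buffer is traversed.
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            CRC32 crc = new CRC32();
            crc.update(mapped);
            return crc.getValue();
        }
    }

    /** Checksums the file with explicit positional reads into a fixed-size heap buffer. */
    static long checksumWithNioReads(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(8192);
            CRC32 crc = new CRC32();
            long pos = 0;
            while (true) {
                buf.clear();
                int n = ch.read(buf, pos); // returns -1 at end of file
                if (n < 0) {
                    break;
                }
                pos += n;
                buf.flip();
                crc.update(buf);
            }
            return crc.getValue();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stored-fields-sketch", ".tmp");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[64 * 1024]);
        System.out.println(checksumWithMmap(tmp) == checksumWithNioReads(tmp));
    }
}
```

The sketch says nothing about performance by itself; it only shows that the explicit-read path needs no mapping of the temp file at all.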

This change only avoids using mmap for the fdt tmp file. This is the file that actually contains the data, and it can be large compared to the other files that get flushed. The fdm (metadata) and fdi (stored field index) files remain mmapped.
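
As a rough sketch of the idea, the decision can be pictured as a per-file routing check. The code below is an illustration under stated assumptions, not the code in this PR: the class, enum, and method names are hypothetical, and so is the assumption that the temp file can be recognized by a `.tmp` extension plus `fdt` appearing in its name. The file names in the usage example are made up.

```java
// Hypothetical illustration only; names and the matching rule are assumptions.
public final class FdtTmpRoutingSketch {

    enum ReadStrategy { MMAP, NIOFS }

    /** Assumed rule: a temp file (".tmp" extension) whose name contains "fdt". */
    static boolean isFdtTmpFile(String fileName) {
        return fileName.endsWith(".tmp") && fileName.contains("fdt");
    }

    /** Routes only fdt temp files away from mmap, and only when the feature flag is on. */
    static ReadStrategy chooseReadStrategy(String fileName, boolean featureFlagEnabled) {
        if (featureFlagEnabled && isFdtTmpFile(fileName)) {
            return ReadStrategy.NIOFS;
        }
        return ReadStrategy.MMAP;
    }

    public static void main(String[] args) {
        // Made-up file names, used only to exercise the rule above.
        System.out.println(chooseReadStrategy("_0.fdt_1.tmp", true));
        System.out.println(chooseReadStrategy("_0.fdm_1.tmp", true));
        System.out.println(chooseReadStrategy("_0.fdt_1.tmp", false));
    }
}
```

A check like this depends on the naming of the temp file, so it would need revisiting if that naming ever changes.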

(labelling as non-issue, until feature flag has been removed)

…fields and

force direct io for checksumming fdt tmp file.
@martijnvg martijnvg changed the title Force normal read advice for stored field temp fdt files Force normal niofs for fdt tmp file read access when flushing stored fields Jun 17, 2025
@martijnvg martijnvg changed the title Force normal niofs for fdt tmp file read access when flushing stored fields Force niofs for fdt tmp file read access when flushing stored fields Jun 18, 2025
@elasticsearchmachine
Collaborator

Hi @martijnvg, I've created a changelog YAML for you.

@martijnvg martijnvg marked this pull request as ready for review June 20, 2025 12:38
@martijnvg martijnvg requested a review from ChrisHegarty June 20, 2025 12:38
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Contributor

@ChrisHegarty ChrisHegarty left a comment

Makes sense to me @martijnvg. LGTM

Longer term we should consider how to migrate to using the new IOContext Hints in 10.3, so as to avoid the fragile dependency on the file name.

@martijnvg
Member Author

> Longer term we should consider how to migrate to using the new IOContext Hints in 10.3, so as to avoid the fragile dependency on the file name.

I will add an item on the roadmap for this.

@martijnvg
Member Author

Running the elastic/logsdb track in logsdb mode with stored source, without this change as the baseline and with the change as the contender, shows very good results overall:

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---:|---:|---:|---|---:|
| Min Throughput | bulk-index | 972.745 | 1173.39 | 200.643 | docs/s | +20.63% |
| Mean Throughput | bulk-index | 18176.4 | 29977.2 | 11800.8 | docs/s | +64.92% |
| Median Throughput | bulk-index | 16603.4 | 30020.8 | 13417.4 | docs/s | +80.81% |
| Max Throughput | bulk-index | 30518.8 | 33777.2 | 3258.42 | docs/s | +10.68% |
| 50th percentile latency | bulk-index | 1669.65 | 1798.24 | 128.593 | ms | +7.70% |
| 90th percentile latency | bulk-index | 3553.13 | 3048.81 | -504.314 | ms | -14.19% |
| 99th percentile latency | bulk-index | 32530.3 | 5231.3 | -27299 | ms | -83.92% |
| 99.9th percentile latency | bulk-index | 338651 | 10009.8 | -328641 | ms | -97.04% |
| 99.99th percentile latency | bulk-index | 1.39557e+06 | 14470.3 | -1.3811e+06 | ms | -98.96% |
| 100th percentile latency | bulk-index | 2.57922e+06 | 21935.1 | -2.55728e+06 | ms | -99.15% |
| 50th percentile service time | bulk-index | 1672.05 | 1788.15 | 116.102 | ms | +6.94% |
| 90th percentile service time | bulk-index | 3501.96 | 3052.43 | -449.524 | ms | -12.84% |
| 99th percentile service time | bulk-index | 32819.8 | 5273.51 | -27546.3 | ms | -83.93% |
| 99.9th percentile service time | bulk-index | 338084 | 9954.16 | -328130 | ms | -97.06% |
| 99.99th percentile service time | bulk-index | 1.35413e+06 | 14453.1 | -1.33968e+06 | ms | -98.93% |
| 100th percentile service time | bulk-index | 2.57922e+06 | 21935.1 | -2.55728e+06 | ms | -99.15% |
| error rate | bulk-index | 0 | 0 | 0 | % | 0.00% |

In particular, median indexing throughput improves by ~80%, and some latencies improve by ~99%.
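
For readers who want to check the percentages, the relative change is the contender minus the baseline, divided by the baseline. The small sketch below (hypothetical class and method names) applies that formula to two rows of the results above; the outputs agree with the reported +80.81% and -83.92% after rounding to two decimals.

```java
import java.util.Locale;

// Hypothetical helper for re-deriving the relative change from two measurements.
public final class DiffPercentSketch {

    /** Relative change of contender versus baseline, in percent. */
    static double diffPercent(double baseline, double contender) {
        return (contender - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Median throughput (docs/s) and 99th percentile latency (ms) from the results above.
        System.out.printf(Locale.ROOT, "median throughput: %+.2f%%%n", diffPercent(16603.4, 30020.8));
        System.out.printf(Locale.ROOT, "p99 latency: %+.2f%%%n", diffPercent(32530.3, 5231.3));
    }
}
```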

@martijnvg martijnvg merged commit 41f6981 into elastic:main Jun 23, 2025
27 checks passed
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
julian-elastic pushed a commit to julian-elastic/elasticsearch that referenced this pull request Jun 24, 2025
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Jun 30, 2025
martijnvg added a commit that referenced this pull request Jun 30, 2025
…129538) (#130312)
