Make it possible to configure backpressure delay #164

Open · wants to merge 164 commits into base: REL_14_STABLE_neon

Conversation

knizhnik

No description provided.

lubennikovaav and others added 30 commits February 9, 2022 19:34
Make smgr API pluggable. Add smgr_hook that can be used to define custom smgrs.
Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads
f_smgr implementation using smgr_hook.

Also add smgr_init_hook and smgr_shutdown_hook.
And a lot of mechanical changes in smgr.c functions.

This patch is proposed to community: https://commitfest.postgresql.org/33/3216/

Author: anastasia <lubennikovaav@gmail.com>
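For illustration, a minimal sketch of how an extension might select a custom smgr through this hook; the hook and accessor signatures here are assumptions based on the commit message, not verbatim from the patch:

```c
/*
 * Minimal sketch of an extension plugging in a custom smgr via smgr_hook.
 * The exact hook/typedef signatures are assumptions, not the patch's code.
 */
#include "postgres.h"
#include "fmgr.h"
#include "storage/smgr.h"

PG_MODULE_MAGIC;

static smgr_hook_type prev_smgr_hook = NULL;

/* Pick an f_smgr implementation for the given relation. */
static const f_smgr *
my_smgr_selector(BackendId backend, RelFileNode rnode)
{
    /* Delegate to whatever was installed before us, else the built-in smgr. */
    if (prev_smgr_hook)
        return prev_smgr_hook(backend, rnode);
    return smgr_standard(backend, rnode);   /* assumed accessor for the standard smgr */
}

void
_PG_init(void)
{
    prev_smgr_hook = smgr_hook;
    smgr_hook = my_smgr_selector;           /* loaded via shared_preload_libraries */
}
```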
Add contrib/zenith that handles interaction with the remote pagestore.
To use it, add shared_preload_libraries = 'zenith' to postgresql.conf.

It adds a protocol for network communication (see libpagestore.c)
and implements the smgr API.

Also it adds several custom GUC variables:
- zenith.page_server_connstring
- zenith.callmemaybe_connstring
- zenith.zenith_timeline
- zenith.wal_redo

Authors:
Stas Kelvich <stanconn@gmail.com>
Konstantin Knizhnik <knizhnik@garret.ru>
Heikki Linnakangas <heikki.linnakangas@iki.fi>
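For illustration, a hedged sketch of how such GUCs are typically registered from the extension's _PG_init(); the variable names follow the commit message, while the descriptions, defaults, and flags are illustrative:

```c
/* Illustrative only: GUC registration as it might look in contrib/zenith. */
#include "postgres.h"
#include "utils/guc.h"

static char *page_server_connstring;
static char *callmemaybe_connstring;
static char *zenith_timeline;
static bool  wal_redo = false;

static void
zenith_define_gucs(void)
{
    DefineCustomStringVariable("zenith.page_server_connstring",
                               "Connection string for the remote pagestore",
                               NULL, &page_server_connstring, "",
                               PGC_POSTMASTER, 0, NULL, NULL, NULL);
    DefineCustomStringVariable("zenith.callmemaybe_connstring",
                               "Connection string the pageserver uses to call back",
                               NULL, &callmemaybe_connstring, "",
                               PGC_POSTMASTER, 0, NULL, NULL, NULL);
    DefineCustomStringVariable("zenith.zenith_timeline",
                               "Timeline id served by the pageserver",
                               NULL, &zenith_timeline, "",
                               PGC_POSTMASTER, 0, NULL, NULL, NULL);
    DefineCustomBoolVariable("zenith.wal_redo",
                             "Start this postgres instance in WAL redo mode",
                             NULL, &wal_redo, false,
                             PGC_POSTMASTER, 0, NULL, NULL, NULL);
}
```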
Add a WAL redo helper for zenith - an alternative postgres operation mode that replays WAL at the pageserver's request.

To start postgres in WAL redo mode, run postgres with the --wal-redo option.
It requires the zenith shared library and the zenith.wal_redo GUC.

Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver.

Authors:
Konstantin Knizhnik <knizhnik@garret.ru>
Heikki Linnakangas <heikki.linnakangas@iki.fi>
In the test_createdb test, we created a new database, and created a new
branch after that. I was seeing the test fail with:

    PANIC:  could not open critical system index 2662

The WAL contained records like this:

    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Database    len (rec/tot):     42/    42, tx:        540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Transaction len (rec/tot):     66/    66, tx:        540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

The compute node had correctly replayed all the WAL up to the last
record, and opened up. But when you tried to connect to the new
database, the very first requests for the critical relations, like
pg_class, were made with request LSN 0/01642990. That's the last
record that's applicable to a particular block. Because the database
CREATE record didn't bump up the "last written LSN", the getpage
requests were made with too old LSN.

I fixed this by adding a SetLastWrittenLSN() call to the redo of
database CREATE record. It probably wouldn't hurt to also throw in a
call at the end of WAL replay, but let's see if we bump into more
cases like this first.

This doesn't seem to be happening with page server as of 'main'; I was
testing with a version where I had temporarily reverted all the recent
changes to reconstruct control file, checkpoints, relmapper files
etc. from the WAL records in the page server, so that the compute node
was redoing all the WAL. I'm pretty sure we need this fix even with
'main', even though this test case wasn't failing there right now.
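A hedged sketch of the shape of that fix, with the surrounding dbase_redo() code abridged; SetLastWrittenLSN() is the helper described above, everything else follows stock PostgreSQL:

```c
/* Abridged sketch of the fix in dbase_redo(); not the full function body. */
#include "postgres.h"
#include "access/xlogreader.h"
#include "commands/dbcommands_xlog.h"

void
dbase_redo(XLogReaderState *record)
{
    uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

    if (info == XLOG_DBASE_CREATE)
    {
        /* ... copy the template database directory, as in stock PostgreSQL ... */

        /*
         * Advance the "last written LSN" so that subsequent getpage requests
         * for the new database's relations use an LSN that covers this record.
         */
        SetLastWrittenLSN(record->EndRecPtr);
    }
    /* ... XLOG_DBASE_DROP handling unchanged ... */
}
```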
Some operations in PostgreSQL are not WAL-logged at all (e.g. hint bits)
or delay WAL-logging until the end of the operation (e.g. index build).
So if such a page is evicted, we will lose the update.

To fix this, we introduce a PD_WAL_LOGGED bit to track whether the page was WAL-logged.
If the page is evicted before it has been WAL-logged, the zenith smgr creates a full-page image (FPI) for it.

Authors:
Konstantin Knizhnik <knizhnik@garret.ru>
anastasia <lubennikovaav@gmail.com>
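A hedged sketch of the eviction-time check this describes; PD_WAL_LOGGED comes from the commit message, while the function name and exact placement in the zenith smgr write path are assumptions:

```c
/* Illustrative sketch: emit an FPI for a page that was never WAL-logged. */
#include "postgres.h"
#include "access/xloginsert.h"
#include "storage/bufpage.h"
#include "storage/relfilenode.h"

static void
zenith_log_page_if_needed(Page page, RelFileNode rnode,
                          ForkNumber forknum, BlockNumber blkno)
{
    /*
     * PD_WAL_LOGGED (added by this patch) is set once the page's changes have
     * made it into WAL.  If the buffer is evicted before that happened, emit
     * a full-page image so the pageserver can reconstruct the page.
     */
    if ((((PageHeader) page)->pd_flags & PD_WAL_LOGGED) == 0)
    {
        XLogRecPtr lsn = log_newpage(&rnode, forknum, blkno, page, true);

        PageSetLSN(page, lsn);
        ((PageHeader) page)->pd_flags |= PD_WAL_LOGGED;
    }
}
```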
Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors

Author: Konstantin Knizhnik <knizhnik@garret.ru>
Ignore unlogged table qualifier. Add respective changes to regression test outputs.

Author: Konstantin Knizhnik <knizhnik@garret.ru>
Request relation size via smgr function, not just stat(filepath).
Author: Konstantin Knizhnik <knizhnik@garret.ru>
Author: Konstantin Knizhnik <knizhnik@garret.ru>
…mmon error. TODO: add a comment, why this is fine for zenith.
…d of WAL page header, then return it back to the page origin
…of WAL at compute node

+ Check for presence of replication slot
…t inside.

The WAL proposer (as a bgworker without BGWORKER_BACKEND_DATABASE_CONNECTION) previously
ignored SetLatch, so once caught up it got stuck inside WalProposerPoll indefinitely.

Further, WaitEventSetWait didn't have a timeout, so we also didn't try to reconnect when
all connections were dead. Fix that.

Also move break on latch set to the end of the loop to attempt
ReconnectWalKeepers even if latch is constantly set.

Per test_race_conditions (Python version now).
…kpoint from WAL

+ Check for the presence of the zenith.signal file to allow skipping reading the checkpoint record from WAL

+ Pass prev_record_ptr through zenith.signal file to postgres
This patch aims to make our bespoke WAL redo machinery more robust
in the presence of untrusted (in other words, possibly malicious) inputs.

Pageserver delegates complex WAL decoding duties to postgres,
which means that the latter might fall victim to carefully designed
malicious WAL records and start doing harmful things to the system.
To prevent this, it has been decided to limit possible interactions
with the outside world using the Secure Computing BPF mode.

We use this mode to disable all syscalls not in the allowlist.
Please refer to src/backend/postmaster/seccomp.c to learn more
about the pros & cons of the current approach.

+ Fix some bugs in seccomp bpf wrapper

* Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals.
* Add a missing variant of the select() syscall (thx to @knizhnik).
* Write error messages to the fd that stderr currently points to.
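For readers unfamiliar with Secure Computing BPF, a minimal libseccomp sketch of the allowlist approach described above; the real syscall set lives in src/backend/postmaster/seccomp.c and is larger than this:

```c
/* Minimal libseccomp sketch of an allowlist; not the actual seccomp.c contents. */
#include <seccomp.h>
#include <stdlib.h>

static void
enter_seccomp_mode(void)
{
    /* Trap (rather than kill) on a denied syscall, so a signal can be reported. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);

    if (ctx == NULL)
        abort();

    /* Allow only what the WAL redo loop needs: stdio and exit. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(select), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    if (seccomp_load(ctx) != 0)
        abort();

    seccomp_release(ctx);
}
```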
…ause it causes a memory leak in wal-redo-postgres

2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR

+ Call smgr_init_standard from smgr_init_zenith
This patch adds support for the zenith_tenant variable. It has a similar
format to zenith_timeline. It is used in the callmemaybe query to pass the
tenant to the pageserver, and in the ServerInfo structure passed to the WAL acceptor.
…recovery.

Rust's postgres_backend is currently too simplistic to handle it properly: reading
happens in a separate thread which just ignores CopyDone. Instead, the writer thread
must become aware of the termination and send CommandComplete. Also, the reading socket must
be transferred back to postgres_backend (or the connection terminated completely
after COPY). Let's do that after more basic safekeeper refactoring; for now,
cover this up to make tests pass.

ref #388
…ion position in wal_proposer to segment boundary
…ugging.

Now it contains only one function test_consume_xids() for xid wraparound testing.
@lubennikovaav

UPD: I tested it with 10MB max_replication_write_lag and test_bulk_insert still fails on my laptop

Is it a debug or a release build?

It is the debug build.

I am a bit surprised that a bigger delay shows a better result.

I cannot reproduce this result. In my case, increasing zenith_backpressure_delay to 100 ms just increases the time to 547 seconds.

I double-checked, and it must have been some interfering process on my laptop. Results are around 600 seconds.

Maybe the time of test_bulk_insert increased after reducing max_replication_write_lag. But that is expected (more aggressive backpressure may increase write time). It seems to me that we are going to increase the test timeout in any case, because of the test_pageserver_chaos test

It fails due to statement timeout (it is set to 120s in python tests).
I am ok with increasing it for this test, but I think we should define the acceptable speed.
Or we can simply split this huge insert into two transactions.

lubennikovaav and others added 5 commits May 26, 2022 21:17
- zenith.page_server_connstring -> neon.pageserver_connstring
- zenith.zenith_tenant -> neon.tenant_id
- zenith.zenith_timeline -> neon.timeline_id
- zenith.max_cluster_size -> neon.max_cluster_size
as basebackup LSN always skips over page header
* Do not allocate shared memory for wal_redo process

* Add comment
@knizhnik
Author

knizhnik commented Jun 6, 2022

I really want to receive some feedback concerning the proposed algorithm.
Why do I think that it is better than Vegas for our needs?
The Vegas algorithm adjusts the TCP window based on round-trip time.
We do not know the round-trip time and do not have any analog of a TCP window.
Certainly we could try to emulate them somehow, but that requires more changes in walsender.

Also, these two algorithms address different problems: in TCP, sending more packets than the receiver can consume just causes lost packets and resends, which further increases network load and escalates the problem. So it is necessary to react fast.

In our case, a large lag just means a long wait time. That is certainly bad, but not fatal until the wait time exceeds the timeout (one minute). So our backpressure mechanism can react more smoothly, trying to keep the lag within the specified boundaries.

Also, the performance of the pageserver can greatly depend on the activity of other tenants and background tasks (such as compaction and checkpointing). It is not good to completely block a write transaction just because the pageserver is busy with other tasks at that moment. If we can estimate that the pageserver will be able to catch up within a reasonable period of time, then there is no need to completely suspend the writer. This can be done by measuring a moving average of the WAL producing/consuming speed. Choosing the right window size (for the moving average) may be a challenge: with a small window we will suffer from pageserver performance fluctuations caused by background activities, while too large a window doesn't allow us to react fast enough to a changed workload.
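To make the moving-average idea concrete, here is a minimal sketch of the kind of calculation meant here; the struct, helper name, history size, and caps are illustrative assumptions, not the code in this PR:

```c
/* Illustrative sketch of a moving-average backpressure delay; not the PR's code. */
#include "postgres.h"
#include "access/xlogdefs.h"
#include "utils/timestamp.h"

typedef struct BackpressureSample
{
    TimestampTz ts;             /* when the sample was taken */
    XLogRecPtr  written_lsn;    /* WAL produced by the compute node so far */
    XLogRecPtr  applied_lsn;    /* WAL acknowledged by the pageserver so far */
} BackpressureSample;

#define BP_HISTORY_SIZE 16
static BackpressureSample bp_history[BP_HISTORY_SIZE];
static int  bp_count = 0;       /* total samples recorded so far */

/* Returns a suggested sleep in microseconds; 0 means "do not throttle". */
static long
backpressure_suggested_delay(BackpressureSample cur, uint64 max_lag_bytes)
{
    uint64      lag = cur.written_lsn - cur.applied_lsn;
    BackpressureSample *oldest;
    double      elapsed_sec, rx_speed, excess;

    if (lag <= max_lag_bytes)
        return 0;               /* within the allowed window: no history, no delay */

    /* Ring buffer: history is only collected while the lag exceeds the threshold. */
    bp_history[bp_count % BP_HISTORY_SIZE] = cur;
    bp_count++;
    if (bp_count < 2)
        return 0;               /* need at least two samples to estimate speed */

    /* Oldest surviving sample: the slot about to be overwritten, or slot 0 early on. */
    oldest = &bp_history[(bp_count >= BP_HISTORY_SIZE) ? bp_count % BP_HISTORY_SIZE : 0];

    elapsed_sec = (cur.ts - oldest->ts) / 1000000.0;    /* TimestampTz is microseconds */
    if (elapsed_sec <= 0.0)
        return 0;

    /* Moving-average consumption speed of the pageserver over the history window. */
    rx_speed = (double) (cur.applied_lsn - oldest->applied_lsn) / elapsed_sec;
    if (rx_speed <= 0.0)
        return 100000;          /* consumer looks stalled: back off 100 ms and retry */

    /* Sleep roughly long enough for the consumer to work off the excess lag. */
    excess = (double) (lag - max_lag_bytes);
    return (long) Min(excess / rx_speed * 1000000.0, 1000000.0);   /* cap at 1 s */
}
```

Note that the history only grows while the lag exceeds the threshold, so in the common (unthrottled) case the function returns immediately.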

@aome510 aome510 mentioned this pull request Jun 6, 2022
@hlinnaka
Contributor

hlinnaka commented Jun 6, 2022

I haven't looked very closely at the algorithm you implemented in this PR. I'll do that next, but let me reply first:

I really want to receive some feedback concerning the proposed algorithm. Why do I think that it is better than Vegas for our needs? The Vegas algorithm adjusts the TCP window based on round-trip time. We do not know the round-trip time and do not have any analog of a TCP window. Certainly we could try to emulate them somehow, but that requires more changes in walsender.

We can measure the roundtrip time, same as with TCP implementations. That's pretty straightforward.

max_replication_write_lag is the analogue of TCP send window. In TCP, the send window is the max amount of data sent, but not yet acknowledged by the receiver as received. For us, max_replication_write_lag is the max amount of WAL generated, but not yet acknowledged by the page server as received. Same thing.

To put that the other way 'round, our current mechanism with the max_replication_write_lag setting is the same as running TCP with a constant send window size.

One difference is that we cannot directly stop sending WAL, when the window is full. We can only stop after the window has already been exceeded, and pause the backend so that it won't generate even more WAL exceeding the window even more. But I don't think that difference affects the overall algorithm and math much.

Also, these two algorithms address different problems: in TCP, sending more packets than the receiver can consume just causes lost packets and resends, which further increases network load and escalates the problem. So it is necessary to react fast.

In our case, a large lag just means a long wait time. That is certainly bad, but not fatal until the wait time exceeds the timeout (one minute). So our backpressure mechanism can react more smoothly, trying to keep the lag within the specified boundaries.

Yeah, if the send window is too large, we won't experience packet loss. We will just see increased latency. That doesn't change the overall algorithm and math though. We might want to tune the algorithm so that it settles on a somewhat larger send window size than you'd want with TCP, but the same concepts and reasoning still apply.

In fact, in the networking world, "bufferbloat" has been a hot topic in recent years. That's an analogous thing: if there are nodes in the middle of your network path that do buffering, you will start to see increased roundtrip times long before you see packet loss. And people try to avoid that.

Also, the performance of the pageserver can greatly depend on the activity of other tenants and background tasks (such as compaction and checkpointing). It is not good to completely block a write transaction just because the pageserver is busy with other tasks at that moment. If we can estimate that the pageserver will be able to catch up within a reasonable period of time, then there is no need to completely suspend the writer. This can be done by measuring a moving average of the WAL producing/consuming speed. Choosing the right window size (for the moving average) may be a challenge: with a small window we will suffer from pageserver performance fluctuations caused by background activities, while too large a window doesn't allow us to react fast enough to a changed workload.

Yeah, I'm not sure what exactly the right parameters for the algorithm here are. Any algorithm we choose will have to strike a balance between reacting quickly to changes in how fast the pageserver can process the WAL, and read latency.

Contributor

@hlinnaka hlinnaka left a comment


Ok, some comments on the "predictable" algorithm:

ProcessInterrupts is not a predictable timer interrupt. It's very unpredictable when it gets called, and how frequently. That depends a lot on which query, and which part of query execution, is currently running. As a result, the length of history you use to estimate the rx and tx speed will vary a lot. It could be 1 ms or it could be 100 s, and could swing between the extremes. It might work OK in practice, and I'm not sure what the optimal length of the history would be, but it sure seems weird.

Another weird thing is that the history is collected and the logic is executed separately in each backend, even though there is only one WAL stream. One backend might estimate that rx_speed is 100, while another backend might estimate that it's 1000, at the same instant. The second backend would feel free to generate WAL at speed 700, for example, while the first backend would feel that it needs to throttle. Assuming the real rx_speed is constant, both will converge on the correct value, but one might react much faster than the other.

The rx_speed this estimates is the rate at which the pageserver consumes WAL, not the rate it could consume it. For example, if postgres generates a slow stream of WAL at rate 1 kB / s, rx_speed will converge to 1 kB / s, even if the pageserver could process WAL at 100 MB / s. That leads to a slow start effect: if you then start a bulk operation that generates WAL at 100 MB / s, it will take a while for the estimated rx_speed to ramp up to 100 MB / s. Actually, is there anything to guarantee that it will ever ramp up? If the algorithm tries to maintain that tx_speed = rx_speed, and in the long run the average rx_speed cannot be higher than average tx_speed, and the rx_speed estimate is initially too small, how does it ever catch up to the real rx_speed? Or worse, if there's some error that causes rx_speed to be constantly estimated a little bit too low, the error will cumulate and rx_speed and tx_speed will converge to zero over time.

That said, who knows, this might work fine in practice, even if it's horrible in theory :-). We desperately need performance tests to test the behaviour in different scenarios (neondatabase/neon#1889).

@knizhnik
Author

knizhnik commented Jun 7, 2022

ProcessInterrupts is not a predictable timer interrupt.

I agree with you. Yes, interrupts in Postgres are checked in many different places, and it is hard to predict the exact interval between ProcessInterrupts invocations. But we can still estimate the bounds of this interval. CHECK_FOR_INTERRUPTS cannot be called too frequently, because otherwise it may affect performance, so nobody inserts it in a performance-critical loop. On the other hand, there should not be too long a delay between CHECK_FOR_INTERRUPTS calls, because then the backend cannot react to events fast enough. So we can assume that the interval is in the 1 ms..1 s range.

For flow control it is enough to estimate the average value of the interval (and this is done by my algorithm). By adding a delay in ProcessInterrupts we actually affect the interval before the next invocation of ProcessInterrupts - it cannot be smaller than the chosen delay.

So, yes - we are not able to predict the precise value of the interval, but we can estimate the delay based on the average interval value. I do not see a big problem here.
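As a concrete illustration of delaying inside ProcessInterrupts, a hedged sketch follows, reusing the backpressure_suggested_delay() helper sketched earlier in this thread; the other helper names are assumptions, not the PR's actual code:

```c
/*
 * Illustrative only: how a delay could hang off ProcessInterrupts().
 * backpressure_throttling_enabled(), collect_backpressure_sample() and
 * max_replication_write_lag_bytes() are assumed helpers.
 */
void
ProcessInterrupts(void)
{
    /* ... existing query-cancel / termination handling ... */

    if (backpressure_throttling_enabled())
    {
        BackpressureSample sample = collect_backpressure_sample();
        long delay_us = backpressure_suggested_delay(sample,
                                                     max_replication_write_lag_bytes());

        /*
         * Sleeping here also puts a lower bound on the interval until the next
         * ProcessInterrupts() call, which the interval-averaging relies on.
         */
        if (delay_us > 0)
            pg_usleep(delay_us);
    }
}
```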

Another weird thing is that the history is collected and the logic is executed separately in each backend

Yes, that is also true. But do you think we need to have some shared data structure and use a mutex to control access to it? That seems to be overkill. The size of this structure is not expected to be large and does not significantly increase the memory consumption of a backend. And the data collected in this history (LSNs) is actually global, so the average speeds calculated by different backends should be mostly the same.

Also, please notice that the history is filled only if the lag exceeds the backpressure threshold. So we statically allocate memory for this history, but it is only accessed when backpressure is actually needed. Do you think we should create this buffer dynamically on demand?

One more point: right now backpressure is mostly needed during bulk load, when data is populated by just one backend.

The rx_speed this estimates is the rate at which the pageserver consumes WAL, not the rate it could consume it.

And once again, you are right, and I realized it. But my assumption was that if the pageserver is not working at its maximal possible speed, then there should be no large lag between it and the backend, and backpressure will not be needed at all.
This algorithm starts to work when the pageserver cannot catch up and the lag exceeds the specified threshold.

Since the performance of the pageserver depends on so many factors, any algorithm we use will not be precise.
In any case we have to use some probabilistic algorithm.

So I just want to once again describe the situation with backpressure (from my point of view):

  1. Backpressure is absolutely needed. We can never guarantee that the pageserver will be fast enough to accept WAL faster than backends can produce it.
  2. A 500 MB max_replication_write_lag is not acceptable, and reducing it to 100 MB also doesn't help: in both cases we have ~1 minute delays for logins and selects. With my second related PR (Cache last written LSN for last updated relations to reduce wait LSN time for queries to other relations #167) such large login delays can be prevented. But if the application tries to access the populated table, it is still faced with minute-long delays.
  3. Reducing max_replication_write_lag to 10 MB (in this case delays are within a few seconds and so can be considered acceptable) slows down the writer about 2-3 times, which is quite noticeable. This is why I think we should try to provide a smarter backpressure algorithm than "stop and wait", especially taking into account that such a small max_replication_write_lag may cause deadlock or starvation, because the walsender will just not be notified while the backend is already blocked.

This is why my proposal is:

  1. Commit the PR with the last written LSN cache. It seems to be quite simple and safe, has no impact on performance, and prevents the most critical problem: long login delays.
  2. Think more about the flow control algorithm. I am not completely sure that the algorithm I have proposed is the best one, and I agree with your criticism. But it is in any case better (and safer) than what we have now.
  3. Reducing max_replication_write_lag to 10 MB with the current algorithm (stop and wait) is not desirable. We can choose some "compromise" value (100 MB), although it is still not enough to eliminate the problem. A 10 MB lag threshold can be used only with a smarter backpressure algorithm.

@knizhnik knizhnik force-pushed the backpressure_delay branch from 9e40353 to 5485b70 Compare June 7, 2022 20:35
@vadim2404 vadim2404 changed the base branch from main to REL_14_STABLE_neon November 10, 2022 10:18
@MMeent MMeent force-pushed the REL_14_STABLE_neon branch from 125d0bd to a9f5034 Compare February 23, 2023 10:14
@MMeent MMeent force-pushed the REL_14_STABLE_neon branch from a2daebc to 1144aee Compare May 16, 2023 11:37
@tristan957 tristan957 force-pushed the REL_14_STABLE_neon branch 2 times, most recently from b8e5379 to 21ec61d Compare May 20, 2024 14:48