Make it possible to configure backpressure delay #164
base: REL_14_STABLE_neon
Conversation
Make smgr API pluggable. Add smgr_hook that can be used to define custom smgrs. Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads f_smgr implementation using smgr_hook. Also add smgr_init_hook and smgr_shutdown_hook. And a lot of mechanical changes in smgr.c functions. This patch is proposed to community: https://commitfest.postgresql.org/33/3216/ Author: anastasia <lubennikovaav@gmail.com>
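To illustrate, here is a rough sketch of what the pluggable smgr surface could look like. The struct and hook names follow the commit message, but the exact signatures here are an assumption, not the actual patch:

typedef struct f_smgr
{
    void        (*smgr_open) (SMgrRelation reln);
    void        (*smgr_close) (SMgrRelation reln, ForkNumber forknum);
    void        (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
                              BlockNumber blocknum, char *buffer);
    void        (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
                               BlockNumber blocknum, char *buffer, bool skipFsync);
    BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
    /* ... the remaining md.c-style callbacks ... */
} f_smgr;

/* An extension installs its implementation from _PG_init(); smgropen()
 * then consults the hook instead of the removed smgrsw[] array. */
typedef const f_smgr *(*smgr_hook_type) (BackendId backend, RelFileNode rnode);
extern PGDLLIMPORT smgr_hook_type smgr_hook;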
Add contrib/zenith that handles interaction with the remote pagestore. To use it, add 'shared_preload_libraries = zenith' to postgresql.conf. It adds a protocol for network communications - see libpagestore.c; and implements the smgr API. It also adds several custom GUC variables:
- zenith.page_server_connstring
- zenith.callmemaybe_connstring
- zenith.zenith_timeline
- zenith.wal_redo
Authors: Stas Kelvich <stanconn@gmail.com> Konstantin Knizhnik <knizhnik@garret.ru> Heikki Linnakangas <heikki.linnakangas@iki.fi>
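A hypothetical postgresql.conf fragment using the GUCs listed above (connection strings and values are placeholders, not defaults from the patch):

shared_preload_libraries = 'zenith'
zenith.page_server_connstring = 'host=127.0.0.1 port=6400'    # placeholder
zenith.callmemaybe_connstring = 'host=127.0.0.1 port=6400'    # placeholder
zenith.zenith_timeline = '<timeline id>'
zenith.wal_redo = off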
Add WAL redo helper for zenith - an alternative postgres operation mode to replay WAL by pageserver request. To start postgres in wal-redo mode, run postgres with the --wal-redo option. It requires the zenith shared library and zenith.wal_redo. Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver. Authors: Konstantin Knizhnik <knizhnik@garret.ru> Heikki Linnakangas <heikki.linnakangas@iki.fi>
In the test_createdb test, we created a new database, and created a new branch after that. I was seeing the test fail with: PANIC: could not open critical system index 2662

The WAL contained records like this:

rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
rmgr: Database len (rec/tot): 42/ 42, tx: 540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
rmgr: Transaction len (rec/tot): 66/ 66, tx: 540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

The compute node had correctly replayed all the WAL up to the last record, and opened up. But when you tried to connect to the new database, the very first requests for the critical relations, like pg_class, were made with request LSN 0/01642990. That's the last record that's applicable to a particular block. Because the database CREATE record didn't bump up the "last written LSN", the getpage requests were made with too old an LSN.

I fixed this by adding a SetLastWrittenLSN() call to the redo of the database CREATE record. It probably wouldn't hurt to also throw in a call at the end of WAL replay, but let's see if we bump into more cases like this first.

This doesn't seem to be happening with page server as of 'main'; I was testing with a version where I had temporarily reverted all the recent changes to reconstruct control file, checkpoints, relmapper files etc. from the WAL records in the page server, so that the compute node was redoing all the WAL. I'm pretty sure we need this fix even with 'main', even though this test case wasn't failing there right now.
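A minimal sketch of the fix described above, assuming the hook point is PostgreSQL's existing dbase_redo() and the SetLastWrittenLSN() call named in the message (the actual diff may differ):

static void
dbase_redo(XLogReaderState *record)
{
    uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

    if (info == XLOG_DBASE_CREATE)
    {
        /* ... existing copydir-based replay of the CREATE record ... */

        /*
         * Bump the "last written LSN" so that the first getpage requests
         * against the new database are made with an LSN that covers this
         * record, not the last per-block record before it.
         */
        SetLastWrittenLSN(record->EndRecPtr);
    }
    /* ... other record types unchanged ... */
}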
Some operations in PostgreSQL are not WAL-logged at all (e.g. hint bits) or delay WAL-logging till the end of the operation (e.g. index build). So if such a page is evicted, we will lose the update. To fix it, we introduce a PD_WAL_LOGGED bit to track whether the page was WAL-logged. If the page is evicted before it has been WAL-logged, then the zenith smgr creates an FPI for it. Authors: Konstantin Knizhnik <knizhnik@garret.ru> anastasia <lubennikovaav@gmail.com>
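A sketch of the eviction-time check this implies; the function name is invented for illustration, and PD_WAL_LOGGED is the flag introduced by the patch:

static void
zenith_log_unlogged_page(SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blocknum, char *buffer)
{
    Page    page = (Page) buffer;

    if (!(((PageHeader) page)->pd_flags & PD_WAL_LOGGED))
    {
        /*
         * The page was modified without WAL (hint bits, deferred logging
         * during an index build, ...): emit a full-page image before it is
         * evicted, so the pageserver can reconstruct it later.
         */
        log_newpage(&reln->smgr_rnode.node, forknum, blocknum, page, false);
    }
}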
Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors Author: Konstantin Knizhnik <knizhnik@garret.ru>
Ignore unlogged table qualifier. Add respective changes to regression test outputs. Author: Konstantin Knizhnik <knizhnik@garret.ru>
Request relation size via smgr function, not just stat(filepath).
Author: Konstantin Knizhnik <knizhnik@garret.ru>
Author: Konstantin Knizhnik <knizhnik@garret.ru>
…mmon error. TODO: add a comment explaining why this is fine for zenith.
…d of WAL page header, then return it back to the page origin
…of WAL at compute node + Check for presence of replication slot
…t inside. WAL proposer (as a bgw without BGWORKER_BACKEND_DATABASE_CONNECTION) previously ignored SetLatch, so once caught up it got stuck inside WalProposerPoll indefinitely. Further, WaitEventSetWait had no timeout, so we also didn't try to reconnect if all connections were dead. Fix that. Also move the break on latch set to the end of the loop to attempt ReconnectWalKeepers even if the latch is constantly set. Per test_race_conditions (Python version now).
…kpoint from WAL + Check for presence of zenith.signal file to allow skip reading checkpoint record from WAL + Pass prev_record_ptr through zenith.signal file to postgres
This patch aims to make our bespoke WAL redo machinery more robust in the presence of untrusted (in other words, possibly malicious) inputs. Pageserver delegates complex WAL decoding duties to postgres, which means that the latter might fall victim to carefully designed malicious WAL records and start doing harmful things to the system. To prevent this, it has been decided to limit possible interactions with the outside world using the Secure Computing BPF mode. We use this mode to disable all syscalls not in the allowlist. Please refer to src/backend/postmaster/seccomp.c to learn more about the pros & cons of the current approach.
+ Fix some bugs in the seccomp bpf wrapper:
* Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals.
* Add a missing variant of the select() syscall (thx to @knizhnik).
* Write error messages to the fd that stderr is currently pointing to.
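For flavor, a minimal libseccomp sketch of the approach described above; this is not the actual seccomp.c, and the allowlist is truncated for illustration:

#include <seccomp.h>

static void
enter_seccomp_mode(void)
{
    /* Default action: trap (deliver a signal) rather than kill the process. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);

    /* Allow only the syscalls the WAL redo loop actually needs. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(select), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    /* ... rest of the allowlist ... */

    if (seccomp_load(ctx) != 0)
        ereport(FATAL, (errmsg("could not load seccomp BPF filter")));
    seccomp_release(ctx);
}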
…ause it causes a memory leak in wal-redo-postgres. 2. Add a check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR + Call smgr_init_standard from smgr_init_zenith
This patch adds support for the zenith_tenant variable. It has a similar format to zenith_timeline. It is used in the callmemaybe query to pass the tenant to the pageserver, and in the ServerInfo structure passed to the WAL acceptor.
…recovery. Rust's postgres_backend is currently too simplistic to handle it properly: reading happens in a separate thread which just ignores CopyDone. Instead, the writer thread must become aware of the termination and send CommandComplete. Also, the reading socket must be transferred back to postgres_backend (or the connection terminated completely after COPY). Let's do that after more basic safekeeper refactoring; for now, cover this up to make tests pass. ref #388
…ion position in wal_proposer to segment boundary
…ugging. Now it contains only one function test_consume_xids() for xid wraparound testing.
It is the debug build.
I double checked and it must have been some interfering process on my laptop. Results are around 600 seconds.
It fails due to statement timeout (it is set to 120 s in the Python tests).
- zenith.page_server_connstring -> neon.pageserver_connstring
- zenith.zenith_tenant -> neon.tenant_id
- zenith.zenith_timeline -> neon.timeline_id
- zenith.max_cluster_size -> neon.max_cluster_size
as basebackup LSN always skips over page header
* Do not allocate shared memory for wal_redo process
* Add comment
I really want to receive some feedback concerning the proposed algorithm. Also, these two algorithms are addressing different problems: in the TCP case, sending more packets than the receiver can consume just causes packet loss and resends, which increases network load even more and escalates the problem, so it is necessary to react fast. In our case a large lag just means a long wait time. It is certainly bad, but not fatal until the wait time exceeds the timeout (one minute). So our backpressure mechanism can react more smoothly, trying to keep the lag within the specified boundaries.

Also, the performance of the pageserver can depend greatly on the activity of other tenants and background tasks (such as compaction and checkpointing). It is not good to completely block write transactions just because at this moment the pageserver is busy with other tasks. If we can estimate that within some reasonable period of time the pageserver will be able to catch up, then there is no need to completely suspend the writer. This can be done by measuring a moving average of the WAL producing/consuming speed. Choosing the right window size (for the moving average) may be a challenge: with a small window we will suffer from pageserver performance fluctuations caused by background activities, while too large a window doesn't allow us to react fast enough to a changed workload.
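To make the idea concrete, here is an illustrative sketch (all names are invented for the example; this is not the patch itself) of estimating the consumer's speed over a small sliding window and turning excess lag into a bounded delay rather than a hard stop:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "utils/timestamp.h"

#define LAG_HISTORY_SIZE 8

typedef struct
{
    XLogRecPtr  write_lsn[LAG_HISTORY_SIZE];    /* WAL produced, per sample */
    XLogRecPtr  apply_lsn[LAG_HISTORY_SIZE];    /* WAL consumed by pageserver */
    TimestampTz sample_time[LAG_HISTORY_SIZE];
    int         pos;
} LagHistory;

/* Returns how long (in ms) the caller should sleep before generating more WAL. */
static long
backpressure_delay_ms(LagHistory *h, XLogRecPtr cur_write, XLogRecPtr cur_apply,
                      uint64 max_lag_bytes)
{
    int         oldest = (h->pos + 1) % LAG_HISTORY_SIZE;
    TimestampTz now = GetCurrentTimestamp();
    TimestampTz then = h->sample_time[oldest];
    uint64      lag = cur_write - cur_apply;
    double      secs;
    double      rx_speed;

    /* record the new sample */
    h->write_lsn[h->pos] = cur_write;
    h->apply_lsn[h->pos] = cur_apply;
    h->sample_time[h->pos] = now;
    h->pos = (h->pos + 1) % LAG_HISTORY_SIZE;

    if (then == 0 || now <= then)
        return 0;               /* history not warmed up yet */

    secs = (now - then) / 1e6;  /* TimestampTz is in microseconds */
    rx_speed = (cur_apply - h->apply_lsn[oldest]) / secs;   /* bytes/sec */

    if (lag <= max_lag_bytes || rx_speed <= 0)
        return 0;

    /*
     * Sleep roughly long enough for the pageserver to work off the excess
     * lag at its recent consumption rate, but never more than 1 s per check.
     */
    return Min(1000L, (long) (1000.0 * (lag - max_lag_bytes) / rx_speed));
}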
I haven't looked very closely at the algorithm you implemented in this PR. I'll do that next, but let me reply first:
We can measure the roundtrip time, same as with TCP implementations. That's pretty straightforward.
To put that the other way 'round, our current mechanism behaves much like a send window. One difference is that we cannot directly stop sending WAL when the window is full. We can only stop after the window has already been exceeded, and pause the backend so that it won't generate even more WAL, exceeding the window even more. But I don't think that difference affects the overall algorithm and math much.
Yeah, if the send window is too large, we won't experience packet loss. We will just see increased latency. That doesn't change the overall algorithm and math though. We might want to tune the algorithm so that it settles on a somewhat larger send window size than you'd want with TCP, but the same concepts and reasoning still apply. In fact, in the networking world, "bufferbloat" has been a hot topic in recent years. That's an analogous thing: if there are nodes in the middle of your network path that do buffering, you will start to see increased roundtrip times long before you see packet loss. And people try to avoid that.
Yeah, I'm not sure what exactly the right parameters for the algorithm here are. Any algorithm we choose will have to strike a balance between reacting quickly to changes in how fast the pageserver can process the WAL, and read latency.
Ok, some comments on the "predictable" algorithm:
ProcessInterrupts is not a predictable timer interrupt. It's very unpredictable when it gets called, and how frequently. That depends a lot on what query, and which part of query execution is currently running. As a result, the length of history you use to estimate the rx and tx speed will vary a lot. It could be 1 ms or it could be 100 s, and could swing between the extremes. It might work OK in practice, and I'm not sure what the optimal length of the history would be, but it sure seems weird.
Another weird thing is that the history is collected and the logic is executed separately in each backend, even though there is only one WAL stream. One backend might estimate that rx_speed is 100, while another backend might estimate that it's 1000, at the same instant. The second backend would feel free to generate WAL at speed 700, for example, while the first backend would feel that it needs to throttle. Assuming the real rx_speed is constant, both will converge on the correct value, but one might react much faster than the other.
The rx_speed this estimates is the rate at which the pageserver consumes WAL, not the rate it could consume it. For example, if postgres generates a slow stream of WAL at a rate of 1 kB/s, rx_speed will converge to 1 kB/s, even if the pageserver could process WAL at 100 MB/s. That leads to a slow-start effect: if you then start a bulk operation that generates WAL at 100 MB/s, it will take a while for the estimated rx_speed to ramp up to 100 MB/s. Actually, is there anything to guarantee that it will ever ramp up? If the algorithm tries to maintain that tx_speed = rx_speed, and in the long run the average rx_speed cannot be higher than the average tx_speed, and the rx_speed estimate is initially too small, how does it ever catch up to the real rx_speed? Or worse, if there's some error that causes rx_speed to be constantly estimated a little bit too low, the error will accumulate and rx_speed and tx_speed will converge to zero over time.
That said, who knows, this might work fine in practice, even if it's horrible in theory :-). We desperately need performance tests to test the behaviour in different scenarios (neondatabase/neon#1889).
I agree with you. Yes, interrupts in Postgres are checked in many different places and it is hard to predict the exact interval between ProcessInterrupts invocations. But we can still estimate the bounds of this interval. CHECK_FOR_INTERRUPTS cannot be called too frequently, because otherwise it may affect performance, so nobody inserts it in a performance-critical loop. On the other side, there should not be too long a delay between CHECK_FOR_INTERRUPTS calls, because then the backend cannot react to events fast enough. So we can assume the interval is in the 1 ms..1 s range. For flow control it is enough to estimate the average value of the interval (and this is what my algorithm does). By adding a delay in ProcessInterrupts we actually affect the interval until the next invocation of ProcessInterrupts: it cannot be smaller than the chosen delay. So yes, we are not able to predict the precise value of the interval, but we can estimate the delay based on the average interval value. I do not see a big problem here.
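As a rough sketch of what this looks like from ProcessInterrupts (all names here are hypothetical, reusing the backpressure_delay_ms sketch above; the apply-LSN getter and the lag GUC are assumptions), the delay itself becomes the lower bound on the sampling interval:

static LagHistory my_lag_history;   /* per-backend, statically allocated */

static void
backpressure_throttle(void)
{
    long    delay_ms = backpressure_delay_ms(&my_lag_history,
                                             GetXLogWriteRecPtr(),
                                             GetPageserverApplyLsn(),  /* hypothetical */
                                             max_replication_write_lag); /* assumed GUC */

    if (delay_ms > 0)
    {
        /*
         * Sleeping here also guarantees that the next ProcessInterrupts()
         * in this backend cannot run sooner than delay_ms from now.
         */
        pg_usleep(delay_ms * 1000L);
    }
}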
Yes, that is also true. But do you think we need to have some shared data structure and use a mutex to control access to it? It seems to be overkill. The size of this structure is not expected to be large, and it does not significantly increase the memory consumption of a backend. And the data collected in this history (LSNs) is actually global, so the average speeds calculated by different backends should be mostly the same. Also, please notice that the history is filled only if the lag exceeds the backpressure threshold. So we statically allocate memory for this history, but it is only accessed when backpressure is actually needed. Do you think we should create this buffer dynamically on demand? One more point: right now backpressure is mostly needed during bulk load, when data is populated by just one backend.
And once again, you are right, and I realized it. But my assumption was that if the pageserver is not working at its maximal possible speed, then there should be no large lag between it and the backend, and backpressure will not be needed at all. Since the performance of the pageserver depends on so many factors, any algorithm we use will not be precise. So I just want to once again describe the situation with backpressure (from my point of view):
This is why my proposal is: