Cache last written LSN for last updated relations to reduce wait LSN time for queries to other relations #167
Conversation
Make smgr API pluggable. Add smgr_hook that can be used to define custom smgrs. Remove the smgrsw[] array and the smgr_sw selector; instead, smgropen() loads the f_smgr implementation using smgr_hook. Also add smgr_init_hook and smgr_shutdown_hook, along with a lot of mechanical changes to smgr.c functions. This patch has been proposed to the community: https://commitfest.postgresql.org/33/3216/ Author: anastasia <lubennikovaav@gmail.com>
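For illustration, a custom smgr under this hook scheme could be installed roughly as below. This is a minimal sketch: the f_smgr field names follow PostgreSQL's smgr.c, but the exact smgr_hook signature is an assumption, and the zenith_* callbacks are stubs standing in for a real implementation.

```c
#include "storage/smgr.h"

/* Forward declarations of the custom implementation (bodies elided). */
static void zenith_open(SMgrRelation reln);
static void zenith_read(SMgrRelation reln, ForkNumber forknum,
						BlockNumber blocknum, char *buffer);
static BlockNumber zenith_nblocks(SMgrRelation reln, ForkNumber forknum);

/* Implementation table; remaining callbacks elided for brevity. */
static const f_smgr zenith_smgr = {
	.smgr_open = zenith_open,
	.smgr_read = zenith_read,
	.smgr_nblocks = zenith_nblocks,
	/* ... */
};

/* smgropen() consults this hook to choose an f_smgr implementation. */
static const f_smgr *
zenith_smgr_hook(BackendId backend, RelFileNode rnode)
{
	return &zenith_smgr;
}

void
_PG_init(void)
{
	smgr_hook = zenith_smgr_hook;
}
```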
Add contrib/zenith that handles interaction with the remote pagestore. To use it, add 'shared_preload_libraries = zenith' to postgresql.conf. It adds a protocol for network communications (see libpagestore.c) and implements the smgr API. It also adds several custom GUC variables: - zenith.page_server_connstring - zenith.callmemaybe_connstring - zenith.zenith_timeline - zenith.wal_redo Authors: Stas Kelvich <stanconn@gmail.com> Konstantin Knizhnik <knizhnik@garret.ru> Heikki Linnakangas <heikki.linnakangas@iki.fi>
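Custom GUCs like these are typically registered from _PG_init() with PostgreSQL's standard API. A sketch under that assumption (not the actual libpagestore.c code; descriptions are invented):

```c
#include "postgres.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static char *page_server_connstring;
static char *zenith_timeline;
static bool  wal_redo = false;

void
_PG_init(void)
{
	DefineCustomStringVariable("zenith.page_server_connstring",
							   "Connection string for the remote pagestore",
							   NULL, &page_server_connstring, "",
							   PGC_POSTMASTER, 0, NULL, NULL, NULL);

	DefineCustomStringVariable("zenith.zenith_timeline",
							   "Timeline served by this compute node",
							   NULL, &zenith_timeline, "",
							   PGC_POSTMASTER, 0, NULL, NULL, NULL);

	/* ... likewise for zenith.callmemaybe_connstring ... */

	DefineCustomBoolVariable("zenith.wal_redo",
							 "Start in WAL redo mode",
							 NULL, &wal_redo, false,
							 PGC_POSTMASTER, 0, NULL, NULL, NULL);
}
```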
Add a WAL redo helper for zenith: an alternative postgres operation mode that replays WAL at the pageserver's request. To start postgres in WAL redo mode, run postgres with the --wal-redo option. It requires the zenith shared library and zenith.wal_redo. Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Save lastWrittenPageLSN in XLogCtlData to know what pages to request from the remote pageserver. Authors: Konstantin Knizhnik <knizhnik@garret.ru> Heikki Linnakangas <heikki.linnakangas@iki.fi>
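The accessor names appear in later diffs in this PR (GetLastWrittenPageLSN / SetLastWrittenPageLSN). A minimal sketch of how they could work inside xlog.c; reusing XLogCtl->info_lck for synchronization is an assumption:

```c
/* In XLogCtlData (shared memory): XLogRecPtr lastWrittenPageLSN; */

XLogRecPtr
GetLastWrittenPageLSN(void)
{
	XLogRecPtr	lsn;

	SpinLockAcquire(&XLogCtl->info_lck);
	lsn = XLogCtl->lastWrittenPageLSN;
	SpinLockRelease(&XLogCtl->info_lck);
	return lsn;
}

void
SetLastWrittenPageLSN(XLogRecPtr lsn)
{
	SpinLockAcquire(&XLogCtl->info_lck);
	/* Only move forward, never backwards. */
	if (lsn > XLogCtl->lastWrittenPageLSN)
		XLogCtl->lastWrittenPageLSN = lsn;
	SpinLockRelease(&XLogCtl->info_lck);
}
```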
In the test_createdb test, we created a new database, and created a new branch after that. I was seeing the test fail with:

PANIC: could not open critical system index 2662

The WAL contained records like this:

```
rmgr: XLOG        len (rec/tot):  49/ 8241, tx: 0,   lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
rmgr: XLOG        len (rec/tot):  49/ 8241, tx: 0,   lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
rmgr: Standby     len (rec/tot):  54/   54, tx: 0,   lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
rmgr: XLOG        len (rec/tot): 114/  114, tx: 0,   lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
rmgr: Database    len (rec/tot):  42/   42, tx: 540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
rmgr: Standby     len (rec/tot):  54/   54, tx: 0,   lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
rmgr: XLOG        len (rec/tot): 114/  114, tx: 0,   lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
rmgr: Transaction len (rec/tot):  66/   66, tx: 540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
rmgr: XLOG        len (rec/tot): 114/  114, tx: 0,   lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
```

The compute node had correctly replayed all the WAL up to the last record, and opened up. But when you tried to connect to the new database, the very first requests for the critical relations, like pg_class, were made with request LSN 0/01642990. That's the last record that's applicable to a particular block. Because the database CREATE record didn't bump up the "last written LSN", the getpage requests were made with too old an LSN.

I fixed this by adding a SetLastWrittenLSN() call to the redo of the database CREATE record. It probably wouldn't hurt to also throw in a call at the end of WAL replay, but let's see if we bump into more cases like this first.

This doesn't seem to be happening with page server as of 'main'; I was testing with a version where I had temporarily reverted all the recent changes to reconstruct the control file, checkpoints, relmapper files etc. from the WAL records in the page server, so that the compute node was redoing all the WAL. I'm pretty sure we need this fix even with 'main', even though this test case wasn't failing there right now.
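A sketch of where the described fix would land, assuming the call is placed in dbase_redo() right after the directory copy (the exact placement and the use of EndRecPtr are assumptions):

```c
/* In dbase_redo(), src/backend/commands/dbcommands.c (sketch) */
if (info == XLOG_DBASE_CREATE)
{
	/* ... existing logic copying the template database directory ... */

	/*
	 * Advance the last-written LSN past this record, so that the first
	 * getpage requests for the new database's relations do not use an
	 * LSN from before the database existed.
	 */
	SetLastWrittenLSN(record->EndRecPtr);
}
```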
Some operations in PostgreSQL are not WAL-logged at all (e.g. hint bits) or delay WAL-logging till the end of the operation (e.g. index build). So if such a page is evicted, we will lose the update. To fix this, we introduce a PD_WAL_LOGGED bit to track whether the page was WAL-logged. If the page is evicted before it has been WAL-logged, then the zenith smgr creates an FPI for it. Authors: Konstantin Knizhnik <knizhnik@garret.ru> anastasia <lubennikovaav@gmail.com>
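A sketch of the eviction-time check this implies, assuming it runs when the zenith smgr writes out a page; PD_WAL_LOGGED and log_newpage() come from the commit message and PostgreSQL respectively, the function name and placement are hypothetical:

```c
#include "access/xloginsert.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

static void
zenith_wallog_page_if_needed(SMgrRelation reln, ForkNumber forknum,
							 BlockNumber blkno, char *buffer)
{
	Page		page = (Page) buffer;

	if ((((PageHeader) page)->pd_flags & PD_WAL_LOGGED) == 0)
	{
		/*
		 * The page was dirtied without ever being WAL-logged (hint bits,
		 * an in-progress index build, ...): emit a full-page image so
		 * the pageserver can reconstruct it.
		 */
		XLogRecPtr	lsn = log_newpage(&reln->smgr_rnode.node, forknum,
									  blkno, page, false);

		PageSetLSN(page, lsn);
	}
}
```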
Add WalProposer background worker to broadcast the WAL stream to Zenith WAL acceptors. Author: Konstantin Knizhnik <knizhnik@garret.ru>
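Registering such a worker uses PostgreSQL's standard background-worker API. A sketch (entry-point name, restart time, and start time are assumptions); note the flags deliberately omit BGWORKER_BACKEND_DATABASE_CONNECTION, which matters for the latch issue fixed later in this series:

```c
#include "postmaster/bgworker.h"

static void
register_walproposer(void)
{
	BackgroundWorker bgw;

	memset(&bgw, 0, sizeof(bgw));
	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS;	/* no database connection */
	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
	bgw.bgw_restart_time = 5;
	snprintf(bgw.bgw_name, BGW_MAXLEN, "WAL proposer");
	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "zenith");
	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "WalProposerMain");

	RegisterBackgroundWorker(&bgw);
}
```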
Ignore unlogged table qualifier. Add respective changes to regression test outputs. Author: Konstantin Knizhnik <knizhnik@garret.ru>
Request relation size via smgr function, not just stat(filepath).
Author: Konstantin Knizhnik <knizhnik@garret.ru>
…mmon error. TODO: add a comment, why this is fine for zenith.
…d of WAL page header, then return it back to the page origin
…of WAL at compute node + Check for presence of replication slot
…t inside. WAL proposer (as a bgw without BGWORKER_BACKEND_DATABASE_CONNECTION) previously ignored SetLatch, so once caught up it got stuck inside WalProposerPoll infinitely. Further, WaitEventSetWait didn't have a timeout, so we also didn't try to reconnect when all connections were dead. Fix that. Also move the break on latch set to the end of the loop, to attempt ReconnectWalKeepers even if the latch is constantly set. Per test_race_conditions (Python version now).
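A sketch of the resulting event-loop shape (the timeout value and wait-event constant are assumptions, and HandleSafekeeperEvent is a hypothetical stand-in; ReconnectWalKeepers is the function named above):

```c
for (;;)
{
	WaitEvent	event = {0};
	int			n;

	/* Finite timeout so dead safekeeper connections get retried. */
	n = WaitEventSetWait(waitset, 1000 /* ms */, &event, 1,
						 PG_WAIT_EXTENSION);

	if (n > 0 && (event.events & WL_SOCKET_READABLE))
		HandleSafekeeperEvent(&event);	/* hypothetical handler */

	/* Always attempt reconnects before honoring the latch ... */
	ReconnectWalKeepers();

	/* ... so that a constantly-set latch cannot starve them. */
	if (n > 0 && (event.events & WL_LATCH_SET))
	{
		ResetLatch(MyLatch);
		break;
	}
}
```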
…kpoint from WAL + Check for presence of the zenith.signal file to allow skipping reading the checkpoint record from WAL + Pass prev_record_ptr through the zenith.signal file to postgres
This patch aims to make our bespoke WAL redo machinery more robust in the presence of untrusted (in other words, possibly malicious) inputs. Pageserver delegates complex WAL decoding duties to postgres, which means that the latter might fall victim to carefully designed malicious WAL records and start doing harmful things to the system. To prevent this, it has been decided to limit possible interactions with the outside world using the Secure Computing BPF mode. We use this mode to disable all syscalls not in the allowlist. Please refer to src/backend/postmaster/seccomp.c to learn more about the pros & cons of the current approach. + Fix some bugs in the seccomp bpf wrapper: * Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals. * Add a missing variant of the select() syscall (thx to @knizhnik). * Write error messages to the fd stderr is currently pointing to.
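For context, a seccomp BPF allowlist via libseccomp works roughly like this. This is a toy sketch: the real allowlist lives in src/backend/postmaster/seccomp.c and is much longer, and the syscall subset below is illustrative only.

```c
#include <seccomp.h>
#include <stdlib.h>

static void
enter_seccomp_mode(void)
{
	/*
	 * SCMP_ACT_TRAP delivers SIGSYS instead of killing the process,
	 * so a violation can be caught and reported.
	 */
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);

	if (ctx == NULL)
		exit(1);

	/* Allow only what the WAL redo loop needs (toy subset). */
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(select), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

	if (seccomp_load(ctx) != 0)
		exit(1);
	seccomp_release(ctx);
}
```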
…ause it causes a memory leak in wal-redo-postgres 2. Add a check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR + Call smgr_init_standard from smgr_init_zenith
This patch adds support for the zenith_tenant variable. It has a similar format to zenith_timeline. It is used in the callmemaybe query to pass the tenant to the pageserver, and in the ServerInfo structure passed to the WAL acceptor.
…recovery. Rust's postgres_backend is currently too simplistic to handle it properly: reading happens in a separate thread which just ignores CopyDone. Instead, the writer thread must become aware of termination and send CommandComplete. Also, the reading socket must be transferred back to postgres_backend (or the connection terminated completely after COPY). Let's do that after more basic safekeeper refactoring; for now, cover this up to make tests pass. ref #388
…ion position in wal_proposer to segment boundary
…ugging. Now it contains only one function test_consume_xids() for xid wraparound testing.
I think we must also use … UPD: pushed commit with fix
Compare: 28a64eb to c9d8ec7
src/backend/access/transam/xlog.c (Outdated)

```diff
@@ -607,6 +610,11 @@ typedef struct XLogCtlInsert
 	WALInsertLockPadded *WALInsertLocks;
 } XLogCtlInsert;
 
+typedef struct RnodeForkKey {
+	Oid		rnode;
```
Let's add a comment about why it is safe to use only rnode+forknum as a key.
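Something along these lines might satisfy the request; the rationale in the comment is a plausible one, not text from the patch, and the key layout beyond "Oid rnode" is assumed from context:

```c
typedef struct RnodeForkKey
{
	/*
	 * Keyed by relNode + forkNum only (no spcNode/dbNode).  A collision
	 * between relations in different tablespaces or databases is
	 * harmless: the cached last-written LSN can only come out too high,
	 * which makes the subsequent wait more conservative but never
	 * yields a wrong page.
	 */
	Oid			rnode;		/* RelFileNode.relNode */
	ForkNumber	forknum;
} RnodeForkKey;
```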
contrib/neon/pagestore_smgr.c (Outdated)

```diff
@@ -1345,7 +1345,7 @@ zenith_dbsize(Oid dbNode)
 	XLogRecPtr	request_lsn;
 	bool		latest;
 
-	request_lsn = zenith_get_request_lsn(&latest);
+	request_lsn = zenith_get_request_lsn(&latest, NULL, InvalidForkNumber);
```
zenith_get_request_lsn is not ready to accept a NULL argument: it calls GetLastWrittenPageLSN(rnode->relNode…
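A hypothetical guard for callers that have no relation at hand, such as zenith_dbsize() above (the fallback function name is invented for illustration; the extended GetLastWrittenPageLSN signature is assumed from this review thread):

```c
static XLogRecPtr
zenith_get_request_lsn(bool *latest, RelFileNode *rnode, ForkNumber forknum)
{
	XLogRecPtr	last_written;

	if (rnode != NULL && forknum != InvalidForkNumber)
		last_written = GetLastWrittenPageLSN(rnode->relNode, forknum);
	else
		last_written = GetLastWrittenPageLSNAnyRelation();	/* invented fallback */

	/* ... existing logic deriving the request LSN from last_written ... */
	return last_written;
}
```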
src/backend/access/transam/xlog.c (Outdated)

```diff
@@ -8828,11 +8872,41 @@ GetLastWrittenPageLSN(void)
  * SetLastWrittenPageLSN -- Set maximal LSN of written page
```
Please expand this comment to mention caching.
Sorry, this fix is not correct. Let me investigate why.
I'm not sure I understand the problem. Does my "fix" just add an unnecessary part to the key, or is it incompatible in some other way?
A relation is identified using four components: spcnode, dbnode, relnode, forknum. The cause of the problem is that we sometimes pass lsn=0 in SetLastWrittenLsn.
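For reference, these components correspond to PostgreSQL's RelFileNode plus the fork number:

```c
/* From src/include/storage/relfilenode.h (PostgreSQL 14) */
typedef struct RelFileNode
{
	Oid			spcNode;	/* tablespace */
	Oid			dbNode;		/* database */
	Oid			relNode;	/* relation */
} RelFileNode;

/* ... plus ForkNumber (main, fsm, vm, init) to pin down a specific fork. */
```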
@hlinnaka's idea is similar to what Socrates does to estimate the latest LSN for a page. Quote from section 4.4:
It's true that this approach may not work well with bulk loading if using a small bucket size for the hashmap. We can probably tweak the hash map to key by … IMHO, for the bulk-loading task, using …
My main concern is the synchronization needed to access this shared cache. Also, I think that using PageId as the key is not such a good idea: in case of random updates (like in pgbench), there will be a lot of cache collisions (i.e. all buckets will have recent LSNs). Maybe it is better to split a relation into larger chunks (i.e. 1MB or even more) and maintain the latest LSN for each chunk? It also will not help much in case of random updates, but it will be more efficient when appending data to multiple relations. @hlinnaka - what do you think? Should I try to implement such more sophisticated caching of the most recent LSNs?
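A sketch of the chunked variant proposed here (all names and the chunk size are illustrative; the follow-up PR #177 mentioned below used 8MB chunks):

```c
/* One last-written LSN per fixed-size chunk of a relation fork. */
#define LAST_WRITTEN_LSN_CHUNK_PAGES	128		/* 1MB of 8KB pages */

typedef struct LastWrittenLsnKey
{
	Oid			relNode;
	ForkNumber	forknum;
	BlockNumber	chunkno;	/* blkno / LAST_WRITTEN_LSN_CHUNK_PAGES */
} LastWrittenLsnKey;

/*
 * Appends to a relation touch one (or a few) chunks, so bulk loads stop
 * polluting the whole cache; random updates still spread across chunks.
 */
static inline BlockNumber
last_written_lsn_chunkno(BlockNumber blkno)
{
	return blkno / LAST_WRITTEN_LSN_CHUNK_PAGES;
}
```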
I think that's good enough. The synchronization overhead is tiny compared to sending a request over the network.
Yeah, I think that's worth exploring. If we have a more fine-grained cache with page-ids or chunks, can we use it for smgrnblocks() and smgrexists() calls too? We also have a relation size cache, so maybe
The small per-relation cache certainly helps, but it's very easy to overflow it...
So I tried to run this patch with the new WAL backpressure tests.
Results:
latest:
this patch + @hlinnaka's #175 (https://app.circleci.com/pipelines/github/neondatabase/neon/7286/workflows/9cba7367-a9eb-4a26-82bc-844e0bb7e9e5/jobs/74656):
Differences (baseline is this patch + Heikki's patch):
Notes
Co-authored-by: Thang Pham <phamducthang1234@gmail.com>
Please notice that this LSN cache works not only when the relation is present in the cache (cache hit).
This LSN cache is used not only in
I will try.
Not sure that the same cache can be used both for caching relation size and keeping the last written LSN.
I have created another PR #177 with the last written LSN maintained for each relation chunk (8MB).
Replaced with #191
See
neondatabase/neon#1763
neondatabase/neon#1793