Split the block cache into block pointer cache and block data cache #6037

Open · wants to merge 6 commits into base: master from filipe/chain-store-rework2

Conversation

@mangas (Contributor) commented May 28, 2025

Split the block cache into block pointer cache and block data cache

  • Introduce a new block_pointers table that keeps hash, number, parent_hash, and timestamp
  • Remove the number and parent_hash columns from the old block cache table
  • Cache truncation now removes all the block data but not the pointers (a rough schema sketch follows below).
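
For illustration only, here is a rough sketch of the schema shape this split implies, written in the style of the chain store's make_ddl helper that appears further down in the diff. Column types and the layout of the trimmed blocks table are assumptions, not the PR's actual DDL.

fn make_ddl_sketch(nsp: &str) -> String {
    format!(
        "
    CREATE TABLE {nsp}.block_pointers (
        hash        BYTEA  NOT NULL PRIMARY KEY,
        number      INT8   NOT NULL,
        parent_hash BYTEA,
        timestamp   INT8
    );
    -- the blocks table keeps only the hash and the (large) block data;
    -- truncating the cache clears blocks but leaves block_pointers intact
    CREATE TABLE {nsp}.blocks (
        hash BYTEA NOT NULL PRIMARY KEY,
        data JSONB NOT NULL
    );
"
    )
}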

@mangas mangas force-pushed the filipe/chain-store-rework2 branch 4 times, most recently from a3d1291 to 4d76568 Compare May 29, 2025 10:39
@mangas mangas force-pushed the filipe/chain-store-rework2 branch from 4d76568 to a2acdaa Compare May 29, 2025 10:47
@mangas mangas changed the title Filipe/chain store rework2 Filipe/chain store rework May 29, 2025
@mangas mangas marked this pull request as ready for review May 29, 2025 10:57
@mangas mangas changed the title Filipe/chain store rework Split the block cache into block pointer cache and block data cache May 29, 2025
@mangas mangas requested a review from lutter May 29, 2025 10:59
@lutter (Collaborator) left a comment:

Nice! This should enable a much better/logical block caching strategy

@@ -579,7 +579,7 @@ pub trait ChainStore: ChainHeadStore {
async fn block_number(
&self,
hash: &BlockHash,
- ) -> Result<Option<(String, BlockNumber, Option<u64>, Option<BlockHash>)>, StoreError>;
+ ) -> Result<Option<(String, BlockNumber, Option<BlockTime>, Option<BlockHash>)>, StoreError>;
@lutter (Collaborator):

Are all these Option still justified? I think they will all always be Some. It would also be nicer to have a struct for this. Maybe call it BlockPointer since it's one row from that table (and BlockPtr is then a small excerpt from that).

Also, this method should be renamed to block_pointer
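
A minimal sketch of what the suggested struct and renamed method could look like; the struct name comes from the comment above, while the field names and the meaning of the String in the current tuple (taken here to be the network name) are assumptions.

// BlockNumber, BlockTime, BlockHash and StoreError are the existing graph types.
pub struct BlockPointer {
    pub network: String,
    pub number: BlockNumber,
    pub timestamp: Option<BlockTime>,
    pub parent_hash: Option<BlockHash>,
}

// The trait method would then become something like:
// async fn block_pointer(&self, hash: &BlockHash)
//     -> Result<Option<BlockPointer>, StoreError>;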

@mangas (Contributor, Author):

There's not always a timestamp; on the shared storage model it can still be None.

The Option<BlockTime> is a little weird, but I kept it because there is a difference between Some(epoch time) and None. It's more idiomatic to have Option than to check BlockTime == BlockTime::NONE or MIN, which are in fact the same value (I didn't really get why).
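
To make the contrast concrete, here is a tiny self-contained illustration of the two styles being discussed, with u64 standing in for BlockTime and all names hypothetical.

// With Option, the "no timestamp" case is explicit at every call site.
fn describe(timestamp: Option<u64>) -> String {
    match timestamp {
        Some(ts) => format!("block produced at {ts}"),
        None => "no timestamp recorded for this block".to_string(),
    }
}

// With a sentinel, callers must remember the magic constant
// (a stand-in for BlockTime::NONE / BlockTime::MIN).
const NO_TIMESTAMP: u64 = 0;

fn describe_with_sentinel(timestamp: u64) -> String {
    if timestamp == NO_TIMESTAMP {
        "no timestamp recorded for this block".to_string()
    } else {
        format!("block produced at {timestamp}")
    }
}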

@@ -668,7 +668,7 @@ pub trait QueryStore: Send + Sync {
async fn block_number_with_timestamp_and_parent_hash(
&self,
block_hash: &BlockHash,
- ) -> Result<Option<(BlockNumber, Option<u64>, Option<BlockHash>)>, StoreError>;
+ ) -> Result<Option<(BlockNumber, Option<BlockTime>, Option<BlockHash>)>, StoreError>;
@lutter (Collaborator):

And this could also just be called block_pointer

@@ -0,0 +1,40 @@
DATABASE_TEST_VAR_NAME := "THEGRAPH_STORE_POSTGRES_DIESEL_URL"
DATABASE_URL := "postgresql://graph-node:let-me-in@localhost:5432/graph-node"

@lutter (Collaborator):

What's a justfile? This should be your local file, not something in the repo

@mangas (Contributor, Author):

This is similar to a Makefile; it's intentionally in the repo and provides shortcuts for common operations. You don't need to use it yourself, but it's useful to have for others.


# Requires test-deps to be running, see test-deps-up
it-test *ARGS:
just _run_in_bash cargo test --test integration_tests -- --nocapture {{ ARGS }}
@lutter (Collaborator):

These can be just aliases in ~/.cargo/config.toml. I have e.g.

[alias]
store = "test -p graph-store-postgres"
tst = "test --workspace --exclude graph-tests"
docs = "doc --workspace --document-private-items"
gm = "install --bin graphman --path node --locked"
gmt = "install --bin graphman --path node --locked --root /var/tmp/cargo"
rt = "test -p graph-tests --test runner_tests"
it = "test -p graph-tests --test integration_tests -- --nocapture"

@mangas (Contributor, Author):

And that's local; this works for everyone.

INSERT INTO {nsp}.version VALUES ({version}) ON CONFLICT DO NOTHING;
",
nsp = nsp,
version = Storage::CHAINS_SCHEMA_VERSION,
@lutter (Collaborator):

You don't need this version table and mechanism, and in a way it's a denormalization.

You can find out from information_schema.tables whether the block_pointers table exists and decide based on that whether the migration needs to be run. Since everything this migration does happens in one transaction, you can be sure that the changes to the blocks table also happened and don't need to check for that.
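
A hedged sketch of the kind of check being suggested, not the PR's code; the helper name is made up and only the query against information_schema matters.

// Returns SQL that is true when the migration still has to run for this
// chain's namespace, i.e. when block_pointers does not exist yet.
fn needs_block_pointers_migration_sql(nsp: &str) -> String {
    format!(
        "SELECT NOT EXISTS (
            SELECT 1
              FROM information_schema.tables
             WHERE table_schema = '{nsp}'
               AND table_name = 'block_pointers'
        ) AS needs_migration"
    )
}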

@mangas (Contributor, Author):

I thought about doing it this way, but it's entirely possible there will be other changes in the future. Having a version makes it easy to figure out the current version of the schema and implement the different changes sequentially; it's much simpler than trying to figure out each step through pg metadata.

@lutter (Collaborator):

The version table is completely unnecessary; if there are more changes in the future, they can also look at the information_schema to determine whether they have been applied or not. Plus, over time, people will forget what these version numbers mean. In any event, it would be good if the comment on this method actually explained what the migration is doing.

@mangas (Contributor, Author):

The argument was never that it is necessary; it is that it is simpler to use and understand (and portable too). But whatever, I'll change it to use the pg metadata tables...

@@ -53,6 +53,7 @@ lazy_static! {
/// The id of the sole publisher in the test data
static ref PUB1: IdVal = IdType::Bytes.parse("0xb1");
/// The chain we actually put into the chain store, blocks 0 to 3
// static ref CHAIN: Vec<FakeBlock> = vec![GENESIS_BLOCK.clone(), BLOCK_ONE.clone(), BLOCK_TWO.clone(), BLOCK_THREE.clone()];
@lutter (Collaborator):

Leftover from testing?

pub static ref BLOCK_SIX_NO_PARENT: FakeBlock = FakeBlock::make_no_parent(6, "6b834521bb753c132fdcf0e1034803ed9068e324112f8750ba93580b393a986b");
}

// Hash indicating 'no parent'
pub const NO_PARENT: &str = "0000000000000000000000000000000000000000000000000000000000000000";
/// The parts of an Ethereum block that are interesting for these tests:
/// the block number, hash, and the hash of the parent block
- #[derive(Clone, Debug, PartialEq)]
+ #[derive(Default, Clone, Debug, PartialEq)]
@lutter (Collaborator):

This doesn't need to be Default (and there's not really a sensible default for a block)

@mangas (Contributor, Author) commented Jun 2, 2025:

The default here allows you to use { number: x, ..Default::default() }; it's really just to make the tests a little less verbose, but it turns out I didn't actually use it 😆
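
For reference, the pattern being described is Rust's struct update syntax, which requires Default on the type; FakeBlock's fields beyond number and hash are not shown here.

// Only compiles when FakeBlock derives (or implements) Default;
// all unspecified fields fall back to their default values.
let block = FakeBlock {
    number: 7,
    ..Default::default()
};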

@@ -216,7 +216,7 @@ impl DataSource {
data_source::MappingTrigger::Offchain(trigger.clone()),
self.mapping.handler.clone(),
BlockPtr::new(Default::default(), self.creation_block.unwrap_or(0)),
- BlockTime::NONE,
+ BlockTime::MIN,
@lutter (Collaborator):

Why that change here?

@mangas (Contributor, Author):

From testing; I'll revert. It's the exact same value, not sure why either.

}
}

impl FromStr for BlockTime {
@lutter (Collaborator):

This impl is very unintuitive to me: parsing a string will try to interpret it as a hex/decimal number.
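
For readers following along, this is roughly the behaviour being questioned, reduced to a self-contained sketch: the string is read as hex when it has a 0x prefix and as decimal otherwise. The actual impl wraps this in FromStr for BlockTime; the constructor it uses is not shown here.

fn parse_block_time_secs(ts: &str) -> Result<u64, std::num::ParseIntError> {
    if let Some(hex) = ts.strip_prefix("0x") {
        // hexadecimal, e.g. "0x5f5e100"
        u64::from_str_radix(hex, 16)
    } else {
        // decimal, e.g. "1700000000"
        ts.parse::<u64>()
    }
}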

@mangas (Contributor, Author):

That's how it was used; I just moved the implementation somewhere that's easier to find. The previous function was try_parse_timestamp or something similar. If it's the naming, I can change it to a method?

@mangas (Contributor, Author):

renamed function

/// have a timestamp
pub const NONE: Self = Self(Timestamp::NONE);
// /// A timestamp from a long long time ago used to indicate that we don't
// /// have a timestamp
@lutter (Collaborator):

Seems like some extra comment signs snuck in

@mangas mangas force-pushed the filipe/chain-store-rework2 branch from 1e10c68 to 1f7a117 Compare June 2, 2025 11:51
fn make_ddl(nsp: &str) -> String {
format!(
"
CREATE TABLE IF NOT EXISTS {nsp}.block_pointers (
@lutter (Collaborator):

There's no need to make this idempotent. You run all this in one transaction, so either it all succeeds or none of it does. There's no way that this table gets created but other statements later on do not succeed.
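
A hedged sketch of the point being made, not the PR's code: when the DDL runs inside one transaction, plain CREATE TABLE is enough, because a failure anywhere rolls back everything, including the table creation.

use diesel::pg::PgConnection;
use diesel::prelude::*;

fn migrate(conn: &mut PgConnection, nsp: &str) -> Result<(), diesel::result::Error> {
    conn.transaction(|conn| {
        diesel::sql_query(format!(
            "CREATE TABLE {nsp}.block_pointers (hash BYTEA NOT NULL PRIMARY KEY)"
        ))
        .execute(conn)?;
        // ... the rest of the migration; if any statement fails,
        // the CREATE TABLE above is rolled back as well
        Ok(())
    })
}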


format!(
"
CREATE TABLE IF NOT EXISTS {nsp}.block_pointers (
hash BYTEA not null primary key,
@lutter (Collaborator):

Yes, you can't use the number as a pk; I was talking about a synthetic pk, like an auto-incrementing counter. But thinking about this more, what we want in the fullness of time, to avoid storing block hashes redundantly, is to move the data column to the block_pointers table. Really, the main point of this PR is to add a timestamp column to the blocks table without requiring a rewrite/truncation of that table. The PR is a good first step toward that, and we'll address the duplication by figuring out how to get the data into the block_pointers table at some point.

@mangas mangas requested a review from lutter June 3, 2025 13:11