
feat - database schema & enqueue/dequeue logic #2096

Open: Jamesbarford wants to merge 2 commits into master from queue/database-schema

Conversation

Jamesbarford
Contributor

Following on from the prototype, this PR contains only the database queries and the object mapping from the database.

  • Database logic for enqueue/dequeue/finish for queue management.
  • Four states: queued, in_progress, finished, failed, represented as an enum in the Rust code.
  • As per feedback, invalid structs cannot be created, i.e. you can't have a finished date on a queued commit (see the sketch below this list).
  • Postgres and a basic SQLite implementation.
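A minimal sketch of how such a state enum can make invalid combinations unrepresentable; the type and field names here are assumptions for illustration, not the exact definitions from this PR:

use chrono::{DateTime, Utc};

// Hypothetical sketch: each state carries only the fields that are valid
// for it, so a queued commit cannot have a finished date at all.
enum CommitJobState {
    Queued,
    InProgress { started_at: DateTime<Utc> },
    Finished { started_at: DateTime<Utc>, finished_at: DateTime<Utc> },
    Failed { started_at: DateTime<Utc>, finished_at: DateTime<Utc> },
}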

@Jamesbarford Jamesbarford force-pushed the queue/database-schema branch from b6137e5 to d1eb861 Compare May 1, 2025 15:21

@Kobzol left a comment

Thanks for splitting the PR. Apart from the comments left on the code, a few high-level comments:

  1. The database logic currently assumes that each commit/target combo will be benchmarked by exactly one machine ID. This is enough for supporting ARM collectors, but on its own it doesn't support the second big use-case that we would like to have: benchmarking a single commit/target combo by multiple machines in parallel, to reduce the latency of getting the results. I hope that it should be possible to use the same/similar DB table schema even for that, but I haven't yet thought it through. I will try to think about how we can shoehorn that onto the proposed DB schema.
  2. In a similar spirit, it's not clear how rebenchmarking will actually be performed after a potential failure. The code now enables the collector to benchmark a given commit multiple times, but the other tables (artifact_collection_duration, collection, collector_progress, pstat) are not really prepared for this. I'm not sure if we should 1) ditch retry logic for now, 2) use a separate DB schema for results from the new collector logic, or 3) use the old DB schema and somehow make it support retrying. I would say that 3) is the best; I'll try to think about how to do it.
  3. The logic is becoming nasty enough that I would really appreciate it if we had some tests for the DB logic. Sadly, we don't currently have any testing infrastructure prepared for that. I will try to bootstrap it by copying it from some other projects, so that we can write at least some basic sanity checks that run against an actual database.

@@ -178,6 +179,16 @@ pub trait Connection: Send + Sync {

/// Removes all data associated with the given artifact.
async fn purge_artifact(&self, aid: &ArtifactId);

/// Add a jobs to the queue
Contributor

Suggested change
/// Add a jobs to the queue
/// Add jobs to the commit queue

Btw, we might want to call it something other than "commit" (although it's not a bad name). According to our docs/glossary.md, a single act of "running all benchmarks on a single compiler artifact" should be called a "run", but in the database it's actually called a "collection" right now 😆 A bit messy. Maybe we could call it "run queue" to be consistent with the glossary? But I'm fine with keeping the "commit" name for now.

FROM commit_queue
WHERE target != $1
AND status IN ('finished', 'in_progress')
AND sha NOT IN (
Contributor

I don't understand this subquery. We select commits that are either finished or in progress, but the subquery seems to filter out commits that are finished?

Contributor Author

Agreed, I don't think we need the subquery 👍. I think

WHERE target != $1 AND status IN ('finished', 'in_progress')

is sufficient.

FROM commit_queue
WHERE target = $1
AND status = 'queued'
ORDER BY pr ASC, commit_type, sha
Contributor

We will probably want some CASE WHEN here to assign explicit priorities to the commit types, rather than depending on the string representation. The ordering should be:

  1. Releases first
  2. Master commits second, order by oldest PR ascending
  3. Try commits last, order by oldest PR ascending

Contributor Author

I've modified the query to have a simple CASE WHEN ranking system.
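For illustration, such a ranking could look roughly like the sketch below; the commit_type literals and the embedding as a Rust string constant are assumptions, not the exact SQL from this PR:

// Sketch only: rank commit types explicitly with CASE WHEN instead of
// relying on the string ordering of the commit_type column.
const DEQUEUE_ORDER_BY: &str = "
    ORDER BY
        CASE commit_type
            WHEN 'release' THEN 0
            WHEN 'master'  THEN 1
            WHEN 'try'     THEN 2
            ELSE 3
        END,
        pr ASC,
        sha
";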

@Jamesbarford
Contributor Author

Thanks for splitting the PR. Apart from the comments left on the code, a few high-level comments:

1. The database logic currently assumes that each commit/target combo will be benchmarked by exactly one machine ID. This is enough for supporting ARM collectors, but on its own it doesn't support the second big use-case that we would like to have: benchmarking a single commit/target combo by multiple machines in parallel, to reduce the latency of getting the results. I hope that it should be possible to use the same/similar DB table schema even for that, but I haven't yet thought it through. I will try to think about how we can shoehorn that onto the proposed DB schema.

2. In a similar spirit, it's not clear how rebenchmarking will actually be performed after a potential failure. The code now enables the collector to benchmark a given commit multiple times, but the other tables (`artifact_collection_duration`, `collection`, `collector_progress`, `pstat`) are not really prepared for this. I'm not sure if we should 1) ditch retry logic for now, 2) use a separate DB schema for results from the new collector logic, or 3) use the old DB schema and somehow make it support retrying. I would say that 3) is the best; I'll try to think about how to do it.

3. The logic is becoming nasty enough that I would really appreciate it if we had some tests for the DB logic. Sadly, we don't currently have any testing infrastructure prepared for that. I will try to bootstrap it by copying it from some other projects, so that we can write at least some basic sanity checks that run against an actual database.
  1. I see what you mean about having N machines pick up parts of a benchmarking job to make the benchmarking faster. A possible idea would be to have another table linked to what is presently, in this PR, called commit_queue, where that table holds all of the different parts that make up a job, and commit_queue could broadly stay the same. Below is a rather crude ASCII diagram of what that might look like:
+--------------+
| commit_queue | # Remove machine_id
+--------------+ 
|      id      | 
+--------------+

+----------------+
| parts_of_a_job | # Naming is overly descriptive, I don't think it should be called that 
+----------------+
| cq_id          | # commit_queue.id
| name           | # Name of the thing that is getting benchmarked
| machine_id     | # Id of the machine
| target         | # e.g aarch64-unknown-linux-gnu
+----------------+

Structure for 1 queued job:

                         +--------------+
                        /| Commit Queue |\
           +-----------+ +--------------+ +------------+
           |                     |                     |                            
     +----------+          +----------+          +----------+
     | Job Part |          | Job Part |          | Job Part |
     +----------+          +----------+          +----------+
     | cq_id: 1 |          | cq_id: 1 |          | cq_id: 1 |
     +----------+          +----------+          +----------+

With this idea the queue schema would stay similar to what we have now, however the logic to enqueue/dequeue would change. I don't think what we're doing now boxes us into a corner that's hard to back out of; it's pretty flexible, and we may still want a way of just looking at the queue. The above idea is only one possible way of doing it, but I hope it illustrates that what we're implementing here isn't incompatible with future plans.

  2. I could be missing something, however a machine failure here would mean it went offline, thus not affecting the other tables? It's only on success that it would do so? Though I'm not sure if it should be updating the other tables on failure. Also, presently, if the retries hit 3, nothing would happen; the job would merely be skipped.

  3. Yes, this would be extremely helpful. My way of "testing" this was just running raw SQL which, as you've spotted, meant that I made mistakes with some of the $<n> parameter binding. Would this be something like a Docker YAML with Postgres installed in it, so that it can be spun up/down for tests?

@Jamesbarford Jamesbarford force-pushed the queue/database-schema branch from d1eb861 to 0a27dc2 Compare May 6, 2025 11:41
@Kobzol
Contributor

Kobzol commented May 6, 2025

I could be missing something, however a machine failure here would mean it went offline, thus not affecting the other tables? It's only on success that it would do so? Though I'm not sure if it should be updating the other tables on failure. Also, presently, if the retries hit 3, nothing would happen; the job would merely be skipped.

What I meant is that currently, if you run the same job twice, the pstat table would just accumulate more results, thus it would contain results from multiple runs, which is not great. And the other tables (e.g. collection) are not prepared for multiple runs at all, the collector would simply skip the artifact if it executed it multiple times, IIRC. So while we can pre-emptively store the retries in the DB, I probably wouldn't yet actually make the DB implement any retry logic, because the rest of the codebase isn't really prepared for it.

Yes, this would be extremely helpful. My way of "testing" this was just running raw SQL which, as you've spotted, meant that I made mistakes with some of the $<n> parameter binding. Would this be something like a Docker YAML with Postgres installed in it, so that it can be spun up/down for tests?

Yeah, we would essentially spin up a separate DB for each executed test; we do it like this in a bunch of other bots/services.
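As an illustration only, a per-test setup along those lines could look roughly like the sketch below; the connection string, the use of tokio-postgres, and the helper name are assumptions rather than code from this repository:

use tokio_postgres::NoTls;

// Hypothetical helper: create a throwaway database for a single test so
// that tests can run against a real Postgres instance (e.g. one started
// via Docker) without interfering with each other.
async fn create_test_db(name: &str) -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres", NoTls).await?;
    // Drive the connection on a background task.
    tokio::spawn(connection);

    // CREATE DATABASE cannot take bound parameters, so the name is formatted
    // in directly; fine for trusted test code.
    client
        .batch_execute(&format!("CREATE DATABASE {name}"))
        .await?;
    Ok(())
}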

@Jamesbarford
Contributor Author

  1. For the retries, perhaps I'll simply remove them in this PR?
  2. I've had a look at getting that going; I'll write some tests for what I have written.
