
feat - database schema & enqueue/dequeue logic #2096

Open: Jamesbarford wants to merge 2 commits into master from queue/database-schema

Conversation

Jamesbarford
Contributor

Following on from the prototype, this PR contains only the database queries and the object mapping from the database.

  • Database logic for enqueue/dequeue/finish for queue management.
  • Four states: queued, in_progress, finished, failed, represented as an enum in the Rust code.
  • As per feedback, invalid structs cannot be created, i.e. you can't have a finished date on a queued commit (see the sketch below this list).
  • Postgres and a basic SQLite implementation.
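A minimal sketch of how such a state enum can make invalid combinations unrepresentable; the type and field names here are assumptions for illustration, not the exact definitions from this PR:

use chrono::{DateTime, Utc};

// Hypothetical sketch: each state carries only the fields that are valid
// for it, so a queued commit cannot have a finished date at all.
enum CommitJobState {
    Queued,
    InProgress { started_at: DateTime<Utc> },
    Finished { started_at: DateTime<Utc>, finished_at: DateTime<Utc> },
    Failed { started_at: DateTime<Utc>, finished_at: DateTime<Utc> },
}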

@Jamesbarford Jamesbarford force-pushed the queue/database-schema branch from b6137e5 to d1eb861 Compare May 1, 2025 15:21

@Kobzol left a comment

Thanks for splitting the PR. Apart from the comments left on the code, a few high-level comments:

  1. The database logic currently assumes that each commit/target combo will be benchmarked by exactly one machine ID. This is enough for supporting ARM collectors, but on its own it doesn't support the second big use-case that we would like to have: benchmarking a single commit/target combo by multiple machines in parallel, to reduce the latency of getting the results. I hope that it should be possible to use the same/similar DB table schema even for that, but I haven't yet thought it through. I will try to think about how we can shoehorn that onto the proposed DB schema.
  2. In a similar spirit, it's not clear how rebenchmarking will actually be performed after a potential failure. The code now enables the collector to benchmark a given commit multiple times, but the other tables (artifact_collection_duration, collection, collector_progress, pstat) are not really prepared for this. I'm not sure if we should 1) ditch retry logic for now, 2) use a separate DB schema for results from the new collector logic, or 3) use the old DB schema and somehow make it support retrying. I would say that 3) is the best; I'll try to think about how to do it.
  3. The logic is becoming nasty enough that I would really appreciate it if we had some tests for the DB logic. Sadly, we don't currently have any testing infrastructure prepared for that. I will try to bootstrap it by copying it from some other projects, so that we can write at least some basic sanity checks that run against an actual database.

@@ -178,6 +179,16 @@ pub trait Connection: Send + Sync {

/// Removes all data associated with the given artifact.
async fn purge_artifact(&self, aid: &ArtifactId);

/// Add a jobs to the queue
Contributor

Suggested change
/// Add a jobs to the queue
/// Add jobs to the commit queue

Btw, we might want to call it something other than "commit" (although it's not a bad name). According to our docs/glossary.md, a single act of "running all benchmarks on a single compiler artifact" should be called a "run", but in the database it's actually called a "collection" right now 😆 A bit messy. Maybe we could call it "run queue" to be consistent with the glossary? But I'm fine with keeping the "commit" name for now.

FROM commit_queue
WHERE target != $1
AND status IN ('finished', 'in_progress')
AND sha NOT IN (
Contributor

I don't understand this subquery. We select commits that are either finished or in progress, but the subquery seems to filter out commits that are finished?

Contributor Author

Agreed, I don't think we need the subquery 👍. I think

WHERE target != $1 AND status IN ('finished', 'in_progress')

is sufficient.

FROM commit_queue
WHERE target = $1
AND status = 'queued'
ORDER BY pr ASC, commit_type, sha
Contributor

We will probably want some CASE WHEN here to assign explicit priorities to the commit types, rather than depending on the string representation. The ordering should be:

  1. Releases first
  2. Master commits second, order by oldest PR ascending
  3. Try commits last, order by oldest PR ascending

Contributor Author

I've modified the query to have a simple CASE WHEN ranking system.
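For illustration, such a ranking could look roughly like the sketch below; the commit_type literals and the embedding as a Rust string constant are assumptions, not the exact SQL from this PR:

// Sketch only: rank commit types explicitly with CASE WHEN instead of
// relying on the string ordering of the commit_type column.
const DEQUEUE_ORDER_BY: &str = "
    ORDER BY
        CASE commit_type
            WHEN 'release' THEN 0
            WHEN 'master'  THEN 1
            WHEN 'try'     THEN 2
            ELSE 3
        END,
        pr ASC,
        sha
";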

@Jamesbarford
Contributor Author

Thanks for splitting the PR. Apart from the comments left on the code, a few high-level comments:

1. The database logic currently assumes that each commit/target combo will be benchmarked by exactly one machine ID. This is enough for supporting ARM collectors, but on its own it doesn't support the second big use-case that we would like to have: benchmarking a single commit/target combo by multiple machines in parallel, to reduce the latency of getting the results. I hope that it should be possible to use the same/similar DB table schema even for that, but I haven't yet thought it through. I will try to think about how we can shoehorn that onto the proposed DB schema.

2. In a similar spirit, it's not clear how rebenchmarking will actually be performed after a potential failure. The code now enables the collector to benchmark a given commit multiple times, but the other tables (`artifact_collection_duration`, `collection`, `collector_progress`, `pstat`) are not really prepared for this. I'm not sure if we should 1) ditch retry logic for now, 2) use a separate DB schema for results from the new collector logic, or 3) use the old DB schema and somehow make it support retrying. I would say that 3) is the best; I'll try to think about how to do it.

3. The logic is becoming nasty enough that I would really appreciate it if we had some tests for the DB logic. Sadly, we don't currently have any testing infrastructure prepared for that. I will try to bootstrap it by copying it from some other projects, so that we can write at least some basic sanity checks that run against an actual database.
  1. I see what you mean about having N machines pick up parts of a benchmarking job to make the benchmarking faster. A possible idea would be to have another table linked to what is presently, in this PR, called commit_queue, where that table holds all of the different parts that make up a job, and commit_queue could broadly stay the same. Below is a rather crude ASCII diagram of what that might look like:
+--------------+
| commit_queue | # Remove machine_id
+--------------+ 
|      id      | 
+--------------+

+----------------+
| parts_of_a_job | # Naming is overly descriptive, I don't think it should be called that 
+----------------+
| cq_id          | # commit_queue.id
| name           | # Name of the thing that is getting benchmarked
| machine_id     | # Id of the machine
| target         | # e.g aarch64-unknown-linux-gnu
+----------------+

Structure for 1 queued job:

                         +--------------+
                        /| Commit Queue |\
           +-----------+ +--------------+ +------------+
           |                     |                     |                            
     +----------+          +----------+          +----------+
     | Job Part |          | Job Part |          | Job Part |
     +----------+          +----------+          +----------+
     | cq_id: 1 |          | cq_id: 1 |          | cq_id: 1 |
     +----------+          +----------+          +----------+

With this idea the queue schema would stay similar to what we have now, however the logic to enqueue/dequeue would change. I don't think what we're doing now boxes us into a corner that's hard to back out of; it's pretty flexible, and we may still want a way of just looking at the queue. The above idea is only one possible way of doing it, but I hope it illustrates that what we're implementing here isn't incompatible with future plans.

  2. I could be missing something, however a machine failure here would mean it went offline, thus not affecting the other tables? It's only on success that it would do so? Though I'm not sure if it should be updating the other tables on failure. Also, presently, if the retries hit 3, nothing would happen; the job would merely be skipped.

  3. Yes, this would be extremely helpful. My way of "testing" this was just running raw SQL which, as you've spotted, meant that I made mistakes with some of the $<n> parameter binding. Would this be something like a Docker YAML with Postgres installed in it, so that it can be spun up/down for tests?

@Jamesbarford Jamesbarford force-pushed the queue/database-schema branch from d1eb861 to 0a27dc2 Compare May 6, 2025 11:41
@Kobzol
Contributor

Kobzol commented May 6, 2025

I could be missing something, however a machine failure here would mean it went offline, thus not affecting the other tables? It's only on success that it would do so? Though I'm not sure if it should be updating the other tables on failure. Also, presently, if the retries hit 3, nothing would happen; the job would merely be skipped.

What I meant is that currently, if you run the same job twice, the pstat table would just accumulate more results, thus it would contain results from multiple runs, which is not great. And the other tables (e.g. collection) are not prepared for multiple runs at all, the collector would simply skip the artifact if it executed it multiple times, IIRC. So while we can pre-emptively store the retries in the DB, I probably wouldn't yet actually make the DB implement any retry logic, because the rest of the codebase isn't really prepared for it.

Yes, this would be extremely helpful. My way of "testing" this was just running raw SQL which, as you've spotted, meant that I made mistakes with some of the $<n> parameter binding. Would this be something like a Docker YAML with Postgres installed in it, so that it can be spun up/down for tests?

Yeah, we would essentially spin up a separate DB for each executed test; we do it like this in a bunch of other bots/services.
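As an illustration only, a per-test setup along those lines could look roughly like the sketch below; the connection string, the use of tokio-postgres, and the helper name are assumptions rather than code from this repository:

use tokio_postgres::NoTls;

// Hypothetical helper: create a throwaway database for a single test so
// that tests can run against a real Postgres instance (e.g. one started
// via Docker) without interfering with each other.
async fn create_test_db(name: &str) -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres", NoTls).await?;
    // Drive the connection on a background task.
    tokio::spawn(connection);

    // CREATE DATABASE cannot take bound parameters, so the name is formatted
    // in directly; fine for trusted test code.
    client
        .batch_execute(&format!("CREATE DATABASE {name}"))
        .await?;
    Ok(())
}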

@Jamesbarford
Contributor Author

  1. For the retries, perhaps I'll simply remove them in this PR?
  2. I've had a look at getting that going; I'll write some tests for what I have written.
