Commit 54df3c7
authored
perf(cargo-package): match certain path prefix with pathspec (#14997)
### What does this PR try to resolve?
This revives #14962. See benchmark chart in
<#14962 (comment)>.
#14962 was closed because we found more bugs in `cargo package`, and
#14962 could potentially make them even harder to fix. Two of them have
been fixed so this is good to ship IMO with its own good.
---
An improvement #14955.
`check_repo_state` checks the entire git repo status. This is usually
fine if you have only a few packages in a workspace.
For huge monorepos, it may hit performance issues.
For example,
on awslabs/aws-sdk-rust@2cbd34d
the workspace has roughly 434 members to publish.
`git ls-files` reported us 204379 files in this Git repository. That
means git may need to check status of all files 434 times. That would be
`204379 * 434 = 88,700,486` checks!
Moreover, the current algorithm is finding the intersection of
`PathSource::list_files` and `git status`.
It is an `O(n^2)` check.
Let's assume files are evenly distributed into each package, so roughly
470 files per package.
If we're unlucky to have some dirty files, say 100 files. We will have
to do `470 * 100 = 47,000` times of path comparisons.
Even worse, because we `git status` everything in the repo, we'll have
to it for all members,
even when those dirty files are not part of the current package in
question. So it becomes `470 * 100 * 434 = 20,398,000`!
#### Solution
Instead of comparing with the status of the entire repository, this
patch use the magic pathspec[1] to tell git only reports paths that
match a certain path prefix.
This wouldn't help the `O(n^2)` algorithm,
but at least it won't check dirty files outside the current package.
Also, we don't `git status` against entire git worktree/index anymore.
[1]:
https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec
### How should we test and review this PR?
Run this command against
awslabs/aws-sdk-rust@2cbd34d,
and see if it is getting better.
```
CARGO_LOG_PROFILE=1 cargor package --no-verify --offline --allow-dirty -p aws-sdk-accessanalyzer -p aws-sdk-apigateway
```
I've verified checksums of `.crate` files generated from master
(d85d761) and this commit
(3dabdcd). They are the same.
### Additional information
There are some other alternatives, like making `PathSource::list_files`
additionally reports dirty files. While we already have rooms to do it,
this approach should be the most straightforward one at this moment.
Some other approaches like
* Switch to gitoxide (I tried and it didn't as good as expected. Maybe I
did something wrong).
* A flag `--no-vcs` to skip vcs at all
* Improve the `O(n^2)` algorithm1 file changed
+21
-4
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
3 | 3 | | |
| 4 | + | |
4 | 5 | | |
5 | 6 | | |
6 | 7 | | |
| |||
173 | 174 | | |
174 | 175 | | |
175 | 176 | | |
176 | | - | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
177 | 180 | | |
178 | 181 | | |
179 | 182 | | |
| |||
263 | 266 | | |
264 | 267 | | |
265 | 268 | | |
266 | | - | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
267 | 274 | | |
268 | 275 | | |
269 | 276 | | |
270 | 277 | | |
271 | | - | |
| 278 | + | |
| 279 | + | |
| 280 | + | |
272 | 281 | | |
273 | 282 | | |
274 | 283 | | |
| |||
300 | 309 | | |
301 | 310 | | |
302 | 311 | | |
303 | | - | |
| 312 | + | |
304 | 313 | | |
305 | 314 | | |
306 | 315 | | |
307 | 316 | | |
| 317 | + | |
| 318 | + | |
| 319 | + | |
| 320 | + | |
| 321 | + | |
| 322 | + | |
| 323 | + | |
| 324 | + | |
0 commit comments