Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make lockfile version parsing more robust #20140

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Nov 10, 2023

Instead of using a regex, still parse the lockfile as JSON to extract the version instead of using a regex. This also makes it unnecessary to read the full lockfile into memory.

@fmeum fmeum force-pushed the lockfile-version branch 5 times, most recently from e15df8c to 6adbd18 Compare November 11, 2023 13:39
@fmeum fmeum marked this pull request as ready for review November 25, 2023 09:03
@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Nov 25, 2023
Instead of using a regex, get the lockfile version by parsing the
lockfile as arbitrary JSON and then later fit it to the expected schema.
Copy link
Member

@meteorcloudy meteorcloudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Overall looks good! But @SalmaSamy should take a look to confirm.

try (var reader =
new BufferedReader(
new InputStreamReader(lockfilePath.asPath().getInputStream(), UTF_8))) {
BazelLockFileValue bazelLockFileValue = gson.fromJson(reader, BazelLockFileValue.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did consider this but we wanted to avoid parsing the whole file if the version is already old. Which I think is better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what worries you about parsing the whole file if the version is old? If it is about performance, I would find optimizing for the common case of a matching version more natural.

But then again, I don't know whether lockfiles will get so large that reading them into memory instead of streaming them makes a measurable difference even for large monorepos, so it's possible that there isn't much impact either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance & why parse the whole file if we might not use all these info?

  • If the version is old we will probably get a parsing exception then parse it again to get the version. So I think just getting the version first and parse it once is better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this change definitely does more work in the case where the version doesn't match. But avoiding that work, as is done at HEAD, also comes at a cost: The lockfile is fully read into memory (instead of streaming it into the JSON parser) and searched for the regex every time, which is overhead that would be avoided with this change.

I would say it's a tradeoff and lean towards optimizing the common case, but it's quite possible that it won't make much of a difference. I can try to benchmark this, but lack access to a realistic very large MODULE.bazel file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to fully lay it out:

  • HEAD: always stores whole lockfile as a string in memory, in addition to the parsed lockfile structures; parses JSON once in happy path, 0 times otherwise
  • this PR: only stores the parsed lockfile structures; parses JSON once in happy path, twice otherwise

I wonder if we can try to "cheat" a bit and take the best of both worlds: always read the first 5 lines (or 100 characters or whatever) of the lockfile to look for the version. If happy, parse JSON by streaming, otherwise directly barf. This avoids both the memory issue and the double-parsing, and the downside of "we might have a lockfile whose version is specified lower down in the file" is negligible IMO.

Copy link
Contributor

@SalmaSamy SalmaSamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed, check my comment.

@meteorcloudy meteorcloudy added awaiting-user-response Awaiting a response from the author and removed awaiting-review PR is awaiting review from an assigned reviewer labels Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants