[Strings] Add a string-builtins feature, and lift/lower automatically when enabled #7601

Open
wants to merge 26 commits into
base: main

kripken
Member

@kripken kripken commented May 16, 2025

This makes string optimizations happen automatically when
--enable-string-builtins (or -all) is passed.

The lifting and lowering happen globally, at optimal times in the
pipeline, so even -O3 -O3 -O3 will only lift once and lower once.
That avoids the overhead of the approach in #7540, which this PR replaces.

TODO: document in optimizer cookbook
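
To illustrate the "lift once, lower once" idea, here is a minimal sketch. It is not the PR's code: the `Ordering` struct, the `buildPipeline` function, and the pass labels are hypothetical stand-ins, and each optimization round is reduced to a single placeholder pass.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the ordering flags discussed in this PR.
struct Ordering {
  bool first = false; // this is the first optimization round
  bool last = false;  // this is the last optimization round
};

// Build the pass list for `rounds` repeated optimization rounds
// (e.g. 3 for "-O3 -O3 -O3"). Pass labels are illustrative only.
std::vector<std::string> buildPipeline(int rounds, bool stringBuiltins) {
  assert(rounds >= 0);
  std::vector<std::string> passes;
  for (int i = 0; i < rounds; i++) {
    Ordering ordering;
    ordering.first = (i == 0);
    ordering.last = (i == rounds - 1);
    if (stringBuiltins && ordering.first) {
      passes.push_back("lift-strings"); // only before the first round
    }
    passes.push_back("optimize"); // placeholder for the round's normal passes
    if (stringBuiltins && ordering.last) {
      passes.push_back("lower-strings"); // only after the last round
    }
  }
  return passes;
}
```

With three rounds and the feature enabled, this sketch produces five entries: one lift, three optimize placeholders, and one lower.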

@kripken kripken requested a review from tlively May 16, 2025 00:00
Member

@tlively tlively left a comment

Nice!

Comment on lines +432 to +434
PassRunner::Ordering ordering;
ordering.first = (i == firstDefault);
ordering.last = (i == lastDefault);
Member

Alternatively:

Suggested change
- PassRunner::Ordering ordering;
- ordering.first = (i == firstDefault);
- ordering.last = (i == lastDefault);
+ PassRunner::Ordering ordering{i == firstDefault, i == lastDefault};

Member Author

I kind of like writing out which is first and which is last? I guess the natural order is (first, last) but it is still easier to read I think.
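
The trade-off between the two forms can be sketched as follows. The `Ordering` struct here is a hypothetical two-field stand-in (only the field names `first` and `last` come from the snippet under review), and the helper names are made up for illustration.

```cpp
#include <cassert>

// Hypothetical two-flag struct mirroring the fields used in the reviewed code.
struct Ordering {
  bool first;
  bool last;
};

// Field-by-field form: each flag is named at the point of assignment.
Ordering makeExplicit(int i, int firstDefault, int lastDefault) {
  Ordering ordering;
  ordering.first = (i == firstDefault);
  ordering.last = (i == lastDefault);
  return ordering;
}

// Aggregate form: shorter, but correctness depends on the declaration order
// of the fields, and swapping the two arguments would still compile.
Ordering makeAggregate(int i, int firstDefault, int lastDefault) {
  return Ordering{i == firstDefault, i == lastDefault};
}

// Usage sketch: both forms agree for every index.
void checkFormsAgree() {
  for (int i = 0; i < 3; i++) {
    Ordering a = makeExplicit(i, 0, 2);
    Ordering b = makeAggregate(i, 0, 2);
    assert(a.first == b.first && a.last == b.last);
  }
}
```

Both forms compute the same flags; the difference is only in how visible the field names are to the reader.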


;; RUN: foreach %s %t wasm-opt -O2 --enable-reference-types -S -o - | filecheck %s --check-prefix=MVP
;; RUN: foreach %s %t wasm-opt -O2 -all -S -o - | filecheck %s --check-prefix=ALL
;; RUN: foreach %s %t wasm-opt -O2 --enable-reference-types --enable-string-builtins -S -o - | filecheck %s --check-prefix=ESB
Member

If you enable GC here, the printed output should be the same as with -all, so you could use one fewer check prefix.

Member Author

Good idea, done.

@kripken
Member Author

kripken commented May 16, 2025

Hmm, this PR fails many tests because of these TODOs for typed continuations:

void visitContNew(ContNew* curr) { WASM_UNREACHABLE("not implemented"); }
void visitContBind(ContBind* curr) { WASM_UNREACHABLE("not implemented"); }
void visitSuspend(Suspend* curr) { WASM_UNREACHABLE("not implemented"); }
void visitResume(Resume* curr) { WASM_UNREACHABLE("not implemented"); }
void visitResumeThrow(ResumeThrow* curr) {
  WASM_UNREACHABLE("not implemented");
}
void visitStackSwitch(StackSwitch* curr) {
  WASM_UNREACHABLE("not implemented");
}

This problem becomes noticeable in this PR because we have tests that use -all on continuations code, and now we run StringLowering, which uses SubtypingDiscoverer, so we try to operate on those instructions.

So this PR is blocked on those TODOs.
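
The failure mode can be sketched in miniature: an unimplemented visitor method stays harmless until some pass actually walks code containing that instruction. Everything below is a simplified, hypothetical model rather than Binaryen's API, and an exception stands in for the abort so the example can report the failure instead of terminating.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical instruction kinds; Suspend stands in for the continuation
// instructions listed above.
enum class Kind { Const, Suspend };

// Simplified visitor: one method is implemented, one is still a TODO.
struct Visitor {
  int visited = 0;
  void visitConst() { visited++; }
  // Stand-in for WASM_UNREACHABLE("not implemented").
  void visitSuspend() { throw std::logic_error("not implemented"); }
  void visit(Kind kind) {
    switch (kind) {
      case Kind::Const:
        visitConst();
        break;
      case Kind::Suspend:
        visitSuspend();
        break;
    }
  }
};

// Returns true if a pass built on Visitor can walk the whole function body.
bool canWalk(const std::vector<Kind>& body) {
  Visitor visitor;
  try {
    for (Kind kind : body) {
      visitor.visit(kind);
    }
  } catch (const std::logic_error&) {
    return false;
  }
  return true;
}

// Usage sketch: the walk only fails once the TODO instruction appears.
void checkWalk() {
  assert(canWalk({Kind::Const, Kind::Const}));
  assert(!canWalk({Kind::Const, Kind::Suspend}));
}
```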

@tlively
Member

tlively commented May 16, 2025

I can take a look at those TODOs later today, unless you plan on working on them first.

@kripken
Member Author

kripken commented May 16, 2025

Thanks, I don't think I'd have time today myself.

tlively added a commit that referenced this pull request May 16, 2025
Now that `string` is a subtype of `extern`, the null type for strings
and externrefs is the same, so we no longer need to fix up nulls in
StringLowering.

Unblocks #7601.
Member

Do you know why the type indices are changing here? Are we actually emitting more types in the output?

Member Author

I think we are just running a type-rewriting pass we weren't before, leading to different sorting of the names.

// between.
if (wasm->features.hasStringBuiltins() && wasm->features.hasGC() &&
    options.optimizeLevel >= 2 && ordering.last) {
  addIfNoDWARFIssues("string-lowering-magic-imports");
Member

Should we use string-lowering-magic-imports-assert to make sure we aren't accidentally doing any optimizations that would result in us emitting a non-standard custom section for non-UTF-8 string constants?

Member Author

Wait, isn't the section standardized?

Member

No. The standard solution is the magic imports, which can only handle valid UTF-8 strings. The custom section is a random thing we experimented with on the way to developing magic imports.

Member Author

What is the spec solution for non-UTF-8 strings, then?

Separately, if we assert here, then any module with a non-UTF-8 stringref will assert if you just do -all -O2... that seems wrong to me.

Member

There is no spec solution for non-UTF-8 strings. We could synthesize them in a start function if we had to, but I don't think that should be necessary. No input module should have non-UTF-8 strings because they are not expressible with string builtins, and our optimizations should not produce new invalid non-UTF-8 strings, so no output module should have non-UTF-8 strings. Input modules that use stringref to represent non-UTF-8 strings probably don't want this lowering to occur in the first place.
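
As a rough illustration of what "non-UTF-8" means for a string constant, the sketch below checks a sequence of 16-bit code units for unpaired surrogates, which have no valid UTF-8 encoding. It assumes the constant is available as UTF-16-style code units; that representation and the function name `hasUnpairedSurrogate` are assumptions made for this example, not something taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if the code unit sequence contains a surrogate that is not
// part of a valid high+low pair. Such a string cannot be encoded as UTF-8.
bool hasUnpairedSurrogate(const std::u16string& s) {
  for (std::size_t i = 0; i < s.size(); i++) {
    char16_t c = s[i];
    if (c >= 0xD800 && c <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      if (i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (c >= 0xDC00 && c <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Usage sketch.
void checkSurrogates() {
  assert(!hasUnpairedSurrogate(u"hello"));
  assert(hasUnpairedSurrogate(std::u16string(1, char16_t(0xD83D))));
}
```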

Member

> > We can do the lifting/lowering only when stringref is not enabled
>
> Hmm, seems weird for -all to do less than -all --disable-stringref. More features should mean more things to optimize, normally.

Ok, we can do the lifting when either stringref or string-builtins are enabled, then lower only if string-builtins and not stringref are enabled.

> > and also make non-UTF-8 string.const a validation error when stringref is not enabled.
>
> When stringref is disabled, any string.const (UTF8 or otherwise) is invalid anyhow?

string.const should still be valid when string-builtins are enabled. We might want to error in binary writing if they haven't been lowered and stringref is not enabled, though.
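
The rule proposed in this comment can be written down as two small predicates. This is only a sketch of the proposal as worded here, not of what the PR currently implements; the `Features` struct and the function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical feature set with just the two string features under discussion.
struct Features {
  bool stringref = false;
  bool stringBuiltins = false;
};

// Lift when either string feature is enabled.
bool shouldLift(const Features& f) { return f.stringref || f.stringBuiltins; }

// Lower only when string-builtins is enabled and stringref is not.
bool shouldLower(const Features& f) { return f.stringBuiltins && !f.stringref; }

// Usage sketch: with both features on, strings are lifted but kept as
// stringref in the output.
void checkBothEnabled() {
  Features both;
  both.stringref = true;
  both.stringBuiltins = true;
  assert(shouldLift(both));
  assert(!shouldLower(both));
}
```

Under this rule, enabling more features never disables lifting, which addresses the concern that -all would do less than -all --disable-stringref.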

Member Author

Hmm, these are valid options, but they all seem significantly more complex to explain and to use. The readme text would need to be substantially longer.

I don't have a better suggestion for this PR yet, but let's keep thinking here.

Member

On the contrary, I don't think there's much to explain. As always, we optimize to the extent we can given the enabled features, and we validate that the IR will be written as valid modules given the enabled features.

We should document that string.const is valid IR if either of the string features are enabled, but only on the condition that it will be lowered before writing if stringref is disabled. We should also document that string.const containing unpaired surrogates is only valid if stringref is enabled. I don't think we need to document how automatic lifting and lowering works, just as we don't need to document the specifics of what other optimizations we run.
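
A compact way to read the validation rules described in this comment is as a pair of predicates. The sketch below encodes only what the comment states; the `Features` struct and function names are hypothetical, and nothing here is taken from Binaryen's validator.

```cpp
#include <cassert>

// Hypothetical feature set with just the two string features under discussion.
struct Features {
  bool stringref = false;
  bool stringBuiltins = false;
};

// string.const is valid IR if either string feature is enabled, except that
// a constant with unpaired surrogates additionally requires stringref.
bool isValidStringConst(const Features& f, bool hasUnpairedSurrogate) {
  if (!f.stringref && !f.stringBuiltins) {
    return false;
  }
  if (hasUnpairedSurrogate && !f.stringref) {
    return false;
  }
  return true;
}

// A string.const that reaches the binary writer without having been lowered
// can only be written when stringref is enabled.
bool canWriteUnlowered(const Features& f) { return f.stringref; }

// Usage sketch: with only string-builtins, well-formed constants are valid IR
// but must be lowered before writing.
void checkBuiltinsOnly() {
  Features builtinsOnly;
  builtinsOnly.stringBuiltins = true;
  assert(isValidStringConst(builtinsOnly, false));
  assert(!isValidStringConst(builtinsOnly, true));
  assert(!canWriteUnlowered(builtinsOnly));
}
```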

Member Author

I'd say all of these quotes are non-trivial and potentially confusing for our users:

  • "do the lifting when either stringref or string-builtins are enabled, then lower only if string-builtins and not stringref are enabled"
  • "string.const is valid IR if either of the string features are enabled, but only on the condition that it will be lowered before writing if stringref is disabled"
  • "string.const containing unpaired surrogates is only valid if stringref is enabled."

Just adding that text by itself would double the current PR's readme entry.

I really feel we should find something simpler here.

Member

> I'd say all of these quotes are non-trivial and potentially confusing for our users:
>
>   • "do the lifting when either stringref or string-builtins are enabled, then lower only if string-builtins and not stringref are enabled"

This is just us doing optimizations based on the target features. It doesn't need to be documented or understood by users.

>   • "string.const is valid IR if either of the string features are enabled, but only on the condition that it will be lowered before writing if stringref is disabled"

Yeah, this one is weird. If lowering were automatic in the binary writer, this wouldn't need to be documented or understood, either. This complexity is the price we pay to make lowering a separately sequenced pass.

>   • "string.const containing unpaired surrogates is only valid if stringref is enabled."

This one just matches the expressive capabilities of the underlying features. I don't think it needs to be documented beyond error messages.

> Just adding that text by itself would double the current PR's readme entry.
>
> I really feel we should find something simpler here.

This is the validation behavior we would want independent of how we lift and lower strings. The only reason this complexity comes up in this PR and not before is because we didn't have separate string features before this PR. But it's strictly more useful to users to have separate string features with detailed validation to ensure they actually produce valid binaries for the intended engine feature set.
