Dip1036e - enhanced interpolation #15715

Merged Jan 20, 2024

adamdruppe
Contributor

@adamdruppe adamdruppe commented Oct 20, 2023

This is based on an older draft of dip1036 but merging in the benefits from the YAIDIP.

This slight change supports interpolation of tuples and nested i-strings while still keeping the full CTFE capability of yaidip. It also retains the simplified processing of the original dip1036.

See example repo here with variety of use cases: https://github.com/adamdruppe/interpolation-examples


The interpolated expression sequence is a string literal prefixed with the letter i in the source code, with embedded items of the form $identifier or $(expression), where the identifier and expression are defined according to normal D rules. The lexer considers it a single token that may follow other token rules inside, similar to the existing q{} in D. In double-quoted i-strings, you can write \$ to produce a dollar sign followed by ( or identifier characters without triggering interpolation. This escape does not apply in other kinds of strings and i-strings.

Its semantics are to convert the interpolated expression sequence token into a tuple of the form:

i"foo $bar $(baz + 4) ok"
becomes
(InterpolationHeader(),
 InterpolatedLiteral!"foo "(),
 InterpolatedExpression!"bar"(),
 bar,
 InterpolatedLiteral!" "(),
 InterpolatedExpression!"baz + 4"(),
 baz + 4,
 InterpolatedLiteral!" ok"(),
 InterpolationFooter())

(Please note each of the Interpol* structs there is defined in core.interpolation and is strictly looked up from that module, not from the current scope.)

That is, each part of the original string is broken up into items and written in order, with the actual value following the output in the sequence too.
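To make the lowering concrete, here is a minimal sketch of a consumer that labels every element it receives. The name `describe` is hypothetical and not part of this PR, and the sketch assumes a compiler that ships `core.interpolation` and the `$(expression)` form.

```d
import core.interpolation;
import std.conv : to;

// Hypothetical helper (not part of the PR): labels each element of the
// sequence so the shape of the lowered tuple is visible as text.
string describe(Args...)(InterpolationHeader, Args args, InterpolationFooter)
{
    string result;
    foreach (arg; args) // unrolled at compile time, one step per tuple element
    {
        alias T = typeof(arg);
        static if (is(T == InterpolatedLiteral!lit, string lit))
            result ~= "[lit:" ~ lit ~ "]";        // a piece of the original string
        else static if (is(T == InterpolatedExpression!code, string code))
            result ~= "[expr:" ~ code ~ "]";      // source text of the expression
        else
            result ~= "[value:" ~ arg.to!string ~ "]"; // the evaluated expression
    }
    return result;
}

void main()
{
    import std.stdio : writeln;
    int bar = 7;
    writeln(describe(i"foo $(bar) ok"));
}
```

The header and footer are matched by the first and last parameters, so only the literals, the expression metadata, and the values show up in `args`.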

This is easier to explain if you just look at the source code and/or the examples so idk why im writing this.


Only i"" is implemented in the lexer at this time but the intention is to do it for all of them. I'll come back to it eventually. The string suffixes could also be applied to the literals it passes to the templates inside if we want.

@dlang-bot
Contributor

Thanks for your pull request and interest in making D better, @adamdruppe! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#15715"

@rikkimax
Contributor

rikkimax commented Oct 20, 2023

Only two things I'm not happy about:

  1. $$ instead of \$, enable regular escapes so that there is the least number of surprises
  2. No format per segment option available in syntax

Needs a test case, and chuck it under a preview flag awaiting DIP.

Other than these four things, this looks to be the simplest solution and by all rights should be mergeable.

@adamdruppe
Contributor Author

I don't really care on $$ vs \$ except insomuch as it affects the layering. This implementation could do either equally well, but would need additional work to be added for the different kinds of string literals.

Formatting blocks are irrelevant, that's a library concern, not the language's responsibility.

@rikkimax
Contributor

Formatting blocks are irrelevant, that's a library concern, not the language's responsibility.

I only half agree with this. Once extracted the formatting string is the library's concern. However, it is the language's responsibility to keep the format associated with the expression independent from the string that can be printed. This then matches say Python/C++'s {expr:format} syntax in terms of scope.

@adamdruppe
Contributor Author

No, it has nothing to do with the language. The language doesn't even need to know what a format string even is. (And I suspect most uses of this feature also won't care about it.)

You can do this kind of thing if you want it formally tied together:

import core.interpolation;
import std.stdio;

struct Fmt(T) {
        T what;
        string fmt;
}

auto fmt(T)(T what, string fmt) {
        return Fmt!T(what, fmt);
}

void foof(T...)(InterpolationHeader header, T args, InterpolationFooter footer) {
        foreach(arg; args) {
                pragma(msg, typeof(arg)); // compile-time dump of each element's type
                static if(is(typeof(arg) == Fmt!A, A))
                        stdout.writef(arg.fmt, arg.what); // value carries its own format
                else
                        stdout.write(arg);
        }
}

void main() {
        int a = 30, b = 40;
        foof(i"$(fmt(a, "%3d")) < $(fmt(b,"%x"))\n");
}

Simple implementation, simple use. Follows all existing language rules. You could also ufcs if you prefer: $(a.fmt("%3d")) is, of course, equally valid.

Let's not add unnecessary special cases to the compiler.

@schveiguy
Member

schveiguy commented Oct 21, 2023

void foof(T...)(InterpolationHeader header, T args, InterpolationFooter footer) {

FYI, I don't think this works. Variadics must come last, only default-arg parameters are allowed after them (and those can't be inferred using IFTI)

I take it back, it does work! I must have been confused about something else.

UPDATE: Yeah, it's only parameters that have default values that can't match IFTI arguments. That's what I was thinking of.

@adamdruppe
Contributor Author

Yeah, the reason I slapped together the implementation is to try it, so we can be sure about things like that. That said though, if you did a nested interpolated element (which lol I thought was broken for a sec because I wrote "i$( instead of i"$(; I'd probably recommend against nesting these for readability purposes, but it is a nice stress test of the implementation anyway), you will match it like this:

foof(i"$(i"$a")$b");

void foof(T...)(InterpolationHeader header, T args, InterpolationFooter footer) {
        foreach(arg; args) {
                pragma(msg, typeof(arg));
        }
}

Gives:

InterpolatedLiteral!""
InterpolatedExpression!"i\"$a\""
InterpolationHeader
InterpolatedLiteral!""
InterpolatedExpression!"a"
int
InterpolatedLiteral!""
InterpolationFooter
InterpolatedLiteral!""
InterpolatedExpression!"b"
int
InterpolatedLiteral!""

Which looks fine, there's the nested thing, but if you wanted to do it recursively, it wouldn't match anymore; you'd have to count open Headers and closed Footers to slice the arguments tuple; the compiler won't magically be able to do that for you (similar to regex matching parens etc).

Of course, if you just ignored that it is nested, most things would still work! But still, the bracketing making it at least possible to process it in full detail is nice.
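As a rough illustration of the header/footer counting described above, the sketch below tracks nesting depth while walking the flat tuple and marks where each nested sequence starts and ends. The function name `bracketNested` is made up for this example, and it assumes the `$(expression)` form and the druntime types from `core.interpolation`.

```d
import core.interpolation;
import std.conv : to;

// Hypothetical helper: wraps every nested sequence in <...> by counting
// the headers and footers that appear inside the outer sequence.
string bracketNested(Args...)(InterpolationHeader, Args args, InterpolationFooter)
{
    string result;
    int depth = 0; // number of nested sequences currently open
    foreach (arg; args)
    {
        alias T = typeof(arg);
        static if (is(T == InterpolationHeader))
        {
            result ~= "<";
            depth++;
        }
        else static if (is(T == InterpolationFooter))
        {
            result ~= ">";
            depth--;
        }
        else static if (is(T == InterpolatedLiteral!lit, string lit))
            result ~= lit;
        else static if (is(T == InterpolatedExpression!code, string code))
        {
            // expression source text is metadata, skipped here
        }
        else
            result ~= arg.to!string;
    }
    assert(depth == 0, "unbalanced nested sequence");
    return result;
}

void main()
{
    import std.stdio : writeln;
    int a = 1, b = 2;
    writeln(bracketNested(i"$(i"$(a)")-$(b)"));
}
```

A library that wants each nested sequence as a unit could use the same depth counter to find the slice boundaries instead of emitting brackets.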

(BTW I do think we could and should go ahead and trim out the empty interpolated literals in semantic, since the presence of InterpolatedExpression makes them redundant and the odd/even rule doesn't work once you have an interpolated tuple anyway, so probably better to not even make it look possible.)

@adamdruppe
Contributor Author

so btw re the escape thing, $$ or \$ or whatever, there is a third option: no escape at all. You can always do $('$') under the existing rules anyway!

Member

@WalterBright WalterBright left a comment

Lacks:

  1. complete and accurate specification, or even a description
  2. comments
  3. test cases

@jmh530

jmh530 commented Oct 23, 2023

On the interpolation-examples page, you might include a comparison of how you would do it with DIP 1027 (if possible).

@adamdruppe
Contributor Author

adamdruppe commented Oct 24, 2023

So here's a question for everyone: what about postfixes? Supporting them in the compiler is a two line patch, forwarding the postfix to all the child strings, but that also changes the public api: InterpolatedLiteral!"foo"w is not the same type as InterpolatedLiteral!"foo".

Expanding the druntime templates to allow this is easy enough, but it'd affect every user too as they'd have to check for the various string types.

I don't think it is worth it. If you want a wstring or dstring, it is easy to call a function that creates one given the existing text.

We could pass the postfix as an argument, maybe to the InterpolationHeader, but I also don't think that'd be worth it.

So my proposal is to just go ahead and ban the postfixes on these literals, as you can see implemented in the following commit.

Any big objections?

@adamdruppe
Contributor Author

The added failure test case asserts that there is an error with postfixes since that's banned (unless people come in with good objections, but remember, it can complicate user code so be prepared to justify it), and the runnable test case asserts things work as designed, including in some embedded char cases that might be trickier to parse.

@schveiguy
Member

Yeah, I agree, leave suffixes alone.

I would like to see wysiwyg strings work, like they were planned in 1036. Is that a thing here?

@adamdruppe
Contributor Author

adamdruppe commented Oct 25, 2023

Yes, like I said in the (edited) opening post: "Only i"" is implemented in the lexer at this time but the intention is to do it for all of them. "

That's why the implementation factors out the stuff into a separate function, just need to plug that into the other locations too. I'm just doing this in between other obligations so it is kinda in 30 minute spurts. I don't think those will take long to add so maybe next time.

Probably next thing to do will be i`` strings then iq{} strings.

@baryluk

baryluk commented Oct 26, 2023

Answering here instead of the forum, where it will get lost.

Not a review, but things to consider. (As I do not see other good place for this).

Around 2007, I wrote string interpolation implementation toy library in D.

https://github.com/baryluk/echo/blob/master/echo.d

It still compiles and it works! I am sure it was written in D1, but compiles today with latest compiler without a single warning.

(there is also echo2 in echo_static.d, but that one depends on std.stdarg, so it does not compile).

Pretty functional, but I did not use it that much actually. It was just proof of concept written on some lazy evening long long time ago.

If I would do it now (after using many other languages and implementing also countless printf-like interfaces from scratch in few languages, primarily C, C++, D, Python, Erlang and Go), I would:

Use Python-style:

x=5
y=1
z=666
d=9  # runtime width value
f"a {x} and {y:3} and {z:{d}}"  # formatted string
# a 5 and   1 and       666

It is clean, allows custom formatting (width, precision/decimals, alignment using <, >, control leading zeros and force leading sign character, thousands grouping, scientific suffix notations maybe even, hex/oct/binary, hex floats, etc), and make it also variable (i.e. width can be another expression as d in example above).

Be semi-lazy, do not form a string, but instead object (could be tuple) that can be passed to proper sink in a streaming fashion (expressions in-between {} should be evaluated, but not converted to a string). Being it a tuple of heterogeneous types, is fine.

Allow specifying custom formatters, that should either be interpreted by the type being formatted, or by the sink.

Example:

from datetime import date
major = 3
minor = 11
release = date(2023, 10, 2)
print(f"Python {major}.{minor + 1:03d} is released on {release:%B %-d}")

In first version we could disallow usage of : for formatting, and flesh this out later. (Probably instead of tuple of values, pass tuple of pairs, with one being value, and other being some metadata, including line/column, formatting options, and literal expression from original format string).

Allow nesting (like in Python 3.12):

f"""{
     f'''{
         f"{f'{42}'}"
     }'''
}"""

Be sure to support escape sequences properly to support char and string literals:

f"{'\n'.join(words)}"

Support multi-line and comments:

f"""Storing employee's data: {
     employee['name'].upper()  # Always uppercase name before storing
}"""

Good error messages:

>>> f"{42 + }"
  File "<stdin>", line 1
    f"{42 + }"
          ^
SyntaxError: f-string: expecting '=', or '!', or ':', or '}'

And then consider few more things:

concatenation (especially for breaking very long ones):

 write(f"{x} {y}" ~
       f"{z}" ~
       f"suffix");

That should not form a string, but create a mega-tuple with everything. (hard to do if there are ternary operators or function calls between ~).

Escaping of { character done by {{, and for symmetry, same for } with }}.

I think $ prefix character is a noise. It is unnecessary. The only reason I can see for it is:

  • Usage in languages like Shell, Bash, Perl, PHP.

  • Convenience of using it for generating D code at compile time. Generated D code is going to have a lot of { and } sequences.

But in real life (not library and meta programming), { is rare. In fact $ is more frequent. Also it is likely easier to write things like syntax highlighters (that do not do full AST building, or even full tokenization) with always balanced { and }, compared to the few forms here: $(...) and $id.

Also consider:

$f(x)

It is unclear what it does:

$f (x)

$(f(x))

Same with other operators, like dot, [, etc.

Balanced {, }, makes it clear.

Allow self describing formatting:

a = 3
b = 5

print(f"a={a} b={b}")

print(f"{a=} {b=}")   # same as above

# a=3 b=5
# a=3 b=5


print(f"{a+b=} {f(x)+2=}")

# a+b=8 f(x)+2=666

And finally, be able to do lazy on formatted/interpolated strings:

void MaybeLog(Args...)(lazy Args args) {
  if (...) {
    static foreach (arg; args) { 
      sink(arg());
    }
  }
}

MaybeLog(f" {f(x)} {time()} {factorial(y)}");

should of course be "equivalent" to:

MaybeLog(void delegate() { return f" {f(x)} {time()} {factorial(y)}"; });

And finally consider syntax that is amenable to having similar string used for formatting where format string is known only at runtime (with some restrictions, and probably only subset of features supported). It should be possible to implement runtimeFormat("{x} {y}", string[string]["x": ..., "y"]) for example (either directly to string or to sink), or auto tpl = compileFormat(some_runtime_string()); ....; tpl.execute(sink, params....). Of course it is easier to do in dynamic languages, but still something to consider.

Adam's proposal is decent, but not perfect. (I really dislike $, and $() as mentioned).

Looking at example, it is pretty cool to use it for things like Url escaping, SQL, internationalization.

InterpolationHeader is a bit noisy, and unnecessary at first glance, but I do understand this is to support "flat" nesting.

I think it would be more natural to just have it directly nested, with some elements of the tuple being sub-tuples. Not sure how to type it, but it should be possible. You can always flatten it later easily, and even have a helper template for this in Phobos. (You can also unflatten, but it is less easy.)

What about this:

foo(i"$(a)", f(), g(), i"$(y)");

How would you declare a generic function signature for this, so it is still easy to implement?

Also, of course this feature, however implemented, should be usable without normal runtime or phobos. (I.e. in embedded system or kernel on bare metal).

PS. Only support string, no dstring or wstring. I never used them in last 20 years. And never before either.

PS2. Be sure to check Swift language formatting. It is pretty well engineered. I did not use Swift personally, but it is very versatile what they do. I do not want it like that in D, but still a good source of inspiration.

PS3.

In the closed DIP you show these examples as the justification for $:

enum result = text(
    i"@property bool {name}() @safe pure nothrow @nogc const {{
        return ({store} & {maskAllElse}) != 0;
    }}
    @property void {name}(bool v) @safe pure nothrow @nogc {{
        if (v) {store} |= {maskAllElse};
        else {store} &= cast(typeof({store}))(-1-cast(typeof({store})){maskAllElse});
    }}\n"
);

vs

enum result = text(
    i"@property bool $name() @safe pure nothrow @nogc const {
        return ($store & $maskAllElse) != 0;
    }
    @property void $name(bool v) @safe pure nothrow @nogc {
        if (v) $store |= $maskAllElse;
        else $store &= cast(typeof($store))(-1-cast(typeof($store))$maskAllElse);
    }\n"
);

It is my opinion, that in fact the first one (Python style) is cleaner.

@rikkimax
Contributor

I have the {:} syntax implemented in my formatter. Works well with date/time.

As long as we go with $(), it can be extended for ${:} later on without changing the meaning wrt. expressions.

But I am firmly in the belief that interpolated strings should be based upon double quoted strings and that includes the suffixes. Those too can be added later, so are not something that has to limit an acceptance of 1036e.

@@ -1784,9 +1866,17 @@ class Lexer
* D https://dlang.org/spec/lex.html#double_quoted_strings
* ImportC C11 6.4.5
*/
-    private void escapeStringConstant(Token* t)
+    private void escapeStringConstant(Token* t, bool supportInterpolation = false)

It is probably better to split this into two functions (with a bit of repetition) instead. Should help with compilation speed.

Contributor Author

Do you have measurements of compilation speed difference?

Of course, such internal implementation details can be changed at any time.

@benjones
Contributor

benjones commented Oct 26, 2023

I have the {:} syntax implemented in my formatter. Works well with date/time.

As long as we go with $(), it can be extended for ${:} later on without changing the meaning wrt. expressions.

With this proposal, you could write your own format function that takes i"{$myVar:3}" and turns it into a call to newFormat("{%d:3}", myVar), right? Any specific formatting lib API could be built on top of this without any more compiler help, I think?

@adamdruppe
Contributor Author

Yes, you could do that. It'd be kinda similar to the code in the 02-formatting example, but the tokens it looks for would be in both the preceding literals and the following one.

Can pretty easily throw if something is malformed too in the library code.
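Here is one possible shape for such a library function, as a sketch only. The name `braceFormat` and the `{$(value):width}` convention are assumptions made up for this example; it handles nothing but a right-aligned width, looks for `{` at the end of the preceding literal and `:width}` at the start of the following one, and throws on a missing `}`.

```d
import core.interpolation;
import std.conv : to;

// Hypothetical library function, not part of the PR: understands only
// "{$(value):width}" and right-aligns the value in `width` columns.
string braceFormat(Args...)(InterpolationHeader, Args args, InterpolationFooter)
{
    string result;
    string pending;            // text of the last value, not yet emitted
    bool havePending = false;

    // Emits the pending value. If the text before it ended in "{" and the
    // literal after it starts with ":", consumes ":width}" from `next`.
    // Returns whatever part of `next` is still ordinary text.
    string emit(string next)
    {
        if (!havePending)
            return next;
        havePending = false;
        if (result.length && result[$ - 1] == '{' && next.length && next[0] == ':')
        {
            size_t close = 1;
            while (close < next.length && next[close] != '}')
                close++;
            if (close == next.length)
                throw new Exception("missing '}' after format spec");
            immutable width = next[1 .. close].to!size_t;
            result = result[0 .. $ - 1];          // drop the "{"
            foreach (_; pending.length .. width)
                result ~= ' ';                    // left-pad up to the width
            result ~= pending;
            return next[close + 1 .. $];
        }
        result ~= pending;
        return next;
    }

    foreach (arg; args)
    {
        alias T = typeof(arg);
        static if (is(T == InterpolatedLiteral!lit, string lit))
        {
            auto rest = emit(lit);
            result ~= rest;
        }
        else static if (is(T == InterpolatedExpression!code, string code))
        {
            // expression source text is not needed in this sketch
        }
        else static if (is(T == InterpolationHeader) || is(T == InterpolationFooter))
        {
            // nested-sequence markers are ignored in this sketch
        }
        else
        {
            auto rest = emit("");   // flush a value that had no literal after it
            result ~= rest;
            pending = arg.to!string;
            havePending = true;
        }
    }
    auto tail = emit("");           // flush the final value, if any
    result ~= tail;
    return result;
}

void main()
{
    import std.stdio : writeln;
    int count = 42;
    writeln(braceFormat(i"[{$(count):5}] $(count)"));
}
```

Because the literals arrive as template arguments, a stricter version could presumably validate the format specs at compile time instead of throwing at run time.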

I really encourage people to experiment with this before dismissing it and/or demanding changes. Some of the techniques in the library are a bit of a pain - I'm not gonna tell you handling nested sequences as a unit or handling a sequence of sequences is necessarily trivial, you might have to get creative with tuple slicing and object wrapping - but it is all doable and I'm building up a little set of examples for many of these things.

The amount of things this little change to dmd enables is really remarkable.

and omg rebase again, again, on the auto-generated file. that's obnoxious but thankfully not hard to resolve

@WalterBright
Member

There's this in the spec:

// only $ followed by a ( is special.
// so the double $ here is a basic one followed by an
// interpolated var. Can also use \$ in i"strings"
// then :% is interpreted *by this function* to mean
// "use this format string for the preceding argument"
writefln(i"$(name) has $$(wealth):%0.2f");

which implies it is impossible to generate a "$(" into the output. If my suggestion is incorporated, this could be done with $$(.

@rikkimax
Contributor

rikkimax commented Jan 5, 2024

It took me a while.

But yes, the $$ escape has been removed. The spec is out of date. It currently supports \$( instead.

Member

@WalterBright WalterBright left a comment

It needs a spec PR as well. I expect it would need its own page to properly document it.

@schveiguy
Member

I can work on the spec, thank you!

@schveiguy
Member

Does anyone have info on why it's failing tests?

@adamdruppe
Contributor Author

Possible it just timed out and needs a rebase and repush.

This implements the Enhanced Interpolated Expression Sequence proposal:

i"" or iq{} or i`` with a $(expression) in the middle are converted to a tuple of druntime types for future processing by library code.
@maxhaton added the "72h no objection -> merge" label on Jan 17, 2024
@schveiguy
Member

Looks like the failing test is related to a vector thing:

LINK : fatal error LNK1104: cannot open file 'generated\windows\release\64\vector.exe'
Error: linker exited with status 1104
       C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64\link.exe /NOLOGO "generated\windows\release\64\vector_cpp.obj" "generated\windows\release\64\vector.obj" /OUT:"generated\windows\release\64\vector.exe"  D:/a/1/s/generated/windows/release/64/druntime.lib /OPT:NOICF  /LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\lib\x64" legacy_stdio_definitions.lib /LIBPATH:"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\x64" /LIBPATH:"C:\Program Files (x86)\Windows Kits\10\lib\10.0.22621.0\um\x64"
make[2]: *** [Makefile:35: generated/windows/release/64/vector] Error 1
make[2]: Leaving directory 'D:/a/1/s/druntime/test/stdcpp'
make[1]: *** [Makefile:514: test/stdcpp/.run] Error 2
make[1]: Leaving directory 'D:/a/1/s/druntime'
make: *** [Makefile:457: unittest-release] Error 2

I think there are multiple tests trying to write this file at once.

@schveiguy
Member

It's green now, I think the test might be a race condition, that intermittently fails.

@WalterBright
Member

Looks like an unrelated test heisenbug.

@WalterBright WalterBright merged commit d8dcb94 into dlang:master Jan 20, 2024
46 checks passed
@schveiguy
Member

Only i"" is implemented in the lexer at this time but the intention is to do it for all of them

I'm debating if I should change the changelog, or try and do the other two string types. Anyone else know how to add them? I'm not good at compiler.

@adamdruppe
Contributor Author

Where's that quote from? The changelog and the lexer are in agreement: three forms are implemented and three forms are described (i"", i``, and iq{}).

@schveiguy
Member

ok, it was in the top. I didn't read the whole thing, sorry.

@schveiguy
Member

I need to pull this down to a linux box to test, dmd doesn't work on my mac...

@adamdruppe
Contributor Author

oh yeah, oops, i wrote that months ago and never updated since october lol

I need to pull this down to a linux box to test, dmd doesn't work on my mac...

the ldc bundled with OpenD's release download supports it too https://github.com/opendlang/opend/releases/tag/CI

just sayin lol

@schveiguy
Member

I'm writing the spec, so I need to actually play with this and see how it works. Will post a link when I'm done. Man, DDOC is so painful to write in...

@baryluk

baryluk commented Feb 5, 2024

Hmm.

How do I escape $(

$ cat hello.d 
import std;

void main(string[] args) {
    writefln("%s", i"$\(".text);
}
$ ./linux/bin64/dmd hello.d 
hello.d(4): Error: undefined escape sequence \(
$ 

How do I print a literal $(, while in i"" context?

@adamdruppe
Contributor Author

\$(...) put the escape on the dollar sign at the beginning of the sequence. Only works in i"" double quote things. In other contexts, you'd have to do like iq{$("$(")}; interpolate a literal string of the sequence.
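A small sketch of both approaches follows. `flatten` is a hypothetical helper written for this example so that it does not depend on any particular library function; it just concatenates the literals and values.

```d
import core.interpolation;
import std.conv : to;

// Hypothetical helper: concatenates literals and values, drops metadata.
string flatten(Args...)(InterpolationHeader, Args args, InterpolationFooter)
{
    string result;
    foreach (arg; args)
    {
        alias T = typeof(arg);
        static if (is(T == InterpolatedLiteral!lit, string lit))
            result ~= lit;
        else static if (is(T == InterpolatedExpression!code, string code))
        {
            // expression source text is skipped
        }
        else
            result ~= arg.to!string;
    }
    return result;
}

void main()
{
    int n = 5;
    // In i"" strings, escaping the dollar sign keeps "$(" as literal text.
    assert(flatten(i"\$(n) is $(n)") == "$(n) is 5");
    // Token strings have no escapes, so interpolate a string holding "$(".
    assert(flatten(iq{$("$(")}) == "$(");
    // A dollar sign that is not followed by "(" needs no escape at all.
    assert(flatten(i"costs $5") == "costs $5");
}
```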

@rikkimax
Contributor

rikkimax commented Feb 5, 2024

Hmm.

How do I escape $(

$ cat hello.d 
import std;

void main(string[] args) {
    writefln("%s", i"$\(".text);
}
$ ./linux/bin64/dmd hello.d 
hello.d(4): Error: undefined escape sequence \(
$ 

How do I print a literal $(, while in i"" context?

Escape the dollar \$ instead. The dollar is the starting point for parsing out the interpolated sequence.

@baryluk

baryluk commented Feb 5, 2024

I see.

using i"\$(".

That is not very intuitive.

Because $ without following (, can be just written as $. (Apparently \$ also works, but I do not see this documented).

@rikkimax
Contributor

rikkimax commented Feb 5, 2024

It is the same as every other escape.

I.e. \n.

@baryluk

baryluk commented Feb 5, 2024

It is the same as every other escape.
\n

There is a finite number of escapes defined. I.e. one cannot do \q, as that is an undefined escape.

https://dlang.org/spec/lex.html#escape_sequences - does not list \$. I think it should still create an error in normal string, but be accepted in interpolated strings

(I know the docs are not yet updated).

@schveiguy
Member

FWIW, the spec is not updated. I'm working on it.

https://github.com/schveiguy/dlang.org/blob/istring/spec/istring.dd

@baryluk

baryluk commented Feb 6, 2024

FWIW, the spec is not updated. I'm working on it.

schveiguy/dlang.org@istring/spec/istring.dd

Looking good so far.

I am assuming one can add parameter attributes, including lazy?

void processIES(Sequence...)(InterpolationHeader, lazy Sequence data, InterpolationFooter)
{
    // process data here
}

@schveiguy
Member

Whatever works today should work there.

@schveiguy
Member

spec is ready for review: dlang/dlang.org#3768

Labels: 72h no objection -> merge, New Language Feature