proposal: fmt: add iterator for format tokens #68279
Comments
I've not run into such use cases but I would imagine based on experience with similar things that an AST (or list, in this case) and a printer would either be preferable or not make much of a difference. The format is fairly simple (I can only recall the syntax changing once and one new directive being added) so this does not seem burdensome to maintain out of std |
Perhaps we have differing definitions of simple, but that wasn't the impression I got. There's ~300 lines of prose explaining the format syntax. I think (but I'm not entirely sure) that one could identify a verb because it always starts with a '%' and ends with a '%' or a letter. |
I mean that it's structurally simple. There's no recursion, few error cases, and most things of note are single runes. |
Had to run some tests to verify, as the prose is not clear about how all the modifiers mix and match, but the full verb format is exactly |
Interesting bit of trivia I found poking around with this. You can set flags and such on an escaped % like |
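A small, hedged illustration of the kind of format string being described (my own snippet; it only prints the results rather than asserting them):

package main

import "fmt"

func main() {
	// Flags, width, and precision are accepted syntactically even when the
	// verb is the literal-percent verb '%'.
	fmt.Printf("%q\n", fmt.Sprintf("%+.3%"))
	fmt.Printf("%q\n", fmt.Sprintf("%-5%"))
}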
I argue that #67510 applies here. If no policy is in place, the library will have iterators implemented by a variety of authors with varying goals, with no clear overarching plan and much wasted or duplicated effort, with inconsistent and incomplete results. Please let's not iteratorify the library before a policy is in place about how such language changes should be rolled out throughout, and what the actual goals should be. |
While I certainly agree that the question of when and how iterators should be retrofitted to standard library packages is important to resolve consistently, this proposal seems to be doing two things at once:

1. Deciding whether fmt should expose its format-string tokenizer at all.
2. Deciding that the result should be exposed specifically as an iterator.
I would suggest that the discussion here ought to prioritize 1 first, since 1 being declined presumably rules out any design with this use-case in mind, making the iterator question irrelevant, whereas declining 2 while accepting 1 leaves the door open to variants such as |
Reacting then to my own previous comment 😀 I wonder: are there any other languages whose standard library formatting functionality exposes its format-string parsing in this way?

While I can certainly imagine some situations where I might want to do this, they all feel rather esoteric. I'm curious to see if other languages decided this was something worth exposing, and hopefully also then find some examples of real existing uses of it that could demonstrate the value of exposing this API. Exposing the tokenizer directly in this way also implies constraining future evolution of the format syntax. |
I'm not sure of the answer to this question, but I would argue that Go has somewhat set the precedent for strong static analysis within the language ecosystem itself, to the point that other languages are starting to copy what Go does. I've needed this function in the construction of static analysis tools and also as a mechanism to audit security (e.g., verifying that printf functionality is used in a safe and secure way). |
That is a very fair point, @dsnet. In particular your answer got me wondering if the proposed API could replace the inline parser used (indirectly) by go vet's printf check.

I'm not very familiar with that code so I'm definitely not the best person to make that call, but from a superficial skim it seems to me like it would be able to do the same things it's already doing using the tokens from the proposed iterator. |
To be honest, I haven't seen the motivation and purpose of such a design at all. What should I do after parsing out the format token? |
https://docs.python.org/3/library/string.html#string.Formatter.parse is an example.
I can think of a number of places where the standard library offers some runtime utility for strings that break the code/data barrier. Is there one where the interior parsing is exposed this way? |
JavaScript has String.raw, that lets you rebuild a tagged literal. Not quite the same as this, but it can be used for somewhat similar ends. |
My first reaction to the proposal was "Oh I would be able to replace the directive parser in analysis/printf. That would be nice". But it is also the case that anything that standardizes the token sequence would work.
@dsnet Do we have motivation beyond static analysis and/or code rewriting? Both of those are already okay motivations. Just wondering if there are more.
Thinking about how to suggest code edits from these for a second. Suppose I want to suggest that "foo%" is replaced with "foo". How do I turn the resulting error token back into an edit on the original format string?

So it feels like the error cases are a bit underspecified at the moment. The proposal should probably be clear on what the special error values look like so we know how to edit them. What about instead returning a sequence of pairs of |
I'm not sure if we can specify all the error cases without the API being too complicated. That said, the iterated tokens exactly match the input if valid, so you could at least determine exactly where something went wrong with:

var i int
for tok := range fmt.FormatTokens(f) {
	if strings.HasPrefix(tok, "%!") {
		log.Printf("invalid format token after %s", f[:i])
		break
	}
	i += len(tok)
}

We could also have this be |
I wrote a third party parser for this over the weekend and learned more about it than I thought possible (still cleaning it up and having the fuzzer be mean to me so no link yet). There are only 4 syntax errors—BADWIDTH, BADPREC, BADINDEX, and NOVERB—everything else is a runtime error. (Though, tbf, some of those errors can be triggered in multiple ways). For example, "%!" is a syntactically legal verb even though there is not a "!" action to dispatch. |
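To make the syntax/runtime split concrete, a small snippet of my own that exercises a few of these cases (only the classification in the comments is claimed, not the exact output strings):

package main

import "fmt"

func main() {
	// fmt reports problems inline in its output rather than returning an error.
	fmt.Printf("%q\n", fmt.Sprintf("%"))     // NOVERB: '%' with nothing after it (a syntax error)
	fmt.Printf("%q\n", fmt.Sprintf("%d"))    // a runtime error: the argument is missing
	fmt.Printf("%q\n", fmt.Sprintf("%!", 1)) // syntactically a verb, but there is no '!' action to dispatch
}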
Although earlier I'd argued that the question of whether to expose something like this at all was probably more important than the exact representation of the result, the idea of returning position/token pairs got me thinking about the representation anyway. For example, I could imagine several ways to build on it.

I do quite like it as a general convention, though! Retaining the context about where the substring came from along with the substring itself means that it'd be trivial to filter the iterator without losing that context, such as a wrapping iterator that skips the literal tokens and only returns the "verb" tokens while retaining the position of each token for use in error messages. (It's unfortunate that the |
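To make the filtering idea concrete, a minimal sketch of such a wrapping iterator, assuming the proposed fmt.FormatTokens exists and treating anything that starts with "%" other than "%%" as a verb token:

import (
	"fmt"
	"iter"
	"strings"
)

// verbTokens skips literal tokens and yields only the %-directives, each
// paired with its byte offset in the original format string.
func verbTokens(format string) iter.Seq2[int, string] {
	return func(yield func(int, string) bool) {
		off := 0
		for tok := range fmt.FormatTokens(format) { // proposed API, not yet in the standard library
			if strings.HasPrefix(tok, "%") && tok != "%%" {
				if !yield(off, tok) {
					return
				}
			}
			off += len(tok)
		}
	}
}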
Even if there were no errors involved, I really cannot imagine how just splitting strings up like this would be generally useful. It may be sufficient for some cases, but as soon as you want more information from the verb you end up having to redo a lot of the work that had to be done to split it up in the first place, and that's the hard part. |
Found another weird case:
|
The cost of adding this seems highly disproportionate to the value it would provide. |
I don't believe an iterator is appropriate here. Iterators are for unbounded sequences. For something this simple, a slice seems fine. Better, in fact, since a slice has the benefit that index i of the slice corresponds to index i of the arguments. So let's narrow the conversation to one of these:

1. A function returning a slice of plain string tokens.
2. A function returning a slice of structured FormatToken values (as in the proposal's alternative considerations).
With the iterator complexity removed, is either of these now worthwhile? This at least lets us focus on the functionality. Personally I am not convinced. The proposal does not attempt to justify the functionality at all. It merely says "There are use cases ...". What are the use cases? Are they widespread? Do they merit adding complexity to the standard library as opposed to just making a copy of the relevant code? The lack of answers to those questions is probably why I'm not convinced. :-) |
Another use case that came up: splitting logs into literal and dynamic parts. For example, log APIs tend to allow formatting, e.g.:

log.Infof("the result of calc(%q) is %d", args, 5)

We could propose a contract with the user that all of the non-verb parts are literals, which are viewable by anyone. The dynamic parts (meaning any evaluated verb) would be viewable only by those with proper permissions, as they might contain sensitive information. Of course it would be possible for users to add sensitive parts directly to the format string:

log.Infof("the result of calc(%q) = " + str, args)

Which is why it needs to be a contract (and it may make sense to make a custom analyzer warning about this with the given API). |
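A minimal sketch of that contract, assuming the proposed fmt.FormatTokens iterator and treating every token that starts with "%" (other than "%%") as dynamic:

// redactedView renders only the parts of a log line that come from the
// format string itself; every evaluated verb is replaced by a placeholder
// until the viewer's permissions have been checked.
func redactedView(format string) string {
	var sb strings.Builder
	for tok := range fmt.FormatTokens(format) { // proposed API
		switch {
		case !strings.HasPrefix(tok, "%"): // literal text from the source code
			sb.WriteString(tok)
		case tok == "%%":
			sb.WriteString("%")
		default: // evaluated verb: potentially sensitive
			sb.WriteString("[REDACTED]")
		}
	}
	return sb.String()
}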
This proposal has been added to the active column of the proposals project |
I've needed something like this over the years, but the most recent use case for this was something similar to what @aktau said. In my particular application, I was intercepting a formatting string for constructing HTML. In the verbiage of @aktau, the literal components were preserved as-is (i.e., without any HTML escaping), while the dynamic components were escaped (e.g., converting < to &lt;).

The fact that this was executing upon each |
This could also be mitigated by accepting a backing slice (like |
Not every problem one finds needs to be solved in the core library, and in this case the problem is easily done outside, since it's just a matter of parsing the format string, which is easy enough. |
Here is a cut at parsing the format string without package fmt:
It's definitely a little tricky, and it doesn't impose the same limits on literal numbers that package fmt does, and I'm not sure I would call it "easy enough". But I also don't see how it's useful. Suppose it works. Then what? I still don't quite understand what @aktau or @dsnet would do with this code. You can't even assume that %d is safe if the argument is not an int, so you need to do some kind of substitution of arguments, probably by formatting each argument separately, putting the results into the args slice, replacing all the directives with %s, and then invoking a top-level formatting (or bypassing the top-level formatting and just collecting all the literals and escaped texts into a bytes.Buffer or strings.Builder). But formatting each argument value separately is not correct for formats like %[2]s. I don't see how having this information actually helps produce sanitized outputs. I can see that it seems like it would help, but I can't connect the dots for a perfect drop-in fix. And if you're not aiming for perfection, it seems simpler to ignore the format entirely, iterate over the args slice, leave basic integers alone, and replace all other values v with html.EscapeString(fmt.Sprint(v)). That will get %[2]s right. It won't get %-10s right, nor will it get %10d right if the value v has a custom formatter that implements the d verb. But we're not aiming for perfection at this point. I'm not strictly opposed to the API, but it would need to be useful. I still don't see how it's useful. |
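As a sketch of that fallback (my own code, following the description above): leave basic integers alone and pre-escape every other argument before handing the original format string to fmt.Sprintf. As noted, this gets %[2]s right but not custom formatters or width-sensitive directives.

import (
	"fmt"
	"html"
)

// sprintfEscaped formats as usual but HTML-escapes every non-integer
// argument first, ignoring the structure of the format string entirely.
func sprintfEscaped(format string, args ...any) string {
	escaped := make([]any, len(args))
	for i, v := range args {
		switch v.(type) {
		case int, int8, int16, int32, int64,
			uint, uint8, uint16, uint32, uint64, uintptr:
			escaped[i] = v // plain integers cannot carry markup
		default:
			escaped[i] = html.EscapeString(fmt.Sprint(v))
		}
	}
	return fmt.Sprintf(format, escaped...)
}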
I agree, it could only be a part of a solution. What we really want for the use case that @aktau described is a way to understand at what position which arguments are rendered. Let's say we have this code:

customlog.Infof("data for user %v not found %[2]T (%v[2])", userId, err)

We want to be able to redact this data to

This needs to work across different languages, and we would like to be able to show values that are explicitly marked as non-sensitive in the future. The idea is to log the message as a data structure like this (not exactly this, but just to give an idea):

type Sensitivity int

const (
	Unknown Sensitivity = iota
	CodeLiteral
)

type Message struct {
	Sensitivity Sensitivity
	Value       string
}

func EmitLog(msg []Message)

By understanding the format string, we could separate verbs and literals and format each verb individually. That's not the best solution, but it would work. One idea for a better solution would be an API that allows us to understand how arguments are rendered in the output. Here is an API sketch:

type Info struct {
	Pos, End int // Start and end position
	Source   int // The source argument (0 for the format string, 1 for the first argument, ...)
}

func Xprintf(format string, a ...any) (string, []Info)

With that, we could split up the result and treat it however we like. |
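A hedged sketch of how that Info slice could feed the Message structure above, assuming the spans in []Info are non-overlapping and cover the output in order (Xprintf is the hypothetical API sketched above, not an existing function):

// buildMessages splits the rendered output into labeled spans: spans whose
// Source is 0 came from the format string itself; everything else is of
// unknown sensitivity until proven otherwise.
func buildMessages(format string, args ...any) []Message {
	out, info := Xprintf(format, args...)
	msgs := make([]Message, 0, len(info))
	for _, in := range info {
		sens := Unknown
		if in.Source == 0 {
			sens = CodeLiteral
		}
		msgs = append(msgs, Message{Sensitivity: sens, Value: out[in.Pos:in.End]})
	}
	return msgs
}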
@adonovan points out that you could wrap every argument in a custom wrapper type that implements fmt.Formatter and then returns the escaped HTML for each of those. That would work, and it wouldn't require this new function. It looks less and less like we should add this. |
My most recent use case would use this function with something like:

// UnsafeRawf constructs raw HTML text according to a format specifier.
// HTML in f is preserved verbatim, but text produced by % verbs is escaped.
// The only exception is %v or %s formatting a value of the [UnsafeRaw] type,
// which is preserved verbatim.
//
// Example usage:
//
// firstName := "Malfoy"
// lastName := "<script>...</script>"
// UnsafeRawf("%s<br>%s", firstName, lastName) // "Malfoy<br><script>...</script>"
func UnsafeRawf(f string, a ...any) UnsafeRaw {
var sb strings.Builder
fs := format(f)
var argIdx int
var hadIndexedArgs bool
for t := fs.nextToken(); t != ""; t = fs.nextToken() {
verb := t[len(t)-1]
switch {
case t[0] != '%': // string literal
sb.WriteString(t)
continue
case verb == '%': // escaped '%'
sb.WriteByte('%')
continue
case !strings.ContainsAny(t, "[]"): // verb with positional arguments
if argIdx < len(a) {
if raw, isRaw := a[argIdx].(UnsafeRaw); isRaw && (verb == 'v' || verb == 's') {
sb.WriteString(string(raw))
break
}
sb.WriteString(EscapeString(fmt.Sprintf(t, a[argIdx])))
break
}
sb.WriteString(EscapeString(fmt.Sprintf(t))) // deliberate missing args
default: // verb indexed arguments
hadIndexedArgs = true
if strings.HasPrefix(t, "%[") && (strings.HasSuffix(t, "]v") || strings.HasSuffix(t, "]s")) {
n, err := strconv.ParseUint(t[len("%["):len(t)-len("]v")], 10, 64)
if err == nil && uint(n-1) < uint(len(a)) {
if raw, isRaw := a[n-1].(UnsafeRaw); isRaw {
sb.WriteString(string(raw))
break
}
}
}
sb.WriteString(EscapeString(fmt.Sprintf(t, a...)))
}
argIdx++
}
return UnsafeRaw(sb.String())
}

Essentially all arguments that are not a "%v" or "%s" verb on the UnsafeRaw type are escaped. I hadn't considered the fmt.Formatter approach mentioned above. |
@dsnet Why couldn't you intercept the print args as a []any before passing them to a formatter? |
[Edit: this doesn't work: the %T and %p verbs are handled without calling any methods of the argument. --adonovan] Here's a sketch of what I had in mind. (Choosing a separator that doesn't appear within the format string is left as an exercise.) https://go.dev/play/p/xtSw9c96ha9

// Wrappers demonstrates how to intercept printf using fmt.Formatter.
package main
import (
"fmt"
"io"
"strings"
)
func main() {
analyzef("hello %s, %-2.3d\n", "world", 123)
// Prints: "hello " ("%s" world) ", " ("%-2.3d" 123) "\n"
}
// analyze prints the components of the printf formatting operation.
func analyzef(format string, args ...any) {
for _, part := range Parsef(format, args...) {
switch part := part.(type) {
case string:
fmt.Printf("%q ", part)
case Conversion:
fmt.Printf("(%q %v) ", part, part.Arg)
}
}
fmt.Println()
}
// A Conversion represents a single % operation and its operand.
type Conversion struct {
Verb rune
Width int // or -1
Prec int // or -1
Flag string // some of "-+# 0"
Arg any
}
// String prints the conversion in its original form (e.g. "%-2.3d").
func (conv Conversion) String() string {
var out strings.Builder
fmt.Fprintf(&out, "%%%s", conv.Flag)
if conv.Width >= 0 {
fmt.Fprintf(&out, "%d", conv.Width)
}
if conv.Prec >= 0 {
fmt.Fprintf(&out, ".%d", conv.Prec)
}
fmt.Fprintf(&out, "%c", conv.Verb)
return out.String()
}
// Parsef breaks down a printf formatting operation into literal
// (string) and Conversion components.
func Parsef(format string, args ...any) []any {
const sep = "!!" // FIXME: choose a separator that doesn't appear within 'format'
var convs []Conversion
wrappers := make([]any, len(args))
for i, arg := range args {
wrappers[i] = formatFunc(func(st fmt.State, verb rune) {
io.WriteString(st, sep)
wid, ok := st.Width()
if !ok {
wid = -1
}
prec, ok := st.Precision()
if !ok {
prec = -1
}
flag := ""
for _, b := range "-+# 0" {
if st.Flag(int(b)) {
flag += string(b)
}
}
convs = append(convs, Conversion{
Verb: verb,
Width: wid,
Prec: prec,
Flag: flag,
Arg: arg,
})
})
}
s := fmt.Sprintf(format, wrappers...)
// Interleave the literals and the conversions.
var result []any
for i, word := range strings.Split(s, sep) {
if word != "" {
result = append(result, word)
}
if i < len(convs) {
result = append(result, convs[i])
}
}
return result
}
type formatFunc func(fmt.State, rune)
var _ fmt.Formatter = formatFunc(nil)
func (f formatFunc) Format(st fmt.State, verb rune) { f(st, verb) } |
If this is for purely runtime wrapping, maybe there needs to be something tailored to that, like:

func Wrappingf(w io.Writer, format string, args []any, visit func(state State, verb rune, arg any))

that could handle cases like %T and %p. It would still be somewhat awkward to use:

fmt.Wrappingf(w, format, args, func(state fmt.State, verb rune, arg any) {
	fmt.Fprintf(state, fmt.FormatString(state, verb), arg)
})

but it would be doable. |
If you are trying to defend against injection attacks, Alan's solution works as long as you just don't wrap int arguments. There are no injection attacks coming in from '3' anyway. |
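A minimal sketch of that combination, assuming the fmt.Formatter wrapping described above plus the integer exception; fmt.FormatString (Go 1.20+) reconstructs the original directive such as "%-10s":

import (
	"fmt"
	"html"
	"io"
)

// htmlArg wraps a value so that whatever it would have formatted to is
// HTML-escaped. As noted earlier, %T and %p still see the wrapper.
type htmlArg struct{ v any }

func (h htmlArg) Format(st fmt.State, verb rune) {
	s := fmt.Sprintf(fmt.FormatString(st, verb), h.v)
	io.WriteString(st, html.EscapeString(s))
}

// escapeArgs wraps every non-integer argument; integers are left alone
// since they cannot carry an injection payload.
func escapeArgs(args []any) []any {
	out := make([]any, len(args))
	for i, a := range args {
		switch a.(type) {
		case int, int8, int16, int32, int64,
			uint, uint8, uint16, uint32, uint64, uintptr:
			out[i] = a
		default:
			out[i] = htmlArg{a}
		}
	}
	return out
}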
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
Proposal Details
There are use cases for intercepting, detecting, and/or manipulating the format string that eventually gets passed to fmt.Sprintf. I propose addition of:

Example:

For any valid format string, the concatenation of all the tokens is the format string.
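A usage sketch, assuming the fmt.FormatTokens iterator of string tokens that the discussion above relies on:

// roundTrip reassembles a format string from its tokens; per the property
// stated above, the result equals f for any valid format string.
func roundTrip(f string) string {
	var sb strings.Builder
	for tok := range fmt.FormatTokens(f) { // proposed API
		sb.WriteString(tok)
	}
	return sb.String()
}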
Alternative considerations

The iterator returns values of type string, which can make further analysis of the formatting verb harder. As an alternative, we could do:

Thus, users can call methods on the FormatToken to further drill into information encoded in the token. I don't think this API complexity justifies the utility, as most of those methods can be trivially implemented by the user. For example:
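As an illustration of that last point, helpers like these could live in user code, assuming string tokens and the convention that a directive starts with '%' and ends with its verb (a sketch, not text from the proposal):

// isVerb reports whether a token is a %-directive rather than literal text.
func isVerb(tok string) bool { return strings.HasPrefix(tok, "%") && tok != "%%" }

// verbOf returns the verb character of a directive token; verbs are single
// ASCII characters, so the final byte suffices (e.g. 'd' for "%+3d").
func verbOf(tok string) byte { return tok[len(tok)-1] }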