RFC: include and exclude files by shebang contents and other criteria #648

tomeon · 2025-11-09T17:01:48Z

Intended for cases like scripts, which may lack extensions indicating their file/language type.

An incidental change is the introduction of the format.Matcher interface and various types implementing this interface. The idea is to encapsulate various bits of logic:

Does this matcher match the supplied path (for the matcher-specific sense of "match")?
Should this matcher be ignored (e.g. because it has no configured glob patterns)?
Is this matcher for including or excluding files?

I do not have very much experience in Go, so I'm not certain whether this approach is a good idea or, if so, whether it is implemented sensibly. The intentions are:

To simplify the use of matchers (assembling a CompositeMatcher is a bit verbose/unwieldy, but using it is IMO straightforward -- one call to Wants(path) for each path under consideration), and
To provide a basis for implementing additional inclusion/exclusion logic.

Thanks in advance for your consideration!

jfly

Thanks for the contribution! Have you seen the previous discussion about this in #33?

I'd prefer the approach originally proposed by @zimbatm, as it's designed to have a minimal performance impact (only actually open the file if it's executable and has no extension). Does this PR force us to read every file in a repo?

tomeon · 2025-11-11T12:28:28Z

Have you seen the previous discussion about this in #33?

D'oh 🤦. I have now; thanks :)

Does this PR force us to read every file in a repo?

I believe the answer is "no", unless the user has configured a formatter with includes = [] and an excludes (and/or global excludes) that doesn't exclude anything: the glob-based matchers are invoked before the MIME type matchers, so a given file is opened and read only if (a) it has not already been rejected by a glob matcher and (b) the relevant MIME type lists are non-empty. I've also updated this PR to include caching of detected MIME types.

[O]nly actually open the file if it's executable and has no extension

Would it make sense to implement this as an opt-out switch? E.g mimetypes-consider-all-files = true?

jfly · 2025-11-11T18:04:40Z

(Full disclosure: I still haven't read the code, sorry.) But perhaps it would be useful to split the conversation into 2 questions:

Which files should we read in order to apply a heuristic to determine the "file type"?
What should that file contents based heuristic be?

For 1), I see no reason to consider files other than files both have no extension and are executable. Can you think of a situation this would not handle well?

For 2), I we have 2 options:

Write something ourselves (as proposed in Handle scripts with no file extension #33)
OR, use the https://github.com/gabriel-vasile/mimetype library you've added in this PR

I wonder if we're better off writing something ourselves. The mimetype library is 6k+ lines of go code that I haven't read, and it handles a whole bunch of things we don't care about (image files, archive formats, etc).

zimbatm · 2025-11-14T17:34:31Z

I like that you just went ahead and implemented a solution.

To solve the executable matching, I believe the simplest solution would be to parse the shebangs and extract which executable will be running. With a bit of heuristics to handle #!/usr/bin/env.

In that case we could add a include_shebang = [ "ruby" ].

Would you be up for implementing something like that instead?

tomeon · 2025-11-21T14:28:35Z

If matching on the parsed shebangs of extension-less executable scripts is precisely the feature set that this project is prepared to accept, then I am happy to rework this PR in the manner described:

Matching such scripts is my use-case, and apparently the same is true of (or at least the same use-case has occurred to) several other folks,
At least some MIME type detection suites use shebangs for detecting certain types of script, and
I've done this sort of thing before.

For what it's worth, I can think of some use-cases that fail one or more of these strictures:

git-sh-setup-style shell libraries that aren't executable because they are meant to be sourced rather than run as scripts, and
Files like anything-sync-daemon.in that have an extension that indicates something other than their implementation language (in this case, the fact that the file requires preprocessing by the build system), which could be especially troublesome in polyglot projects where the same extension may be used by files implemented in different languages.

Of course, the fact that I can think of these use-cases is no strong indication that any current or potential treefmt users actually have them (nor is the fact that I haven't thought of other use-cases any indication that users don't have them). But I think it's worth considering the risk that eventually accommodating such use-cases might require introducing either backward incompatibilities or configuration parameters along the lines of the above-suggested mimetypes-consider-all-files (which aren't terribly ergonomic, IMO).

I'd therefore like to float an idea inspired in part by this comment on #33: using text/template filters executed in the context of walk.File structures to select and reject files. I've implemented this in the lastest force-pushed commit of this PR, along with some helper methods on walk.File like Shebang(), InterpreterName(), IsExecutable(), and LooksLikeScript() (is executable, has a shebang, has no extension), plus the template functions fnmatch (glob matching) and rematch (regexp matching).

This permits configurations like the following:

[formatter.mylanguage]
command = "mycommand"
select = ["{{ and .LooksLikeScript (rematch `^possibly-versioned-exe[[:digit:]]*$` .InterpreterName) }}"]

Please let me know what you think of this idea. Thanks!

cmd/init/init.toml

This changeset also introduces some new methods on `walk.File`, including a method for extracting a shebang (if present), a method for extracting the interpreter portion of the shebang, a method indicating whether or not the file is executable, and a convenience method that indicates if the file "looks like" a script.

jfly · 2025-12-02T04:44:43Z

Thanks for the examples! Comments inline.

For what it's worth, I can think of some use-cases that fail one or more of these strictures:

git-sh-setup-style shell libraries that aren't executable because they are meant to be sourced rather than run as scripts, and

Would these even have shebangs? I wouldn't expect shell libraries to have them, I would expect them to have file extensions. The file in git itself notably doesn't have a shebang and does have an extension: https://github.com/git/git/blob/master/git-sh-setup.sh.

Files like anything-sync-daemon.in that have an extension that indicates something other than their implementation language (in this case, the fact that the file requires preprocessing by the build system), which could be especially troublesome in polyglot projects where the same extension may be used by files implemented in different languages.

Good example! To extend @zimbatm's original files_executable proposal, we could make it a map from extension to shebang executables. For example:

files_executable = { "" = [ "bash" ], "*.in" = [ "bash" ] }

I'd therefore like to float an idea inspired in part by #33 (comment): using text/template filters executed in the context of walk.File structures to select and reject files

This is a very cool idea! It has some nice properties, and is very flexible. One downside is that go becomes less of an implementation detail. I'm curious what @zimbatm and @brianmcgee think. I personally would probably prefer something simpler and less flexible (like the original files_executable proposal, or the modified one above.

tomeon · 2025-12-08T13:32:25Z

Would these even have shebangs? I wouldn't expect shell libraries to have them, I would expect them to have file extensions.

The point is that not all shell scripts without extensions are executable. If you are looking for an example stored in source in this form, here's one. Such scripts may indeed have shebangs: for instance, I've used shebangs in non-executable scripts to inform ShellCheck of the target shell dialect, and to inform my editor that I want shell syntax highlighting.

To extend @zimbatm's original files_executable proposal, we could make it a map from extension to shebang executables:

This doesn't account for the fact that *.in files may not be executable. anything-sync-daemon.in is itself an example of this: it is stored with mode 0644 but installed with mode 0755.

If the maintainers of this project stipulate that executable, (possibly?) extension-less scripts are the only use-case they wish to support, then I am happy to rework this PR. But I can't agree that this is the only use-case, or the only use-case worth supporting.

jfly · 2025-12-08T15:57:15Z

The whole discussion here is about how opinionated treefmt should be here. Your latest idea with go templates is an elegant way to achieve maximal flexibility, but I'd prefer treefmt take a more opinionated stance on how we expect scripts to be written. Specifically:

Scripts that are intended to be executed should have a shebang and the executable bit.
Libraries should have extensions that reflect the language they're written in (.bash, .zsh, .py, etc).

You're absolutely correct that this does not cover all use cases. If projects cannot (or prefer not) to structure their code this way, they can still use treefmt's include mechanism to explicitly include those exceptions (ooh would be interesting if we could put multiple treefmt.toml files in a project to scope rules by directory).

If the maintainers of this project stipulate that executable, (possibly?) extension-less scripts are the only use-case they wish to support, then I am happy to rework this PR

@tomeon, thank you for staying engaged throughout this somewhat bike-sheddy conversation. I will see if I can get the other maintainers to weigh in so we can give you a clear "path to acceptance" of this PR.

brianmcgee · 2025-12-12T10:41:09Z

@tomeon I'd like to reiterate what @jfly said, thanks for hanging in there with this 🙏 . My focus has been elsewhere these past few weeks.

I'm going to give this some proper brain time this morning and have a look at what you implemented.

Simplifies the Matcher approach, removing the complicated Matcher types in favour of simple functions. Signed-off-by: Brian McGee <brian@bmcgee.ie>

brianmcgee · 2025-12-12T13:08:20Z

@tomeon, I spent some time reviewing the approach and pushed a commit that simplifies the implementation.

I like how flexible the template approach is. It's a nice progressive enhancement users can reach for when they need something more powerful than just globs.

Where I'm undecided yet is the potential for issues or user confusion resulting from the interplay of the two mechanisms and the fact that they can now get themselves into "a lot more trouble" with templates. This could be addressed with clearer logging, perhaps.

Some of what @jfly talks about also resonates.

TLDR: It's an intriguing approach; I'm undecided whether it's the right one yet. I'd like to let it marinate a bit more and get @zimbatm to weigh in, too.

jfly reviewed Nov 10, 2025

View reviewed changes

tomeon force-pushed the mimetype-detection branch from c0e672c to 176a18f Compare November 11, 2025 12:26

tomeon force-pushed the mimetype-detection branch from 176a18f to 8e4ae84 Compare November 21, 2025 14:01

tomeon commented Nov 21, 2025

View reviewed changes

cmd/init/init.toml Outdated Show resolved Hide resolved

tomeon force-pushed the mimetype-detection branch from 8e4ae84 to 7bb0563 Compare November 21, 2025 14:40

tomeon changed the title ~~RFC: include and exclude files by detecting their MIME type~~ RFC: include and exclude files by shebang contents and other criteria Dec 8, 2025

fixup! feat: include/exclude with text/template filters

fbc4434

Simplifies the Matcher approach, removing the complicated Matcher types in favour of simple functions. Signed-off-by: Brian McGee <brian@bmcgee.ie>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RFC: include and exclude files by shebang contents and other criteria #648

RFC: include and exclude files by shebang contents and other criteria #648

Uh oh!

tomeon commented Nov 9, 2025

Uh oh!

jfly left a comment

Uh oh!

tomeon commented Nov 11, 2025

Uh oh!

jfly commented Nov 11, 2025

Uh oh!

zimbatm commented Nov 14, 2025

Uh oh!

tomeon commented Nov 21, 2025

Uh oh!

Uh oh!

jfly commented Dec 2, 2025

Uh oh!

tomeon commented Dec 8, 2025

Uh oh!

jfly commented Dec 8, 2025 •

edited

Loading

Uh oh!

brianmcgee commented Dec 12, 2025

Uh oh!

brianmcgee commented Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

RFC: include and exclude files by shebang contents and other criteria #648

Are you sure you want to change the base?

RFC: include and exclude files by shebang contents and other criteria #648

Uh oh!

Conversation

tomeon commented Nov 9, 2025

Uh oh!

jfly left a comment

Choose a reason for hiding this comment

Uh oh!

tomeon commented Nov 11, 2025

Uh oh!

jfly commented Nov 11, 2025

Uh oh!

zimbatm commented Nov 14, 2025

Uh oh!

tomeon commented Nov 21, 2025

Uh oh!

Uh oh!

jfly commented Dec 2, 2025

Uh oh!

tomeon commented Dec 8, 2025

Uh oh!

jfly commented Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brianmcgee commented Dec 12, 2025

Uh oh!

brianmcgee commented Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jfly commented Dec 8, 2025 •

edited

Loading