Skip to content

Conversation

@tomeon
Copy link

@tomeon tomeon commented Nov 9, 2025

Intended for cases like scripts, which may lack extensions indicating their file/language type.

An incidental change is the introduction of the format.Matcher interface and various types implementing this interface. The idea is to encapsulate various bits of logic:

  1. Does this matcher match the supplied path (for the matcher-specific sense of "match")?
  2. Should this matcher be ignored (e.g. because it has no configured glob patterns)?
  3. Is this matcher for including or excluding files?

I do not have very much experience in Go, so I'm not certain whether this approach is a good idea or, if so, whether it is implemented sensibly. The intentions are:

  1. To simplify the use of matchers (assembling a CompositeMatcher is a bit verbose/unwieldy, but using it is IMO straightforward -- one call to Wants(path) for each path under consideration), and
  2. To provide a basis for implementing additional inclusion/exclusion logic.

Thanks in advance for your consideration!

Copy link
Collaborator

@jfly jfly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Have you seen the previous discussion about this in #33?

I'd prefer the approach originally proposed by @zimbatm, as it's designed to have a minimal performance impact (only actually open the file if it's executable and has no extension). Does this PR force us to read every file in a repo?

@tomeon tomeon force-pushed the mimetype-detection branch from c0e672c to 176a18f Compare November 11, 2025 12:26
@tomeon
Copy link
Author

tomeon commented Nov 11, 2025

Have you seen the previous discussion about this in #33?

D'oh 🤦. I have now; thanks :)

Does this PR force us to read every file in a repo?

I believe the answer is "no", unless the user has configured a formatter with includes = [] and an excludes (and/or global excludes) that doesn't exclude anything: the glob-based matchers are invoked before the MIME type matchers, so a given file is opened and read only if (a) it has not already been rejected by a glob matcher and (b) the relevant MIME type lists are non-empty. I've also updated this PR to include caching of detected MIME types.

[O]nly actually open the file if it's executable and has no extension

Would it make sense to implement this as an opt-out switch? E.g mimetypes-consider-all-files = true?

@jfly
Copy link
Collaborator

jfly commented Nov 11, 2025

(Full disclosure: I still haven't read the code, sorry.) But perhaps it would be useful to split the conversation into 2 questions:

  1. Which files should we read in order to apply a heuristic to determine the "file type"?
  2. What should that file contents based heuristic be?

For 1), I see no reason to consider files other than files both have no extension and are executable. Can you think of a situation this would not handle well?

For 2), I we have 2 options:

I wonder if we're better off writing something ourselves. The mimetype library is 6k+ lines of go code that I haven't read, and it handles a whole bunch of things we don't care about (image files, archive formats, etc).

@zimbatm
Copy link
Member

zimbatm commented Nov 14, 2025

I like that you just went ahead and implemented a solution.

To solve the executable matching, I believe the simplest solution would be to parse the shebangs and extract which executable will be running. With a bit of heuristics to handle #!/usr/bin/env.

In that case we could add a include_shebang = [ "ruby" ].

Would you be up for implementing something like that instead?

@tomeon tomeon force-pushed the mimetype-detection branch from 176a18f to 8e4ae84 Compare November 21, 2025 14:01
@tomeon
Copy link
Author

tomeon commented Nov 21, 2025

If matching on the parsed shebangs of extension-less executable scripts is precisely the feature set that this project is prepared to accept, then I am happy to rework this PR in the manner described:

  1. Matching such scripts is my use-case, and apparently the same is true of (or at least the same use-case has occurred to) several other folks,
  2. At least some MIME type detection suites use shebangs for detecting certain types of script, and
  3. I've done this sort of thing before.

For what it's worth, I can think of some use-cases that fail one or more of these strictures:

  1. git-sh-setup-style shell libraries that aren't executable because they are meant to be sourced rather than run as scripts, and
  2. Files like anything-sync-daemon.in that have an extension that indicates something other than their implementation language (in this case, the fact that the file requires preprocessing by the build system), which could be especially troublesome in polyglot projects where the same extension may be used by files implemented in different languages.

Of course, the fact that I can think of these use-cases is no strong indication that any current or potential treefmt users actually have them (nor is the fact that I haven't thought of other use-cases any indication that users don't have them). But I think it's worth considering the risk that eventually accommodating such use-cases might require introducing either backward incompatibilities or configuration parameters along the lines of the above-suggested mimetypes-consider-all-files (which aren't terribly ergonomic, IMO).

I'd therefore like to float an idea inspired in part by this comment on #33: using text/template filters executed in the context of walk.File structures to select and reject files. I've implemented this in the lastest force-pushed commit of this PR, along with some helper methods on walk.File like Shebang(), InterpreterName(), IsExecutable(), and LooksLikeScript() (is executable, has a shebang, has no extension), plus the template functions fnmatch (glob matching) and rematch (regexp matching).

This permits configurations like the following:

[formatter.mylanguage]
command = "mycommand"
select = ["{{ and .LooksLikeScript (rematch `^possibly-versioned-exe[[:digit:]]*$` .InterpreterName) }}"]

Please let me know what you think of this idea. Thanks!

This changeset also introduces some new methods on `walk.File`,
including a method for extracting a shebang (if present), a method for
extracting the interpreter portion of the shebang, a method indicating
whether or not the file is executable, and a convenience method that
indicates if the file "looks like" a script.
@tomeon tomeon force-pushed the mimetype-detection branch from 8e4ae84 to 7bb0563 Compare November 21, 2025 14:40
@jfly
Copy link
Collaborator

jfly commented Dec 2, 2025

Thanks for the examples! Comments inline.

For what it's worth, I can think of some use-cases that fail one or more of these strictures:

  1. git-sh-setup-style shell libraries that aren't executable because they are meant to be sourced rather than run as scripts, and

Would these even have shebangs? I wouldn't expect shell libraries to have them, I would expect them to have file extensions. The file in git itself notably doesn't have a shebang and does have an extension: https://github.com/git/git/blob/master/git-sh-setup.sh.

  1. Files like anything-sync-daemon.in that have an extension that indicates something other than their implementation language (in this case, the fact that the file requires preprocessing by the build system), which could be especially troublesome in polyglot projects where the same extension may be used by files implemented in different languages.

Good example! To extend @zimbatm's original files_executable proposal, we could make it a map from extension to shebang executables. For example:

files_executable = { "" = [ "bash" ], "*.in" = [ "bash" ] }

I'd therefore like to float an idea inspired in part by #33 (comment): using text/template filters executed in the context of walk.File structures to select and reject files

This is a very cool idea! It has some nice properties, and is very flexible. One downside is that go becomes less of an implementation detail. I'm curious what @zimbatm and @brianmcgee think. I personally would probably prefer something simpler and less flexible (like the original files_executable proposal, or the modified one above.

@tomeon tomeon changed the title RFC: include and exclude files by detecting their MIME type RFC: include and exclude files by shebang contents and other criteria Dec 8, 2025
@tomeon
Copy link
Author

tomeon commented Dec 8, 2025

Would these even have shebangs? I wouldn't expect shell libraries to have them, I would expect them to have file extensions.

The point is that not all shell scripts without extensions are executable. If you are looking for an example stored in source in this form, here's one. Such scripts may indeed have shebangs: for instance, I've used shebangs in non-executable scripts to inform ShellCheck of the target shell dialect, and to inform my editor that I want shell syntax highlighting.

To extend @zimbatm's original files_executable proposal, we could make it a map from extension to shebang executables:

This doesn't account for the fact that *.in files may not be executable. anything-sync-daemon.in is itself an example of this: it is stored with mode 0644 but installed with mode 0755.

If the maintainers of this project stipulate that executable, (possibly?) extension-less scripts are the only use-case they wish to support, then I am happy to rework this PR. But I can't agree that this is the only use-case, or the only use-case worth supporting.

@jfly
Copy link
Collaborator

jfly commented Dec 8, 2025

The whole discussion here is about how opinionated treefmt should be here. Your latest idea with go templates is an elegant way to achieve maximal flexibility, but I'd prefer treefmt take a more opinionated stance on how we expect scripts to be written. Specifically:

  1. Scripts that are intended to be executed should have a shebang and the executable bit.
  2. Libraries should have extensions that reflect the language they're written in (.bash, .zsh, .py, etc).

You're absolutely correct that this does not cover all use cases. If projects cannot (or prefer not) to structure their code this way, they can still use treefmt's include mechanism to explicitly include those exceptions (ooh would be interesting if we could put multiple treefmt.toml files in a project to scope rules by directory).

If the maintainers of this project stipulate that executable, (possibly?) extension-less scripts are the only use-case they wish to support, then I am happy to rework this PR

@tomeon, thank you for staying engaged throughout this somewhat bike-sheddy conversation. I will see if I can get the other maintainers to weigh in so we can give you a clear "path to acceptance" of this PR.

@brianmcgee
Copy link
Member

@tomeon I'd like to reiterate what @jfly said, thanks for hanging in there with this 🙏 . My focus has been elsewhere these past few weeks.

I'm going to give this some proper brain time this morning and have a look at what you implemented.

Simplifies the Matcher approach, removing the complicated Matcher types in favour of simple functions.

Signed-off-by: Brian McGee <brian@bmcgee.ie>
@brianmcgee
Copy link
Member

@tomeon, I spent some time reviewing the approach and pushed a commit that simplifies the implementation.

I like how flexible the template approach is. It's a nice progressive enhancement users can reach for when they need something more powerful than just globs.

Where I'm undecided yet is the potential for issues or user confusion resulting from the interplay of the two mechanisms and the fact that they can now get themselves into "a lot more trouble" with templates. This could be addressed with clearer logging, perhaps.

Some of what @jfly talks about also resonates.

TLDR: It's an intriguing approach; I'm undecided whether it's the right one yet. I'd like to let it marinate a bit more and get @zimbatm to weigh in, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants