Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustdoc - implement #[doc(codeblock_attr = ...)] #84597

Closed
wants to merge 2 commits into from

Conversation

jsgf
Copy link
Contributor

@jsgf jsgf commented Apr 26, 2021

This PR implements a #[doc(codeblock_attr = )] specification which allows all the codeblock attrs in a doc comment to be overridden. For example, if the doc comment has come from outside the Rust ecosystem (eg from bindgen, protoc, etc) then it may appear to have code blocks in its doc comments, but they are not intended to be in Rust. They will either fail to parse, causing spurious warnings, or worse, parse and do unintended things.

To address this, this PR allows all the code block attributes to be overridden for both ``` and indent-based code blocks (which don't currently have a way to specify attributes). For example:

#[doc = r#"
This is a random piece of code

    10 print "I don't have a closing quote
"#]
#[doc(codeblock_attr = "text")] // force indented code block to be treated as plain text
struct Foo;

I have some possible discussion points:

  • Name of codeblock_attr
  • Should there also be an "extend" counterpart. I don't have a need for it, in the use-cases I'm thinking of, but it's worth considering (not for this PR, but making sure this PR doesn't preclude a future extension).
  • module/crate-level override/default? Doing it at a per-doc comment level seems simplest, so long as you can update your code generator to emit it. Doing it at a larger scope just means updating a prolog/template.

Personally I think this PR lands on the right answers for all of these, but I'm open to other thoughts.

https://internals.rust-lang.org/t/blanket-tags-for-doc-code-blocks/14567/ for irlo discussion.

Relevant issue: #59867

cc @joshtriplett @ehuss

Restructure the loop to avoid awkward mid-loop continues which
complicate the flow control, and the duplicated code. Result is simpler
and should be functionally identical.
The `codeblock_attr` parameter for `#[doc]` overrides the codeblock
attributes - that is, the text `foo` after ```` ```foo ````.

This also allows a codeblock attribute to be applied to indentation
code blocks.

This is intended for documentation coming from non-Rust-aware sources
which may contain code examples, but are not written in Rust. Such
examples - at best - will cause compilation warnings from rustdoc,
or could compile but have unintended consequences.
@rust-highfive
Copy link
Collaborator

r? @jackh726

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 26, 2021
@jsgf
Copy link
Contributor Author

jsgf commented Apr 26, 2021

Oh, need to update the docs.

@joshtriplett
Copy link
Member

This looks great, and thank you for adding the detailed tests for an item-level attr overriding a module-level one.

@jsgf
Copy link
Contributor Author

jsgf commented Apr 26, 2021

@joshtriplett - The module-level attr only affects the module's #[doc] attribute - it doesn't have any effect on the docs within the module. I should probably add a test to make sure the module-level one doesn't affect the docs within the module.

@joshtriplett
Copy link
Member

Oh, I misread the test then. I would have expected the attribute at the module level to affect any items within the module, unless the item itself had an overriding attribute or an individual code block did. That would make this easier to apply to an entire module at once.

@jsgf
Copy link
Contributor Author

jsgf commented Apr 26, 2021

Yeah, but a module can have its own doc comment, so there's an ambiguity about whether you want to just override the module-level codeblocks or all the codeblocks within the module. That could be resolved by having a second doc attribute parameter, but that's getting more complex (esp since we'd need to define how it interacts with the per-doc parameters).

It just seemed simpler to keep it as a doc-by-doc thing. It enables the basic functionality, and could be extended if this form is too awkward in practice.

@jyn514
Copy link
Member

jyn514 commented Apr 26, 2021

Has this gone through an RFC? Rustdoc has a lot of #[doc] attributes already.

r? @GuillaumeGomez

@jyn514
Copy link
Member

jyn514 commented Apr 26, 2021

@jsgf what should happen if there are multiple codeblocks for the same item? Will this apply to all of them? If the item is re-exported, will codeblock_attr apply to comments added on the re-export?

I don't think codeblock_attr is a good name, "attribute" has a specific meaning in rust and this isn't it - I would call these "lang strings" or something. Maybe codeblock_lang is better?

@jyn514 jyn514 added T-rustdoc needs-fcp This change is insta-stable, so needs a completed FCP to proceed. labels Apr 26, 2021
Comment on lines +548 to +550
if let Some(c) = doc_codeblock_attr
.chars()
.find(|&c| c == ',' || c == '`' || c == '"' || c == '\'' || c.is_whitespace())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is used somewhere else in rustdoc - could you find it and factor out an "is_lang_string_separator" function? Getting it wrong means rustdoc can't fix it without breaking backwards compatibility.

);
return false;
}
if doc_codeblock_attr.starts_with(' ') || doc_codeblock_attr.ends_with(' ') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will never run, you already give an error above if the lang string contains whitespace.

@jsgf
Copy link
Contributor Author

jsgf commented Apr 26, 2021

@jyn514

what should happen if there are multiple codeblocks for the same item? Will this apply to all of them? If the item is re-exported, will codeblock_attr apply to comments added on the re-export?

Yes, it overrides all the codeblocks in a given doc comment. I don't know how re-export works; I didn't test that. Does re-export also re-export the doc tests? Mechanically the override is attached to the same doc attr node that has the docs themselves, so I would imagine that they'd follow along on any re-export.

I don't think codeblock_attr is a good name, "attribute" has a specific meaning in rust and this isn't it - I would call these "lang strings" or something. Maybe codeblock_lang is better?

Yeah I agree about "attributes" but apparently that's the markdown name for them. pulldown_cmark avoids giving them a name at all other than "the language", and rustdoc internally uses "syntax". I'm happy to rename it to codeblock_lang.

@jyn514
Copy link
Member

jyn514 commented Apr 26, 2021

Yes, it overrides all the codeblocks in a given doc comment. I don't know how re-export works; I didn't test that. Does re-export also re-export the doc tests? Mechanically the override is attached to the same doc attr node that has the docs themselves, so I would imagine that they'd follow along on any re-export.

Yes, any re-export inherits all the attributes of the original item, see merge_attrs:

fn merge_attrs(

But anyway my question isn't how it currently behaves - re-exports are the most complicated part of rustdoc and I fully expect the first draft to be wrong - my question is how you think it should behave.

@GuillaumeGomez
Copy link
Member

I'm shared on this: on one side, I'd say to simply re-format the text before putting it into your rust code, that's up to you, not to rustdoc to fix your doc comment. On another, I can see how it could be used, so let's say that for now, I'm skeptical but not completely opposed to this feature.

Now a few issues remain here. @jyn514 already underlined the issue in case you have multiple code blocks in one doc comment. In this case, it's all or nothing and I don't see how to go around that limitation.

Another problem to be seen is: is it really that useful? It feels like we're using a bazooka to kill a fly here. Markdown provides already everything you need if you don't want your code to be interpreted as a rust one. It's being done everywhere, so why not in your case? You can easily overload bindgen output after all.

And yes, I think it'd need to go at least through an RFC here to see if the rest of the team is up with this feature. Because that's also something to be taken into consideration: new feature means more code to be supported, tested and kept up-to-date. Considering how small the gain is, I'd personally say it's not worth it, but if people think otherwise, why not. Hence the need to go through an RFC first here.

@jyn514
Copy link
Member

jyn514 commented Apr 26, 2021

Another problem to be seen is: is it really that useful? It feels like we're using a bazooka to kill a fly here. Markdown provides already everything you need if you don't want your code to be interpreted as a rust one. It's being done everywhere, so why not in your case? You can easily overload bindgen output after all.

Actually yeah, why is this the easiest solution? Couldn't you do the equivalent of re.sub(docs, "^ ", " ") in your build.rs? (edit: github is mangling my spaces, but there are 4 in the first arg and 3 in the second)

@jsgf
Copy link
Contributor Author

jsgf commented Apr 26, 2021

@jyn514 @GuillaumeGomez:

OK, so my use case is this:

  • I'm specifically focusing on the fbthrift compiler, which generates the Rust code via moustache templates: https://github.com/facebook/fbthrift/blob/master/thrift/compiler/generate/templates/rust/lib.rs.mustache (this doesn't currently have #[doc] support in it, but it will land shortly).
  • The build is being driven by Buck in our case, so build.rs isn't involved. However for Cargo builds it would probably use a build.rs, and I'd want it to work in a uniform way regardless.
  • Thrift supports many languages, and the formatting of the doc comments, uh, reflects that diversity, so unfortunately there's no uniform syntax or markup language used. But enough of it is markdown-ish that treating it that way often gives reasonable results. Either way, even if it is markdown with code blocks, they're almost certainly not Rust.

In the common case in Rust code, the docs are being written for that specific piece of Rust code, so any example or doctest will also be in Rust. But in this case, we're effectively importing documentation from an external source, which only incidentally supports Rust. The goal here is to have a uniform way to neutralize any unintended consequences of doing this. At the very least it could result in spurious compiler warnings about malformed code, but it could extend to a possible attack vector (nobody would expect a thrift or protobuf file could result in arbitrary code, but a doc comment and associated doctests allow arbitrary code to be injected).

I'd really like to avoid any kind of ad-hoc regex approach - if nothing else I don't see how one could reliably implement it. You'd have to at least tokenize the .rs file to get the full docstring text, then apply the regex to it, and then regenerate the source with the updated text. I don't see how a re.sub(...) approach would work directly on the source file itself given all the possible ways of expressing the same doc comment (several kinds of cooked comment format, and numerous string literal formats and ways of quoting newlines).

In my case, in principle the thrift compiler could parse the comment as markdown while generating the Rust code and do the cleanups there. But in practice this is awkward because it's written in C++ and there's no guarantee that its markdown implementation would match Rust's.

So I do think this is the right place to implement this feature. But I'm open to suggestions about details.

@jsgf
Copy link
Contributor Author

jsgf commented Apr 26, 2021

@jyn514

But anyway my question isn't how it currently behaves - re-exports are the most complicated part of rustdoc and I fully expect the first draft to be wrong - my question is how you think it should behave.

I think the codeblock_lang parameter should just stick with their corresponding docs regardless of how they're re-exported. Since there's no inheritance mechanism, there are no ambiguities about they should be resolved (as there would be if there were crate or module-wide overrides).

@jyn514
Copy link
Member

jyn514 commented Apr 26, 2021

I think the codeblock_lang parameter should just stick with their corresponding docs regardless of how they're re-exported. Since there's no inheritance mechanism, there are no ambiguities about they should be resolved (as there would be if there were crate or module-wide overrides).

Sorry, I don't follow - do you mean that re-exporting them shouldn't apply to the new docs, or that it should?

@jyn514
Copy link
Member

jyn514 commented Apr 26, 2021

Ok, looking at #59867, the attribute suggested there is much more limited in scope:

So what I need is an opt-in flag that you can declare module-wide (and children modules and such) so that any code block without a declared language is not treated as rust code by default, so then it won't become a doctest.

I think an attribute is better so you can have it in code, so maybe something like #![doc(default_lang = "text")] would work. Making this general and customizable per-item just seems like a headache. If you still want to opt-in to doc-tests for some items, you can use ```rust like normal.

@jsgf
Copy link
Contributor Author

jsgf commented Apr 26, 2021

What I'm saying is that if you have something like:

#[doc = r#"
Random doc from unknown source
```
hi_some_random_code()
```
"#]
#[doc(codeblock_lang = "text")]
pub struct SomeImportantStruct { ... }

then if you re-export SomeImportantStruct and it's associated documentation, then the #doc[codeblock_lang = "text"] should be considered an intrinsic part of the documentation wherever it ends up. (Though if I've misunderstood what re-exporting docs means then please point me in the right direction. I can see how re-exporting docs as docs makes sense, but it's unclear to me whether it makes sense to talk about re-exporting a doctest.)

My concern with any kind of module or crate-level #[doc(default_lang = ...)] is that I'm not sure how that would interact with re-exporting - should it only apply to the original doc, or should it follow the re-export? If it's re-exported to a context with a different default_lang, which should take precedence?

We can definitely come up with something reasonable, but it seems like more up-front complexity. I prefer the clarity of explicitly tagging each #[doc] attribute, since it gives us the basic capability in a relatively simple to explain (and implement) way. If it turns out that this is a useful capability but it's awkward to apply to every doc attribute, then we can extend it with a defaulting mechanism later.

@jyn514
Copy link
Member

jyn514 commented Apr 26, 2021

My concern with any kind of module or crate-level #[doc(default_lang = ...)] is that I'm not sure how that would interact with re-exporting - should it only apply to the original doc, or should it follow the re-export? If it's re-exported to a context with a different default_lang, which should take precedence?

The attribute from the original crate should take precedence, the same way intra-doc links when re-exported use the scope they were originally written in: https://doc.rust-lang.org/nightly/rustdoc/linking-to-items-by-name.html#warnings-re-exports-and-scoping. So for example this would not run any doctest:

// crate1
#![doc(default_lang = "text")]
pub struct S {}

// crate2
```
assert!(true);
```
pub use crate1::S;

if you re-export SomeImportantStruct and it's associated documentation, then the #doc[codeblock_lang = "text"] should be considered an intrinsic part of the documentation wherever it ends up.

Sorry, my question was unclear. That behavior looks right to me, but it wasn't what I was asking about, I meant this case:

mod inner {
    /// ```
    /// assert!(true);
    /// ```
    pub fn f() {}
}

#[doc(codeblock_lang = "text")]
/// ```
/// this is invalid syntax
/// ```
pub use inner::f; // should the assert!(true) run or not?

I prefer the clarity of explicitly tagging each #[doc] attribute

It's impossible to tag each attribute, because that means you'd have to write each attribute as #[doc = "..."]. I think you meant each item.

If it turns out that this is a useful capability but it's awkward to apply to every doc attribute, then we can extend it with a defaulting mechanism later.

That seems really complicated, rustdoc already has enough bugs around re-exports. If we limit this to a crate-level setting, then it's simple: use the default of the crate the item was defined in.

@jsgf
Copy link
Contributor Author

jsgf commented Apr 27, 2021

Hm, I'm not sure I really understand what the doc comment on the use statement really does - I couldn't get it to appear in the generated docs. But it does emit a doctest, which seems separate and distinct from the one on the mod inner doctest, so I would say that its codeblock_attr only applies to it, and not the inner one.

That seems really complicated, rustdoc already has enough bugs around re-exports. If we limit this to a crate-level setting, then it's simple: use the default of the crate the item was defined in.

Hm, well I'll think about it and maybe give the PR a new spin. Crate-level seems too coarse, but maybe module level.

@jyn514
Copy link
Member

jyn514 commented Apr 27, 2021

I would rather not do module level because modules can also be re-exported. It also introduces ambiguity about whether the attribute applies to all items in the module or just the module itself.

@GuillaumeGomez
Copy link
Member

@jyn514 @GuillaumeGomez:

OK, so my use case is this:

* I'm specifically focusing on the fbthrift compiler, which generates the Rust code via moustache templates: https://github.com/facebook/fbthrift/blob/master/thrift/compiler/generate/templates/rust/lib.rs.mustache (this doesn't currently have `#[doc]` support in it, but it will land shortly).

* The build is being driven by Buck in our case, so build.rs isn't involved. However for Cargo builds it would probably use a build.rs, and I'd want it to work in a uniform way regardless.

* Thrift supports _many_ languages, and the formatting of the doc comments, uh, reflects that diversity, so unfortunately there's no uniform syntax or markup language used. But enough of it is markdown-ish that treating it that way often gives reasonable results. Either way, even if it is markdown with code blocks, they're almost certainly not Rust.

...

That definitely sounds like an issue on your side. You can always use a markdown parser on your side if you don't want to rely on regex. But again, that brings a lot of changes on rustdoc for a very specific issue that I'm pretty sure isn't common.

@jyn514
Copy link
Member

jyn514 commented Apr 27, 2021

Again, I think this should go through an RFC first. As we've found yesterday and today, this is a very large design space and I'd like to hear input from other potential users.

Hm, I'm not sure I really understand what the doc comment on the use statement really does - I couldn't get it to appear in the generated docs.

Oh ugh, I just checked and you're right, it works with cross-crate re-exports but not intra-crate re-exports. I'll open an issue.

(I think this reinforces my point that re-exports are the most complicated part of rustdoc.)

@Mark-Simulacrum
Copy link
Member

It does seem like at least the warnings rustdoc currently emits should be allowable with an attribute (including at the crate level). It may be reasonable for rustdoc to not treat code blocks implicitly as rust code if they don't parse as such, too, including not running doc tests.

The case of valid rust code that gets run and is not intended to be run (e.g., deleting important files or whatever) seems much less "real"; but even so, could be mitigated with existing cfgs it sounds like.

@jyn514
Copy link
Member

jyn514 commented May 4, 2021

It does seem like at least the warnings rustdoc currently emits should be allowable with an attribute (including at the crate level).

#84587

@bors
Copy link
Contributor

bors commented May 4, 2021

☔ The latest upstream changes (presumably #84707) made this pull request unmergeable. Please resolve the merge conflicts.

@jyn514
Copy link
Member

jyn514 commented May 18, 2021

@jsgf do you still think this is useful now that #84587 has been merged?

@crlf0710 crlf0710 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 5, 2021
@crlf0710
Copy link
Member

crlf0710 commented Jun 5, 2021

@jsgf Ping from triage! Any updates on this?

@jyn514 jyn514 added T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. and removed T-rustdoc-temp labels Jun 22, 2021
@crlf0710
Copy link
Member

crlf0710 commented Jul 9, 2021

@jsgf I'm gonna close this due to inactivity. Feel free to reopen or create a new pr when you've got time to work on this again. Thanks!

@crlf0710 crlf0710 added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 9, 2021
@crlf0710 crlf0710 closed this Jul 9, 2021
@jsgf
Copy link
Contributor Author

jsgf commented Jul 9, 2021

@crlf0710 Yeah, I think this is stalled on needing an RFC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-fcp This change is insta-stable, so needs a completed FCP to proceed. S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants