-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
format doctests in docstrings #8811
Conversation
|
crates/ruff_python_formatter/resources/test/fixtures/ruff/docstring_code_examples.py
Show resolved
Hide resolved
crates/ruff_python_formatter/resources/test/fixtures/ruff/docstring_code_examples.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks great, and I appreciate the way you broke it down by commit. I ended up reading the commits one-by-one and found the flow very clear.
cb98df2
to
9d466b9
Compare
I ran the docstring code formatter on CPython (after formatting the code without docstring code formatting enabled) and put the results in one commit: BurntSushi/cpython@19d1c85 I did a quick pass through them and everything looks like it passes the eyeball test. My next step is to make sure the doctests still run and pass. |
I checked out the
Then confirmed that I could run some doctests:
And some of that failed, but the Then I checked out my branch with the reformatted code snippets and re-ran the above. The results remain unchanged. So at least in that one specific case that looked potentially odd, reformatting didn't break the doctest. (Specifically, the doctest directive in this example is still picked up.) I haven't been able to figure out how to run all of the doctests yet though. |
@@ -12,6 +16,10 @@ use ruff_python_ast::{self as ast, Expr, Stmt}; | |||
/// between `class C: ...` and `class C(): ...`, which is part of our AST but not `CPython`'s. | |||
/// - Normalize strings. The formatter can re-indent docstrings, so we need to compare string | |||
/// contents ignoring whitespace. (Black does the same.) | |||
/// - The formatter can also reformat code snippets when they're Python code, which can of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the AST normalizer to remove anything that looks like a
code snippet.
Could we instead remove docstrings from the equality check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could, but I think that would potentially reduce the effectiveness of the test. Today, we still test that the substance of the docstring remains the same (modulo whitespace and code snippets). But if we just removed docstrings entirely, we wouldn't be checking anything.
I don't feel too strongly personally.
This changes the AST normalizer to remove anything that looks like a code snippet. The idea here is that, at least in tests, we want to check that reformatting Python code does *not* change its AST. Well, it is supposed to change its AST in some ways, and this normalizer tries to erase those changes. It turns out that reformatting code examples in docstrings will change the AST in even more profound ways. Namely, it can arbitrarily rewrite docstring contents. Because of this, it doesn't really seem feasible to normalize the strings in any way other than to remove anything that looks like a code snippet.
This adds a new internal-only knob to enable formatting of code snippets inside of docstrings.
And also update any extant tests that this new option impacts.
This adds some additional information to test failures that occur when formatting is not idempotent. Specifically, we want to see what formatting options were used.
This augments the Python's formatter context to include state about a docstring's quote style. Namely, the quote style refers to the kind of quotes used for a docstring that contains a code example that is currently being formatted. This state will allow us to choose the correct quote style to use while reformatting code snippets and will avoid writing invalid Python (in most cases).
This splits the docstring function for printing individual lines into a bulkier abstraction, but doesn't otherwise change anything. The idea here is to give the line printing a little more space to breathe for supporting code snippet reformatting. Namely, we will want line printing to have some kind of state for aggregate code snippets to reformat before printing.
This adds a few types for describing some data that helps facilitate formatting code snippets in docstrings. Basically, we have lines, code example lines, and different types of code examples. A passing familiarity with these types will help grok subsequent commits.
This commit adds a small but central component to code snippet formatting in docstrings: it specifically implements the state transitions needed to recognize and collect code snippets from doctests. This means looking for PS1 and PS2 prompts and extracting the code portion of each line. This also introduces a "code example add action" which we will use in a subsequent commit to control the higher level docstring line printer.
This connects all of the pieces together from the previous commit and makes the docstring line printer reformat doctest code snippets. This also includes a new (and possibly first?) recursive call to the formatter, so extra scrutiny there is most appreciated.
It looks like no previous snapshot tests have configured the line ending to anything other than the default, so this wasn't included in the test output. But we would actually like to try and test that line endings are correctly preserved when reformatting code snippets, so we make the option visible in the snapshot.
This test ensures that CRLF line endings are set correctly within a reformatted code snippet.
9d466b9
to
47e6e76
Compare
I've split the commit that adds a user facing config option out into #8854 so that we should be able to merge this PR without any user facing changes. |
All righty, I've filed issues for problems that probably aren't blocking this PR being merged:
I'm going to go ahead and merge this PR. I'd be happy to address any other feedback folks might have! |
This turns `string` into a parent module with a `docstring` sub-module. I arranged things this way because there are parts of the `string` module that the `docstring` module wants to know about (such as a `NormalizedString`). The alternative I think would be to make `docstring` a sibling module and expose more of `string`'s internals. I think I overall like this change because it gives docstring handling a bit more room to breath. It has grown quite a bit with the addition of code snippet formatting. [This was suggested by @charliermarsh.](#8811 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing the excellent documentation and organizing your commits so thoughtfully. It helped a lot when reviewing the PR (also to find savepoints when being interrupted)
Really impressive work. Well done. I've a few smaller comments. The most important one from a correctness point of view is the handling of newlines inside of multiline strings.
I think I'll learn many new English words from your commit messages 😆. So far:
- rejigger
- grok
let indent = indent.to_string(); | ||
let code = code.to_string(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for converting these to String
rather than storing borrowed &str
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to my other answer, at least with respect to code
, I think when I was writing the types it wasn't obvious to me that the code portion of a line would necessarily be a proper substring of it. But I can't think of any contrary case.
// non-doctest line. | ||
// | ||
// [1]: https://github.com/python/cpython/blob/0ff6368519ed7542ad8b443de01108690102420a/Lib/doctest.py#L733 | ||
if ps1_indent != ps2_indent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addmit I haven't read the code in full, but would it be sufficient to check if the indents have the same column/character length as done in the Lexer or is it necessary that the strings match exactly?
See Indentation
in our lexer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Python doctest
module actually requires that the indent be made purely of ASCII space characters. Then it somewhat circuitously takes the length of the indentation in characters, reconstitutes the indentation as a string and then checks that all PS2 prompt lines have the same indentation prefix: https://github.com/python/cpython/blob/0ff6368519ed7542ad8b443de01108690102420a/Lib/doctest.py#L727-L733
Technically we are a little more relaxed here because we allow any kind of whitespace in the indentation. But I preserved the "indentation should be byte-for-byte equivalent" check.
So bottom line is that it's hard for me to say whether this is necessary or not. Probably not. But I do think it is pretty simple? Are you concerned about the costs of carrying the indentation string around? I think for code snippet formatting it is probably a lot less of a concern than when parsing arbitrary Python. The size scales are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mainly trying to understand the semantics to ensure we are as strict as Python when it comes to handling indentation. Thanks for the explanation.
.iter() | ||
.map(|line| &*line.code) | ||
.collect::<Vec<&str>>() | ||
.join("\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about line endings inside of multiline strings? Ideally, we would preserve these. Although we may deliberately decide not to and encourage users to instead use explicit line endings. But changing the line endings inside of strings is theoretically a semantic change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Charlie had a similar comment above. The thinking here is that while I'm using \n
to assemble a code snippet, that snippet is then fed into the formatter and the line endings used correspond to the user's setting. There is a test for example that this writes out CRLF line endings in code snippets when the user has configured CRLF line endings.
Did I understand you concern correctly here? I feel like I might be missing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there are places in the docstring handling (that existed before my changes) that seem to assume line feed terminators? For example:
ruff/crates/ruff_python_formatter/src/expression/string/docstring.rs
Lines 237 to 238 in 2ade84a
// We know that the normalized string has \n line endings. | |
self.offset += line.line.text_len() + "\n".text_len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the problem is not new due to your changes. It only became evident that this might now be a problem when thinking about code snippets in docstrings (it shouldn't be a problem for non-code text).
a = """A multiline string
with windows line endings
"""
b = c
My concern is (was) that changing the newlines inside of the multiline string could be a semantic change. I tried to read through the Python documentation and only found:
In triple-quoted literals, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the literal. (A “quote” is the character used to open the literal, i.e. either ' or ".) source
What I understand from the spec is that newlines are retained. Although running the above in a print
on my mac normalizes the newlines to \n
.
I then tested what the about for
a = b"""A multiline string
with windows line endings
"""
from binascii import hexlify
print(hexlify(a))
and it seems python normalizes the newlines even for binary strings to \n
. So the retain only means that the newlines are preserved but not in the form they're present in the source code.
So I guess this is not a problem after all (and ruff and Black both normalize newlines inside multiline strings)
// a docstring. As we fix corner cases over time, we can perhaps | ||
// remove this check. See the `doctest_invalid_skipped` tests in | ||
// `docstring_code_examples.py` for when this check is relevant. | ||
let wrapped = match self.quote_style { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we short-circuit if the source string doesn't contain any triple quotes (or escaped triple quotes)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think you mentioned this in the tracking issue for this. I think you're correct. And if so, yes, I believe we could short circuit. I'm not sure though if we want to keep this check as-is here for now as a conservative posture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. My only concern is performance because our parser isn't very fast today (I believe it's about 30% or more of the overall formatting time). An alternative could be to only lex the code and see if there are any lexer errors. But I don't know if that's sufficient to detect the kind of errors that could be introduced.
But agree, I don't think it's a high priority. Just trying to brainstorm a few ideas with the hope that there might be a low hanging perf improvement.
If a test failure occurs, this at least makes it a little clearer that a code snippet has been removed. Ref: #8811 (comment)
This PR contains a few small clean-ups that are responses to @MichaReiser's review of my #8811 PR.
If a test failure occurs, this at least makes it a little clearer that a code snippet has been removed. Ref: #8811 (comment)
If a test failure occurs, this at least makes it a little clearer that a code snippet has been removed. Ref: #8811 (comment)
This PR does the plumbing to make a new formatting option, `docstring-code-format`, available in the configuration for end users. It is disabled by default (opt-in). It is opt-in at least initially to reflect a conservative posture. The intent is to make it opt-out at some point in the future. This was split out from #8811 in order to make #8811 easier to merge. Namely, once this is merged, docstring code snippet formatting will become available to end users. (See comments below for how we arrived at the name.) Closes #7146 ## Test Plan Other than the standard test suite, I ran the formatter over the CPython and polars projects to ensure both that the result looked sensible and that tests still passed. At time of writing, one issue that currently appears is that reformatting code snippets trips the long line lint: https://github.com/BurntSushi/polars/actions/runs/7006619426/job/19058868021
@JacobCoffee Yeah I think those issues are about applying |
Summary
This PR adds opt-in support for formatting doctests in docstrings. This reflects initial support and it is intended to add support for Markdown and reStructuredText Python code blocks in the future. But I believe this PR lays the groundwork, and future additions for Markdown and reST should be less costly to add.
It's strongly recommended to review this PR commit-by-commit. The last few commits in particular implement the bulk of the work here and represent the denser portions.
Some things worth mentioning:
Closes #7146
Test Plan