-
-
Notifications
You must be signed in to change notification settings - Fork 779
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Documenting minimizing diffs for ease of review #1018
Comments
SemBr minimizes this, by ensuring lines will be broken semantically, like in the idealized beginning snippet, rather as if the second code snippit was the initial state, in which case the diffs would invariably be non-minimal (unless you let the source text be much wider than 80 columns, which is a third strategy to deal with this). However, not everyone likes it, and converting existing text to it would just be a lot of churn, unless it was being modified already, so I don't think it's feasible at the moment to specify it (except, maybe, as a non-binding recommendation for new/modified text, if there's consensus). Given there is no concretely defined, much less enforced breaking style, I'm not sure if we can prescriptively specify a single simple strategy for minimizing diffs that will give optimal results in different circumstances. Unless we're going to do so (at least as a recommendation), then perhaps we should just state that line diffs due to hard rewrapping should generally be minimized/avoided when practical, leaving the choice of strategy (SemBr, breaks or relaxing the line width) to the discretion of the reviewer. If, indeed, it needs to be stated.
IMO, but the second example is far "uglier" than the third, as logical sentences are arbitrarily splices across physical source lines. Wrapping is useful in rendered output, but I'd rather have the source (and diffs) reflect the semantic structure |
This was recently discussed at https://discuss.python.org/t/semantic-line-breaks/13874?u=hugovk and I think the consensus was that it should be allowed but not required. Some expressed concern that it's important that documents are still readable as plaintext, and that there should be no requirement to maintain it in the future so text may be edited and reflowed later. |
In general, IMO the minimum guidance here would be to minimize or avoid reflow on unmodified lines when making changes. Using SemBr for the original text avoids this by default, and is a logical breaking strategy for new/modified text that can be used in other cases when adding breaks to avoid reflow, so it could be mentioned as a suggestion in that limited context. However, as mentioned, I don't think we need to prescriptively require any particular strategy as opposed to just specifying that reflow of unmodified lines should be minimized/avoided to the extent practical. |
Text expresses ideas or thoughts with words in a hierarchical structure. Editing ranges from replacing a single character to major replacements of thoughts and structure. The discussion referenced 2 posts above by Hugo included a link to Semantic Line Breaks. I think it pretty good. The key idea is that putting 'one substantial unit of thought' on each line facilitates the common editing operation of re-expressing a unit of thought and perhaps altering it, by localizing the change. To me, the phrase 'semantic line breaks' emphasizes too much the terminators (line breaks) rather than the line content (units of thought). (Note that 'semantics' and 'syntax' are not commonly used outside of communities focused on language structure and meaning.) |
The main motivation for this would just be for ease of reviewing - if the committer decides that the sentences ought to be reflowed to whatever standard instead of minimizing diffs, that'd be fine too. I think it's easier to start with min-diff and reflow if requested, instead of reflowing then changing to min-diff. As an aside, if we have a case like this:
Then if it's a one-word change, I think since there's no guarantee on what format the original text is in, it'd be reasonable to recommend minimizing the diff
over something like a blend of semantic breaking and min-diff, despite it reading a bit better:
Skip, in the thread Hugo posted, raised a good point in that we should separate PRs with content changes and formatting changes. Minimizing diffs ensures that the content changes shine through, and if the raw text starts to become difficult to parse then a PR can be opened that solely reformats the raw text with no content changes. In the end, this would just be a recommendation and not a standard that needs to be followed, and another benefit to minimizing diffs is it keeps a somewhat better commit history to go back to as well. |
To note, that seems to be a rather contrived example, as with arbitrary max-length hard breaks, it is very unlikely that any given change can be cleanly added to it's own line without modifying any existing lines. That would require both that the change be solely an insertion and not include any modification or deletion of existing text, and that the insertion occur between the two words that happen to fall on the arbitrary max-length breakpoint. If we assume the original text had fixed hard-breaks at the standard 79 characters instead of ≈100:
We have a much more typical case when we make the same insertion of Add it to the same lineThis minimizes the diff, at the expense of going over max-length:
Diff outputdiff --git a/test_file.txt b/test_file.txt
index bd0a68605c..b8e03ec30e 100644
--- a/test_file.txt
+++ b/test_file.txt
@@ -1,4 +1,4 @@
First chunk of a very long running sentence, with lines broken at max-length
-to lead into the second chunk of a very long running sentence, with lines
+to lead into the aforementioned second chunk of a very long running sentence, with lines
still broken at max-length into the third chunk of a very long running
sentence. Break in the middle of the line somewhereThis is the minimal diff possible if still keeping max-length, but still much larger than the previous (especially with "smart" diffs, like on GitHub), and looks more odd/arbitrary.
Diff outputdiff --git a/test_file.txt b/test_file.txt
index bd0a68605c..c2cb2d183e 100644
--- a/test_file.txt
+++ b/test_file.txt
@@ -1,4 +1,5 @@
First chunk of a very long running sentence, with lines broken at max-length
-to lead into the second chunk of a very long running sentence, with lines
+to lead into the aforementioned second chunk of a very long running sentence,
+with lines
still broken at max-length into the third chunk of a very long running
sentence. (Of course, we can also reflow to SemBr or the existing arbitrary hard breaks, which is the most pleasing formatting, and in the case of SemBr avoids this problem in the future, but that would be the largest diff). FWIW, I would probably do the first option if I was stuck in this situation, as it minimizes the diff and doesn't overflow by that much, which is less visually bothersome than the break anyway. With SemBrThe main point with SemBr, for text we're writing, rewriting, etc. anyway, is that it mostly avoids having this problem in the first place, and minimizes it if it does happen. It places its breaks at the most semantically meaningful places where insertions are most likely to occur, groups lines semantically and thus localizing changes, allows more "wiggle room" for lines to grow and shrink, and making it straightforward to break a line if needed, and it not affect any others. If this block was broken with SemBr:
The change would apply cleanly, with a minimal diff like the first example, while still remaining within the max-width like the second, and unlike either having the changed line be one and only one atomic semantic unit of text:
Diff outputdiff --git a/test_file.txt b/test_file.txt
index 1714376555..628fd1a537 100644
--- a/test_file.txt
+++ b/test_file.txt
@@ -1,5 +1,5 @@
First chunk of a very long running sentence,
with lines broken at max-length to lead into
-the second chunk of a very long running sentence,
+the aforementioned second chunk of a very long running sentence,
with lines still broken at max-length into
the third chunk of a very long running sentence. Of course, just to be repeat, this demonstrates the benefits of SemBr as a wrapping strategy in minimizing all future reflows and diffs when writing, rewriting or reflowing text; with existing hard-wrapped text in this situation, unless the modifications are comprehensive enough that a reflow is justified, the options are either relax the line length limit, or add an arbitrary break (though SemBr can help determine the most logical location for that latter break). |
Issue #13 also references unnecessary code churn. I'm closing that in preference of this issue. |
FWIW, when I edit documentation I tend to follow a few guidelines:
Footnotes
|
Those are good observations, Ezio. +1 from me. |
@willingc, regarding SemBr, see CAM's #1018 (comment). |
Regarding SemBr: I usually reflow modified lines according to SemBr in my own PRs, and I recommend it when reviewing. Of course, I try to write new documentation using the SemBr style, and I think that is a useful goal to strive for. |
Suppose we have some lines that are close to 80 chars long
and if the first line needs some sort of modification that would push it past 80 chars, we can either restructure the entire paragraph of raw rst so it looks tidier, but resulting in 4 lines of git diffs
or diffs can be minimized (but with an 'uglier' raw rst), such as like
I think that for ease of reviewing, as well as to maintain a more useful git history, that the second is preferred in general but I haven't found it in the style guide, so maybe it can be added there if there's consensus on it?
The text was updated successfully, but these errors were encountered: