Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documenting minimizing diffs for ease of review #1018

Open
slateny opened this issue Jan 1, 2023 · 12 comments
Open

Documenting minimizing diffs for ease of review #1018

slateny opened this issue Jan 1, 2023 · 12 comments
Labels
enhancement guide-new content Additions; New content or section needed needs: decision Needs consensus decision from core devs

Comments

@slateny
Copy link
Contributor

slateny commented Jan 1, 2023

Suppose we have some lines that are close to 80 chars long

first text ... end first text.
second text ... end second text.
third text ... end third text.

and if the first line needs some sort of modification that would push it past 80 chars, we can either restructure the entire paragraph of raw rst so it looks tidier, but resulting in 4 lines of git diffs

first text plus some more ...
end first text. second text ...
end second text. third text ...
end third text.

or diffs can be minimized (but with an 'uglier' raw rst), such as like

first text plus some more ...
... end first text.
second text ... end second text.
third text ... end third text.

I think that for ease of reviewing, as well as to maintain a more useful git history, that the second is preferred in general but I haven't found it in the style guide, so maybe it can be added there if there's consensus on it?

@CAM-Gerlach
Copy link
Member

SemBr minimizes this, by ensuring lines will be broken semantically, like in the idealized beginning snippet, rather as if the second code snippit was the initial state, in which case the diffs would invariably be non-minimal (unless you let the source text be much wider than 80 columns, which is a third strategy to deal with this). However, not everyone likes it, and converting existing text to it would just be a lot of churn, unless it was being modified already, so I don't think it's feasible at the moment to specify it (except, maybe, as a non-binding recommendation for new/modified text, if there's consensus).

Given there is no concretely defined, much less enforced breaking style, I'm not sure if we can prescriptively specify a single simple strategy for minimizing diffs that will give optimal results in different circumstances. Unless we're going to do so (at least as a recommendation), then perhaps we should just state that line diffs due to hard rewrapping should generally be minimized/avoided when practical, leaving the choice of strategy (SemBr, breaks or relaxing the line width) to the discretion of the reviewer. If, indeed, it needs to be stated.

or diffs can be minimized (but with an 'uglier' raw rst), such as like

IMO, but the second example is far "uglier" than the third, as logical sentences are arbitrarily splices across physical source lines. Wrapping is useful in rendered output, but I'd rather have the source (and diffs) reflect the semantic structure

@hugovk
Copy link
Member

hugovk commented Jan 1, 2023

This was recently discussed at https://discuss.python.org/t/semantic-line-breaks/13874?u=hugovk and I think the consensus was that it should be allowed but not required.

Some expressed concern that it's important that documents are still readable as plaintext, and that there should be no requirement to maintain it in the future so text may be edited and reflowed later.

@CAM-Gerlach
Copy link
Member

In general, IMO the minimum guidance here would be to minimize or avoid reflow on unmodified lines when making changes. Using SemBr for the original text avoids this by default, and is a logical breaking strategy for new/modified text that can be used in other cases when adding breaks to avoid reflow, so it could be mentioned as a suggestion in that limited context. However, as mentioned, I don't think we need to prescriptively require any particular strategy as opposed to just specifying that reflow of unmodified lines should be minimized/avoided to the extent practical.

@terryjreedy
Copy link
Member

Text expresses ideas or thoughts with words in a hierarchical structure. Editing ranges from replacing a single character to major replacements of thoughts and structure. The discussion referenced 2 posts above by Hugo included a link to Semantic Line Breaks. I think it pretty good. The key idea is that putting 'one substantial unit of thought' on each line facilitates the common editing operation of re-expressing a unit of thought and perhaps altering it, by localizing the change. To me, the phrase 'semantic line breaks' emphasizes too much the terminators (line breaks) rather than the line content (units of thought). (Note that 'semantics' and 'syntax' are not commonly used outside of communities focused on language structure and meaning.)

@slateny
Copy link
Contributor Author

slateny commented Jan 11, 2023

The main motivation for this would just be for ease of reviewing - if the committer decides that the sentences ought to be reflowed to whatever standard instead of minimizing diffs, that'd be fine too. I think it's easier to start with min-diff and reflow if requested, instead of reflowing then changing to min-diff.

As an aside, if we have a case like this:

First chunk of a very long running sentence, with lines broken at max-length to lead into the
second chunk of a very long running sentence, with lines still broken at max-length into the
third chunk of a very long running sentence.

Then if it's a one-word change, I think since there's no guarantee on what format the original text is in, it'd be reasonable to recommend minimizing the diff

First chunk of a very long running sentence, with lines broken at max-length to lead into the
aforementioned
second chunk of a very long running sentence, with lines still broken at max-length into the
third chunk of a very long running sentence.

over something like a blend of semantic breaking and min-diff, despite it reading a bit better:

First chunk of a very long running sentence, with lines broken at max-length to lead into the
aforementioned second chunk of a very long running sentence,
with lines still broken at max-length into the
third chunk of a very long running sentence.

Skip, in the thread Hugo posted, raised a good point in that we should separate PRs with content changes and formatting changes. Minimizing diffs ensures that the content changes shine through, and if the raw text starts to become difficult to parse then a PR can be opened that solely reformats the raw text with no content changes.

In the end, this would just be a recommendation and not a standard that needs to be followed, and another benefit to minimizing diffs is it keeps a somewhat better commit history to go back to as well.

@CAM-Gerlach
Copy link
Member

CAM-Gerlach commented Jan 14, 2023

Then if it's a one-word change, I think since there's no guarantee on what format the original text is in, it'd be reasonable to recommend minimizing the diff

First chunk of a very long running sentence, with lines broken at max-length to lead into the
aforementioned
second chunk of a very long running sentence, with lines still broken at max-length into the
third chunk of a very long running sentence.

To note, that seems to be a rather contrived example, as with arbitrary max-length hard breaks, it is very unlikely that any given change can be cleanly added to it's own line without modifying any existing lines. That would require both that the change be solely an insertion and not include any modification or deletion of existing text, and that the insertion occur between the two words that happen to fall on the arbitrary max-length breakpoint.

If we assume the original text had fixed hard-breaks at the standard 79 characters instead of ≈100:

First chunk of a very long running sentence, with lines broken at max-length
to lead into the second chunk of a very long running sentence, with lines
still broken at max-length into the third chunk of a very long running
sentence.

We have a much more typical case when we make the same insertion of aforementioned, where we either have to either:

Add it to the same line

This minimizes the diff, at the expense of going over max-length:

First chunk of a very long running sentence, with lines broken at max-length
to lead into the aforementioned second chunk of a very long running sentence, with lines
still broken at max-length into the third chunk of a very long running
sentence.

Diff output
diff --git a/test_file.txt b/test_file.txt
index bd0a68605c..b8e03ec30e 100644
--- a/test_file.txt
+++ b/test_file.txt
@@ -1,4 +1,4 @@
 First chunk of a very long running sentence, with lines broken at max-length
-to lead into the second chunk of a very long running sentence, with lines
+to lead into the aforementioned second chunk of a very long running sentence, with lines
 still broken at max-length into the third chunk of a very long running
 sentence.

Break in the middle of the line somewhere

This is the minimal diff possible if still keeping max-length, but still much larger than the previous (especially with "smart" diffs, like on GitHub), and looks more odd/arbitrary.

First chunk of a very long running sentence, with lines broken at max-length
to lead into the aforementioned second chunk of a very long running sentence,
with lines
still broken at max-length into the third chunk of a very long running
sentence.

Diff output
diff --git a/test_file.txt b/test_file.txt
index bd0a68605c..c2cb2d183e 100644
--- a/test_file.txt
+++ b/test_file.txt
@@ -1,4 +1,5 @@
 First chunk of a very long running sentence, with lines broken at max-length
-to lead into the second chunk of a very long running sentence, with lines
+to lead into the aforementioned second chunk of a very long running sentence,
+with lines
 still broken at max-length into the third chunk of a very long running
 sentence.

(Of course, we can also reflow to SemBr or the existing arbitrary hard breaks, which is the most pleasing formatting, and in the case of SemBr avoids this problem in the future, but that would be the largest diff).

FWIW, I would probably do the first option if I was stuck in this situation, as it minimizes the diff and doesn't overflow by that much, which is less visually bothersome than the break anyway.

With SemBr

The main point with SemBr, for text we're writing, rewriting, etc. anyway, is that it mostly avoids having this problem in the first place, and minimizes it if it does happen. It places its breaks at the most semantically meaningful places where insertions are most likely to occur, groups lines semantically and thus localizing changes, allows more "wiggle room" for lines to grow and shrink, and making it straightforward to break a line if needed, and it not affect any others.

If this block was broken with SemBr:

First chunk of a very long running sentence,
with lines broken at max-length to lead into
the second chunk of a very long running sentence,
with lines still broken at max-length into
the third chunk of a very long running sentence.

The change would apply cleanly, with a minimal diff like the first example, while still remaining within the max-width like the second, and unlike either having the changed line be one and only one atomic semantic unit of text:

First chunk of a very long running sentence,
with lines broken at max-length to lead into
the aforementioned second chunk of a very long running sentence,
with lines still broken at max-length into
the third chunk of a very long running sentence.

Diff output
diff --git a/test_file.txt b/test_file.txt
index 1714376555..628fd1a537 100644
--- a/test_file.txt
+++ b/test_file.txt
@@ -1,5 +1,5 @@
 First chunk of a very long running sentence,
 with lines broken at max-length to lead into
-the second chunk of a very long running sentence,
+the aforementioned second chunk of a very long running sentence,
 with lines still broken at max-length into
 the third chunk of a very long running sentence.

Of course, just to be repeat, this demonstrates the benefits of SemBr as a wrapping strategy in minimizing all future reflows and diffs when writing, rewriting or reflowing text; with existing hard-wrapped text in this situation, unless the modifications are comprehensive enough that a reflow is justified, the options are either relax the line length limit, or add an arbitrary break (though SemBr can help determine the most logical location for that latter break).

@willingc
Copy link
Collaborator

Related to #12. In general, I believe that our guidance would likely be for folks to use their best judgment to minimize code/doc churn and to only reflow or change line breaks when it is a significant improvement. I'm going to close #12 about "paragraph reflow".

@willingc
Copy link
Collaborator

Issue #13 also references unnecessary code churn. I'm closing that in preference of this issue.

@willingc willingc added guide-new content Additions; New content or section needed needs: decision Needs consensus decision from core devs enhancement labels Oct 11, 2023
@ezio-melotti
Copy link
Member

ezio-melotti commented Oct 12, 2023

FWIW, when I edit documentation I tend to follow a few guidelines:

  • reflowing is generally fine: an oddly-indented paragraph will stick around for a long time and be read by more people than just a couple of reviewers once
  • after creating a PR, you can write review comments to guide and help reviewers, pointing out -- among other things -- if a chunk got reflowed and only has changes in the first line, or if a block got moved somewhere else
  • for more extensive changes that affect several paragraphs, it usually helps to make two separate commits -- one for the actual changes and one for the reflow -- and point this out in a review comment so that reviewers know to review them separately
  • doing this in two separate PRs is also an option, and makes it even easier for reviewers (but if the change has to be backported to other branches, having two separate commits in the same PR might be easier)
  • changes in indentation and single words are properly highlighted in GitHub, so nowadays they are easy to spot
  • GitHub also has an option to hide whitespace changes, making it easier to spot actual changes when the code has indentation changes, possibly mixed with actual changes
  • when adding a newline, try to add it in a reasonable place (e.g. try not to add it between the subject and the verb it follows, or between an article and a noun1)
  • try to make good use of the available space: don't write too many short lines, but also don't use all the 80 chars to the end on every line -- stopping a bit earlier is generally a good idea and leaves a little wiggle room that helps preventing future reflowings
  • going a few chars over 80 is sometimes also ok to prevent reflowings

Footnotes

  1. This is somewhat similar in spirit to SemBr, but not as strict or comprehensive. SemBr tends to "chunkify" the text a bit too much for my taste, and doesn't use the available vertical space too effectively IMHO.

@erlend-aasland
Copy link
Contributor

Those are good observations, Ezio. +1 from me.

@erlend-aasland
Copy link
Contributor

@willingc, regarding SemBr, see CAM's #1018 (comment).

@erlend-aasland
Copy link
Contributor

Regarding SemBr: I usually reflow modified lines according to SemBr in my own PRs, and I recommend it when reviewing. Of course, I try to write new documentation using the SemBr style, and I think that is a useful goal to strive for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement guide-new content Additions; New content or section needed needs: decision Needs consensus decision from core devs
Projects
None yet
Development

No branches or pull requests

7 participants