Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(opinions): Add filter linebreaksbr to html #3378

Merged
merged 7 commits into from
Nov 28, 2023
Merged

Conversation

flooie
Copy link
Contributor

@flooie flooie commented Nov 15, 2023

Fix #3377

Fix missing line breaks that sporadically affect our HTML with citations. Add django filter for line breaks where new lines exist and it cleans up our HTML beautifully.

Before
283200255-208ce341-f434-4597-8337-ab103d1cbaef

After
Screenshot 2023-11-15 at 3 00 10 PM

This actually resolves the issue with
bad or missing newlines based on the document
@flooie flooie requested a review from mlissner November 15, 2023 20:01
@mlissner
Copy link
Member

Yeah, I don't know, Bill. You don't want to put zillions of <br> tags into the page. It'll make it slower, but even if that were negligible, it would also just make it uglier? Is there a better way to do this?

@flooie
Copy link
Contributor Author

flooie commented Nov 17, 2023

@mlissner

The problem we are seeing is that the HTML is not rendering correctly when new lines are inside <pre> tags or maybe other types. I'm sorry that I wasn't clear about what is happening here so let me back up and take a second stab at it.

The above example I used for illustration has a stretch of text that looks like this on our website.

liability. See Burlington Industries, Inc. v. Ellerth, 524 U.S. 742(1998); Faragher v. Boca Raton, 524 U.S. 775 (1998). Liability

All one line - no spacing. If you look at the PDF - it should look like this.

liability. See Burlington Industries, Inc. v. Ellerth, 524 U.S. 742
(1998); Faragher v. Boca Raton, 524 U.S. 775 (1998). Liability

with a break after the Citation and a new line starting at (1998).

I'm not an expert in HTML with citations being generated but it's clear what is happening here. Lets look at the html_with_citations content im going to add new lines for ease of reading

...\r\n\r\nliability. See Burlington Industries, Inc. v. Ellerth, </pre>
<span class="citation" data-id="118244">
<a href="/opinion/118244/burlington-industries-inc-v-ellerth/">524 U.S. 742</a></span>
<pre class="inline">\r\n(1998); 
Faragher v. Boca Raton, 
</pre><span class="citation" data-id="118245">
<a href="/opinion/118245/faragher-v-boca-raton/">524 U.S. 775</a>
</span><pre class="inline"> (1998). 
Liability\r\nunder Monell v. New York City Department of Social Services,\r\n</pre>

See the <pre class="inline">\r\n(1998); <- those returns are what are being dropped and causing our runoffs.

@mlissner
Copy link
Member

OK, sorry, I definitely misunderstood how linebreaksbr works. Somehow I thought it was going to convert all whitespace to br's.

I'm a bit concerned though because usually HTML is supposed to ignore line returns and white space. Have you tried this on a bunch of different content types and sources to make sure it doesn't have any weird effects?

@flooie
Copy link
Contributor Author

flooie commented Nov 21, 2023

I've looked at maybe a dozen or so examples, but there are more than that. Though I suspect that was enough to get a sense. I saw nothing other than it nicely processing the HTM - Do you want me to do a much more extensive review?

My expectation is that this is only for scraped opinions.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the first change you made is focused on things that have source of C. That seems mostly OK, because those are mostly PDFs with bad plain text, but even some of those come in as HTML where we don't control the line returns.

The second and third changes below are not from the scrapers and worry me. My concern is that if we scrape some HTML and it has a bunch of line returns in it, we'll convert those line returns to <br> tags and make the content really long without meaning to. In general line returns in HTML should be ignored, but this does the opposite.

So I think:

  • The first one might be fine. At least, for PDFs it should be.
  • The second and third ones worry me and seem like they could cause trouble.

@flooie
Copy link
Contributor Author

flooie commented Nov 22, 2023

great. So lets drop the second and third one - leave the first - which was my primary focus at the start and just keep a watchful eye. its still a very easy and quick revert @mlissner yes?

@mlissner
Copy link
Member

Sounds good.

@flooie
Copy link
Contributor Author

flooie commented Nov 28, 2023

@mlissner this I think is ready for your approval. It failed a web hook test - but I am rerunning and it and would suggest that its likely not related and deals with some other bug in web hooks?

@mlissner mlissner merged commit 24191e5 into main Nov 28, 2023
12 of 13 checks passed
@mlissner mlissner deleted the 3377-fix-text-width branch November 28, 2023 17:17
@mlissner
Copy link
Member

Cool. Let's see how this goes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Text escapes div on some opinion content
2 participants