-
-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(opinions): Add filter linebreaksbr to html #3378
Conversation
This actually resolves the issue with bad or missing newlines based on the document
Yeah, I don't know, Bill. You don't want to put zillions of |
The problem we are seeing is that the HTML is not rendering correctly when new lines are inside The above example I used for illustration has a stretch of text that looks like this on our website. liability. See Burlington Industries, Inc. v. Ellerth, 524 U.S. 742(1998); Faragher v. Boca Raton, 524 U.S. 775 (1998). Liability All one line - no spacing. If you look at the PDF - it should look like this. liability. See Burlington Industries, Inc. v. Ellerth, 524 U.S. 742 with a break after the Citation and a new line starting at (1998). I'm not an expert in HTML with citations being generated but it's clear what is happening here. Lets look at the html_with_citations content im going to add new lines for ease of reading
See the |
OK, sorry, I definitely misunderstood how linebreaksbr works. Somehow I thought it was going to convert all whitespace to br's. I'm a bit concerned though because usually HTML is supposed to ignore line returns and white space. Have you tried this on a bunch of different content types and sources to make sure it doesn't have any weird effects? |
I've looked at maybe a dozen or so examples, but there are more than that. Though I suspect that was enough to get a sense. I saw nothing other than it nicely processing the HTM - Do you want me to do a much more extensive review? My expectation is that this is only for scraped opinions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the first change you made is focused on things that have source of C. That seems mostly OK, because those are mostly PDFs with bad plain text, but even some of those come in as HTML where we don't control the line returns.
The second and third changes below are not from the scrapers and worry me. My concern is that if we scrape some HTML and it has a bunch of line returns in it, we'll convert those line returns to <br>
tags and make the content really long without meaning to. In general line returns in HTML should be ignored, but this does the opposite.
So I think:
- The first one might be fine. At least, for PDFs it should be.
- The second and third ones worry me and seem like they could cause trouble.
great. So lets drop the second and third one - leave the first - which was my primary focus at the start and just keep a watchful eye. its still a very easy and quick revert @mlissner yes? |
Sounds good. |
@mlissner this I think is ready for your approval. It failed a web hook test - but I am rerunning and it and would suggest that its likely not related and deals with some other bug in web hooks? |
Cool. Let's see how this goes. |
Fix #3377
Fix missing line breaks that sporadically affect our HTML with citations. Add django filter for line breaks where new lines exist and it cleans up our HTML beautifully.
Before
After