-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Remove JavaScript from HTML reader output #313
fix: Remove JavaScript from HTML reader output #313
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending one tweak to the CHANGELOG. Thanks!
CHANGELOG.md
Outdated
@@ -1,4 +1,4 @@ | |||
## 0.5.0 | |||
## 0.5.0-dev1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add this under 0.5.1-dev1
and add it as a new entry above 0.5.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually looks like there's a test failing. Please also add a new test that includes a script tags and tests that the script content isn't included in the output.
There is indeed! I converted it to draft until I resolve it. |
As it turns out, itertext has "with_tail=True" by default, and performing the same manually with "iter" requires manually adding the tail
…into fix/remove_scripts_from_html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a test to test_html.py
to test that the script tag content gets removed?
Will do |
Or rather, let 'make tidy' do it for me ;)
Should be ready now. Apologies for forgetting to include the new test earlier. The test case is simple: it just checks whether the output elements contain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with the test added. Thanks!
Closes #149
Hello!
Pull Request overview
Bug
See #149 for details - it contains the following snippet:
Which outputs:
Details
The bug was caused by the
_construct_text
function, which was called using a<td>
tag. Within this tag, there is a<script>
tag, containing the JS. The_construct_text
function will happily loop through this<script>
tag and consider all of its contents as "text".This fix verifies whether the tag that we're looping over is not a
<script>
tag, but recognize that this is a bit of a bandaid solution on a bigger problem: there may be other tags for which we don't want the text to be included in the HTML reader output.However, for the time being, this is a major improvement.
After this PR