-
Notifications
You must be signed in to change notification settings - Fork 188
refactor: clean_html() to improve HTML sanitization #4663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
7b874ce to
be0988c
Compare
| cleaned = content.replace(' ', '') | ||
| soup = BeautifulSoup(cleaned, 'lxml') | ||
|
|
||
| LIST_TAGS = ['ul', 'ol'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for moving this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it inside the function to keep the scope limited, but I can move it back to module-level as before for consistency.
| return cleaned | ||
| markdown_text = html_converter.handle(str(soup)).strip() | ||
| cleaned = markdown.markdown(markdown_text) | ||
| cleaned = cleaned.replace('<hr />', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is hr tag replaced? That can impact the content/html added by the editors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had initially removed hr / tag to match previous clean_html() behavior in older code, but I agree that removing it could unintentionally strip valid editor content.
| markdown_text = html_converter.handle(str(soup)).strip() | ||
| cleaned = markdown.markdown(markdown_text) | ||
| cleaned = cleaned.replace('<hr />', '') | ||
| cleaned = re.sub(r'([^\s>])(<a\b)', r'\1 \2', cleaned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the code is searching for <a tag with no space before it and then adding a space explicitly between the content and tag, right? While it might work, it does not exactly fix the problem. Let's say the content had multiple spaces between text and tag, it won't restore to its original value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You’re right — this regex only fixes the “no space” case and won’t restore the original spacing.
I’ll change the approach to handle both missing space and normalize multiple spaces using BeautifulSoup post-processing instead of regex.
| if is_list_with_dir_attr_present: | ||
| for tag in LIST_TAGS: | ||
| cleaned = cleaned.replace(f'<{tag}>', f'<{tag} dir="rtl">') | ||
| cleaned = re.sub(r'\n\s*\n', '\n', cleaned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this cleaning needed? IF the strip is called on the next line, it will remove the extra spaces from the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was intended to collapse multiple consecutive blank lines in the middle of the HTML, not just trim ends. However, I see that in most cases strip() is enough. I’ll remove it unless we find cases where mid-content blank lines cause formatting issues
|
@nbalne rebase your branch with main and also add the SS of before and after testing results. |
| ) | ||
| @ddt.data( | ||
| ( | ||
| '<p><em>The content of this course also forms part of the six-month online<a href="https://example.com">Example Link</a></em></p>', # pylint: disable=line-too-long |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this line is duplicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first one is our input we can see the difference at before the anchor tag there is no space is reserved and second one is for our expected result. there is space is reserved.
75b94c6 to
c281863
Compare
|
@mphilbrick211 here is another PR that the 2U team is waiting on. Thanks! |
|
@nbalne rebase your branch with main |
| ) | ||
| @ddt.data( | ||
| ( | ||
| '<p><em>The content of this course also forms part of the six-month online<a href="https://example.com">Example Link</a></em></p>', # pylint: disable=line-too-long |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
online<a href="https://example.com">Example Link</a> should be online <a href="https://example.com">Example Link</a>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
c281863 to
a3048fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a3048fa#diff-dbd6e02082248457db038d62841a53ca2f88d65904e3bc54465a69126e2cd4b2R885
Are you sure spacing should not be there before anchor tag "online <a href="
a3048fa to
7a3719e
Compare
|
@nbalne rebase the branch with main |
7a3719e to
8c81b05
Compare
updated |
e2cdad0 to
71dd59a
Compare
| ) | ||
| @ddt.unpack | ||
| def test_clean_html(self, content, expected): | ||
| """ Verify the method removes unnecessary HTML attributes. """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nbalne why the docstring is removed maybe you want to update it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
| """ | ||
| if not self.is_p_tag_with_dir: | ||
| super().handle_tag(tag, attrs, start) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nbalne try not to unnecessarily update the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
|
|
||
| cleaned = str(soup) | ||
| # Need to clean empty <b> and <p> tags which are converted to <hr/> by html2text | ||
| cleaned = cleaned.replace('<p><b></b></p>', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nbalne what are you doing here ? why making these code changes what difference it is making?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ankit-sonata updated
| is_list_with_dir_attr_present = False | ||
|
|
||
| cleaned = content.replace(' ', '') # Keeping the removal of nbsps for historical consistency | ||
| # Parse the HTML using BeautifulSoup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nbalne same here why it is updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ankit-sonata updated
6ad4ad2 to
a0363c0
Compare
| @ddt.data( | ||
| ( | ||
| '<p><em>The content of this course also forms part of the six-month online <a href="https://example.com">Example Link</a></em></p>', # pylint: disable=line-too-long | ||
| '<p><em>The content of this course also forms part of the six-month online <a href="https://example.com">Example Link</a></em></p>' # pylint: disable=line-too-long |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify what is the difference between these two <p> tag lines? They look identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ankit-sonata The difference between the two strings is that in the input string, there is no space before the tag (the space is missing), whereas in the expected output, a space is included.
I will correct the input string to reflect proper spacing.
e8dbf5d to
aebba93
Compare
31a4912 to
8924626
Compare
98f646e to
eda8fad
Compare
eda8fad to
fcb7012
Compare
This PR refactors the clean_html() utility to enhance HTML cleaning while ensuring consistent formatting with generated content.
ticket for this : https://2u-internal.atlassian.net/browse/PROD-4415
Key changes:
Improved HTML normalization:
IN TEST CASAES

Before Code Update Test SS
After Code Update Test SS
