Skip to content

Conversation

@nbalne
Copy link

@nbalne nbalne commented Aug 28, 2025

This PR refactors the clean_html() utility to enhance HTML cleaning while ensuring consistent formatting with generated content.

ticket for this : https://2u-internal.atlassian.net/browse/PROD-4415

Key changes:

  • Removed non-breaking spaces ( ) to prevent unwanted spacing in output.
  • Preserved ul and ol tags dir="rtl" attributes by temporarily removing and restoring them after processing.

Improved HTML normalization:

  • Converted HTML → Markdown → HTML to remove unsupported attributes, inline styles, and normalize tag structure.
  • Ensured no line wrapping in converted Markdown (bodywidth=None).
  • Fixed tag spacing issue where anchor tags were stuck to preceding words/tags.
  • Removed unnecessary hr / tags.
  • Collapsed multiple blank lines into a single line for cleaner output.
  • Returned final HTML with trimmed whitespace for consistent output formatting.

IN TEST CASAES
Before Code Update Test SS
image (1)

After Code Update Test SS
Clean_html() after

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch 5 times, most recently from 7b874ce to be0988c Compare August 28, 2025 14:21
cleaned = content.replace(' ', '')
soup = BeautifulSoup(cleaned, 'lxml')

LIST_TAGS = ['ul', 'ol']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for moving this here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it inside the function to keep the scope limited, but I can move it back to module-level as before for consistency.

return cleaned
markdown_text = html_converter.handle(str(soup)).strip()
cleaned = markdown.markdown(markdown_text)
cleaned = cleaned.replace('<hr />', '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is hr tag replaced? That can impact the content/html added by the editors.

Copy link
Author

@nbalne nbalne Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had initially removed hr / tag to match previous clean_html() behavior in older code, but I agree that removing it could unintentionally strip valid editor content.

markdown_text = html_converter.handle(str(soup)).strip()
cleaned = markdown.markdown(markdown_text)
cleaned = cleaned.replace('<hr />', '')
cleaned = re.sub(r'([^\s>])(<a\b)', r'\1 \2', cleaned)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the code is searching for <a tag with no space before it and then adding a space explicitly between the content and tag, right? While it might work, it does not exactly fix the problem. Let's say the content had multiple spaces between text and tag, it won't restore to its original value.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right — this regex only fixes the “no space” case and won’t restore the original spacing.
I’ll change the approach to handle both missing space and normalize multiple spaces using BeautifulSoup post-processing instead of regex.

if is_list_with_dir_attr_present:
for tag in LIST_TAGS:
cleaned = cleaned.replace(f'<{tag}>', f'<{tag} dir="rtl">')
cleaned = re.sub(r'\n\s*\n', '\n', cleaned)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this cleaning needed? IF the strip is called on the next line, it will remove the extra spaces from the end.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intended to collapse multiple consecutive blank lines in the middle of the HTML, not just trim ends. However, I see that in most cases strip() is enough. I’ll remove it unless we find cases where mid-content blank lines cause formatting issues

@ankit-sonata
Copy link

ankit-sonata commented Sep 3, 2025

@nbalne rebase your branch with main and also add the SS of before and after testing results.

)
@ddt.data(
(
'<p><em>The content of this course also forms part of the six-month online<a href="https://example.com">Example Link</a></em></p>', # pylint: disable=line-too-long

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this line is duplicated?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first one is our input we can see the difference at before the anchor tag there is no space is reserved and second one is for our expected result. there is space is reserved.

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch 8 times, most recently from 75b94c6 to c281863 Compare September 8, 2025 07:55
@julrusak
Copy link

@mphilbrick211 here is another PR that the 2U team is waiting on. Thanks!

@ankit-sonata
Copy link

@nbalne rebase your branch with main

)
@ddt.data(
(
'<p><em>The content of this course also forms part of the six-month online<a href="https://example.com">Example Link</a></em></p>', # pylint: disable=line-too-long

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

online<a href="https://example.com">Example Link</a> should be online <a href="https://example.com">Example Link</a>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch from c281863 to a3048fa Compare September 18, 2025 07:13
Copy link

@ankit-sonata ankit-sonata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a3048fa#diff-dbd6e02082248457db038d62841a53ca2f88d65904e3bc54465a69126e2cd4b2R885

Are you sure spacing should not be there before anchor tag "online <a href="

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch from a3048fa to 7a3719e Compare September 19, 2025 05:52
@ankit-sonata
Copy link

@nbalne rebase the branch with main

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch from 7a3719e to 8c81b05 Compare September 19, 2025 14:12
@nbalne
Copy link
Author

nbalne commented Sep 19, 2025

@nbalne rebase the branch with main

updated

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch 6 times, most recently from e2cdad0 to 71dd59a Compare September 25, 2025 11:11
)
@ddt.unpack
def test_clean_html(self, content, expected):
""" Verify the method removes unnecessary HTML attributes. """

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nbalne why the docstring is removed maybe you want to update it ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

"""
if not self.is_p_tag_with_dir:
super().handle_tag(tag, attrs, start)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nbalne try not to unnecessarily update the code

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 787 to 790

cleaned = str(soup)
# Need to clean empty <b> and <p> tags which are converted to <hr/> by html2text
cleaned = cleaned.replace('<p><b></b></p>', '')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nbalne what are you doing here ? why making these code changes what difference it is making?

Copy link
Author

@nbalne nbalne Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankit-sonata updated

is_list_with_dir_attr_present = False

cleaned = content.replace('&nbsp;', '') # Keeping the removal of nbsps for historical consistency
# Parse the HTML using BeautifulSoup

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nbalne same here why it is updated

Copy link
Author

@nbalne nbalne Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankit-sonata updated

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch 4 times, most recently from 6ad4ad2 to a0363c0 Compare September 30, 2025 07:39
@ddt.data(
(
'<p><em>The content of this course also forms part of the six-month online <a href="https://example.com">Example Link</a></em></p>', # pylint: disable=line-too-long
'<p><em>The content of this course also forms part of the six-month online <a href="https://example.com">Example Link</a></em></p>' # pylint: disable=line-too-long

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify what is the difference between these two <p> tag lines? They look identical.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankit-sonata The difference between the two strings is that in the input string, there is no space before the tag (the space is missing), whereas in the expected output, a space is included.
I will correct the input string to reflect proper spacing.

@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch 3 times, most recently from e8dbf5d to aebba93 Compare October 6, 2025 11:00
@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch 2 times, most recently from 31a4912 to 8924626 Compare October 16, 2025 06:47
@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch 3 times, most recently from 98f646e to eda8fad Compare November 12, 2025 11:04
@nbalne nbalne force-pushed the Prod-4415/cleanhtml branch from eda8fad to fcb7012 Compare November 12, 2025 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants