refactor: clean_html() to improve HTML sanitization #4663

nbalne · 2025-08-28T09:39:14Z

This PR refactors the clean_html() utility to enhance HTML cleaning while ensuring consistent formatting with generated content.

Ticket: https://2u-internal.atlassian.net/browse/PROD-4415

Key changes:

Removed non-breaking spaces (&nbsp;) to prevent unwanted spacing in output.
Preserved the dir="rtl" attribute on ul and ol tags by removing it before processing and restoring it afterwards.
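A minimal sketch of the remove-then-restore idea for list direction attributes, using only the standard library. The helper names `strip_list_dir` and `restore_list_dir` are hypothetical, and the regex is a simplified stand-in for whatever parsing clean_html() actually does; only the restore loop mirrors the diff in this PR.

```python
import re

LIST_TAGS = ['ul', 'ol']

# Simplified assumption: the opening tag carries dir="rtl" and nothing else.
RTL_LIST_OPEN = re.compile(r'<(ul|ol)\s+dir="rtl"\s*>')


def strip_list_dir(html):
    """Remove dir="rtl" from list tags and report whether any was found."""
    stripped, count = RTL_LIST_OPEN.subn(r'<\1>', html)
    return stripped, count > 0


def restore_list_dir(html, had_rtl):
    """Re-add dir="rtl" to every bare <ul>/<ol> if the flag is set."""
    if had_rtl:
        for tag in LIST_TAGS:
            html = html.replace(f'<{tag}>', f'<{tag} dir="rtl">')
    return html


original = '<ul dir="rtl"><li>item</li></ul>'
stripped, had_rtl = strip_list_dir(original)
print(stripped)                             # <ul><li>item</li></ul>
print(restore_list_dir(stripped, had_rtl))  # <ul dir="rtl"><li>item</li></ul>
```

Note that restoration in this sketch is all-or-nothing: once the flag is set, every bare list tag gets dir="rtl", including lists that did not have it originally.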

Improved HTML normalization:

Converted HTML → Markdown → HTML to remove unsupported attributes, inline styles, and normalize tag structure.
Ensured no line wrapping in converted Markdown (bodywidth=None).
Fixed tag spacing issue where anchor tags were stuck to preceding words/tags.
Removed unnecessary <hr /> tags.
Collapsed multiple blank lines into a single line for cleaner output.
Returned final HTML with trimmed whitespace for consistent output formatting.
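The overall shape of the round trip can be sketched as below. To keep the example self-contained, the two converters are passed in as parameters; in this PR they would be `html_converter.handle` and `markdown.markdown`. The function name `normalize_html` and both toy converters are assumptions for illustration only and do not reproduce real Markdown conversion.

```python
import re


def normalize_html(content, to_markdown, to_html):
    """Round-trip HTML through Markdown and back, trimming whitespace."""
    cleaned = content.replace('&nbsp;', '')       # drop non-breaking space entities
    markdown_text = to_markdown(cleaned).strip()  # HTML -> Markdown
    return to_html(markdown_text).strip()         # Markdown -> HTML


# Toy stand-ins (assumptions, NOT the real converters).
def toy_to_markdown(html):
    # Strips every tag, which also discards attributes and inline styles.
    return re.sub(r'<[^>]+>', '', html)


def toy_to_html(markdown_text):
    # Wraps the text in a single paragraph.
    return f'<p>{markdown_text}</p>'


print(normalize_html('<p style="color: red">Hello&nbsp;</p>\n',
                     toy_to_markdown, toy_to_html))
# prints: <p>Hello</p>
```

Because the intermediate Markdown has no way to express inline styles or arbitrary attributes, anything the Markdown step cannot represent is dropped on the way back to HTML.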

In test cases:
Before code update (test screenshot)

After code update (test screenshot)

DawoudSheraz · 2025-08-29T06:06:26Z

course_discovery/apps/course_metadata/utils.py

+    cleaned = content.replace('&nbsp;', '')
    soup = BeautifulSoup(cleaned, 'lxml')
-
+    LIST_TAGS = ['ul', 'ol']


Any reason for moving this here?

I moved it inside the function to keep the scope limited, but I can move it back to module-level as before for consistency.

DawoudSheraz · 2025-08-29T06:07:15Z

course_discovery/apps/course_metadata/utils.py

-    return cleaned
+    markdown_text = html_converter.handle(str(soup)).strip()
+    cleaned = markdown.markdown(markdown_text)
+    cleaned = cleaned.replace('<hr />', '')


Why is the hr tag replaced? That can impact the content/HTML added by the editors.

I had initially removed the <hr /> tag to match the previous clean_html() behavior in older code, but I agree that removing it could unintentionally strip valid editor content.

DawoudSheraz · 2025-08-29T06:20:05Z

course_discovery/apps/course_metadata/utils.py

+    markdown_text = html_converter.handle(str(soup)).strip()
+    cleaned = markdown.markdown(markdown_text)
+    cleaned = cleaned.replace('<hr />', '')
+    cleaned = re.sub(r'([^\s>])(<a\b)', r'\1 \2', cleaned)


So, the code is searching for an <a tag with no space before it and then explicitly adding a space between the content and the tag, right? While it might work, it does not exactly fix the problem. If the content had multiple spaces between the text and the tag, it won't be restored to its original value.

You’re right: this regex only fixes the “no space” case and won’t restore the original spacing.
I’ll change the approach to both add a missing space and normalize multiple spaces, using BeautifulSoup post-processing instead of regex.
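To make the limitation concrete, here is a small sketch. The first pattern is the one from the diff; the second, `NORMALIZE_GAP`, is a hypothetical regex variant that collapses any run of spaces or tabs before the anchor to one space. It is not the BeautifulSoup approach mentioned above, just an illustration of the two behaviors.

```python
import re

# Pattern from the diff: only fires when a non-space, non-'>' character
# sits directly before '<a'.
MISSING_GAP = re.compile(r'([^\s>])(<a\b)')

# Hypothetical variant: also swallows existing spaces/tabs so the result
# always has exactly one space before the anchor.
NORMALIZE_GAP = re.compile(r'([^\s>])[ \t]*(<a\b)')


def add_missing_space(html):
    return MISSING_GAP.sub(r'\1 \2', html)


def normalize_space(html):
    return NORMALIZE_GAP.sub(r'\1 \2', html)


no_space = 'online<a href="#">link</a>'
many_spaces = 'online   <a href="#">link</a>'

print(add_missing_space(no_space))     # online <a href="#">link</a>
print(add_missing_space(many_spaces))  # online   <a href="#">link</a>  (unchanged)
print(normalize_space(many_spaces))    # online <a href="#">link</a>
```

Neither pattern touches an anchor that directly follows another tag, such as `<em><a ...>`, because `>` is excluded from the leading character class.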

DawoudSheraz · 2025-08-29T06:21:05Z

course_discovery/apps/course_metadata/utils.py

+    if is_list_with_dir_attr_present:
+        for tag in LIST_TAGS:
+            cleaned = cleaned.replace(f'<{tag}>', f'<{tag} dir="rtl">')
+    cleaned = re.sub(r'\n\s*\n', '\n', cleaned)


Why is this cleaning needed? If strip() is called on the next line, it will remove the extra whitespace from the ends.

This was intended to collapse multiple consecutive blank lines in the middle of the HTML, not just trim the ends. However, I see that in most cases strip() is enough. I’ll remove it unless we find cases where mid-content blank lines cause formatting issues.
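The difference between the two operations can be seen in a short sketch; `collapse_blank_lines` is a hypothetical wrapper around the regex from the diff.

```python
import re


def collapse_blank_lines(html):
    """Collapse runs of blank or whitespace-only lines into one newline."""
    return re.sub(r'\n\s*\n', '\n', html)


sample = '<p>first</p>\n\n\n<p>second</p>\n\n'

# strip() only trims the ends; the blank lines in the middle survive.
print(repr(sample.strip()))
# '<p>first</p>\n\n\n<p>second</p>'

# The regex also removes the blank lines between the two paragraphs.
print(repr(collapse_blank_lines(sample).strip()))
# '<p>first</p>\n<p>second</p>'
```

Since the regex runs over the raw string, it would also collapse blank lines inside whitespace-sensitive elements such as `<pre>`, which is one reason to apply it only when needed.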

ankit-sonata · 2025-09-03T13:48:41Z

@nbalne please rebase your branch onto main and add before/after screenshots of the test results.

ankit-sonata · 2025-09-03T16:53:28Z

course_discovery/apps/course_metadata/tests/test_utils.py

    )
+    @ddt.data(
+        (
+            '<p><em>The content of this course also forms part of the six-month online<a href="https://example.com">Example Link</a></em></p>',  # pylint: disable=line-too-long


Why is this line duplicated?

The first one is our input: there is no space before the anchor tag. The second one is our expected result, where a space is present before the anchor tag.

julrusak · 2025-09-10T19:03:35Z

@mphilbrick211 here is another PR that the 2U team is waiting on. Thanks!

ankit-sonata · 2025-09-17T16:04:09Z

@nbalne rebase your branch with main

ankit-sonata · 2025-09-17T16:05:56Z

course_discovery/apps/course_metadata/tests/test_utils.py

    )
+    @ddt.data(
+        (
+            '<p><em>The content of this course also forms part of the six-month online<a href="https://example.com">Example Link</a></em></p>',  # pylint: disable=line-too-long


online<a href="https://example.com">Example Link</a> should be online <a href="https://example.com">Example Link</a>

ankit-sonata

a3048fa#diff-dbd6e02082248457db038d62841a53ca2f88d65904e3bc54465a69126e2cd4b2R885

Are you sure there should be no space before the anchor tag ("online <a href=")?

ankit-sonata · 2025-09-19T13:30:05Z

@nbalne rebase the branch with main

nbalne · 2025-09-19T14:29:56Z

@nbalne rebase the branch with main

updated

ankit-sonata · 2025-09-26T12:47:59Z

course_discovery/apps/course_metadata/tests/test_utils.py

+    )
    @ddt.unpack
    def test_clean_html(self, content, expected):
-        """ Verify the method removes unnecessary HTML attributes. """


@nbalne why was the docstring removed? Maybe you meant to update it?

ankit-sonata · 2025-09-26T12:48:48Z

course_discovery/apps/course_metadata/utils.py

        """
        if not self.is_p_tag_with_dir:
            super().handle_tag(tag, attrs, start)
-


@nbalne try not to unnecessarily update the code

ankit-sonata · 2025-09-26T12:50:14Z

course_discovery/apps/course_metadata/utils.py

-
-    cleaned = str(soup)
-    # Need to clean empty <b> and <p> tags which are converted to <hr/> by html2text
-    cleaned = cleaned.replace('<p><b></b></p>', '')


@nbalne what are you doing here? Why make these code changes, and what difference do they make?

@ankit-sonata updated

ankit-sonata · 2025-09-26T12:50:57Z

course_discovery/apps/course_metadata/utils.py

    is_list_with_dir_attr_present = False
-
-    cleaned = content.replace('&nbsp;', '')  # Keeping the removal of nbsps for historical consistency
-    # Parse the HTML using BeautifulSoup


@nbalne same here: why was this updated?

@ankit-sonata updated

ankit-sonata · 2025-09-30T14:25:57Z

course_discovery/apps/course_metadata/tests/test_utils.py

+    @ddt.data(
+        (
+            '<p><em>The content of this course also forms part of the six-month online <a href="https://example.com">Example Link</a></em></p>',  # pylint: disable=line-too-long
+            '<p><em>The content of this course also forms part of the six-month online <a href="https://example.com">Example Link</a></em></p>'  # pylint: disable=line-too-long


Could you clarify what is the difference between these two <p> tag lines? They look identical.

Hi @ankit-sonata, the difference between the two strings is that the input string has no space before the anchor tag, whereas the expected output includes one.
I will correct the input string to reflect proper spacing.
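For reference, the intended input/expected pairing can be sketched as a small case table. In the PR these pairs live in `@ddt.data` tuples unpacked into `test_clean_html(self, content, expected)`; here a plain loop is used so the example runs without extra dependencies. `add_space_before_anchor` is a hypothetical stand-in for the spacing step of clean_html(), and the strings are shortened versions of the ones in the test file.

```python
import re


def add_space_before_anchor(html):
    # Hypothetical stand-in: inserts a space when text runs straight into <a.
    return re.sub(r'([^\s>])(<a\b)', r'\1 \2', html)


CASES = [
    # (content, expected): the input has NO space before <a, the output does.
    (
        '<p><em>six-month online<a href="https://example.com">Example Link</a></em></p>',
        '<p><em>six-month online <a href="https://example.com">Example Link</a></em></p>',
    ),
    # Input that is already spaced passes through unchanged.
    (
        '<p><em>six-month online <a href="https://example.com">Example Link</a></em></p>',
        '<p><em>six-month online <a href="https://example.com">Example Link</a></em></p>',
    ),
]

for content, expected in CASES:
    assert add_space_before_anchor(content) == expected
print('all cases passed')
```

Keeping the missing-space case and the already-spaced case as separate tuples makes it obvious at review time which pair exercises the fix and which one guards against regressions.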

nbalne force-pushed the Prod-4415/cleanhtml branch 5 times, most recently from 7b874ce to be0988c Compare August 28, 2025 14:21

nbalne requested review from DawoudSheraz and skumargupta83 August 29, 2025 05:32

DawoudSheraz reviewed Aug 29, 2025

ankit-sonata suggested changes Sep 3, 2025

nbalne force-pushed the Prod-4415/cleanhtml branch 8 times, most recently from 75b94c6 to c281863 Compare September 8, 2025 07:55

ankit-sonata suggested changes Sep 17, 2025

nbalne force-pushed the Prod-4415/cleanhtml branch from c281863 to a3048fa Compare September 18, 2025 07:13

ankit-sonata suggested changes Sep 18, 2025

nbalne force-pushed the Prod-4415/cleanhtml branch from a3048fa to 7a3719e Compare September 19, 2025 05:52

nbalne force-pushed the Prod-4415/cleanhtml branch from 7a3719e to 8c81b05 Compare September 19, 2025 14:12

nbalne force-pushed the Prod-4415/cleanhtml branch 6 times, most recently from e2cdad0 to 71dd59a Compare September 25, 2025 11:11

ankit-sonata suggested changes Sep 26, 2025

nbalne force-pushed the Prod-4415/cleanhtml branch 4 times, most recently from 6ad4ad2 to a0363c0 Compare September 30, 2025 07:39

ankit-sonata reviewed Sep 30, 2025

nbalne force-pushed the Prod-4415/cleanhtml branch 3 times, most recently from e8dbf5d to aebba93 Compare October 6, 2025 11:00

nbalne force-pushed the Prod-4415/cleanhtml branch 2 times, most recently from 31a4912 to 8924626 Compare October 16, 2025 06:47

nbalne force-pushed the Prod-4415/cleanhtml branch 3 times, most recently from 98f646e to eda8fad Compare November 12, 2025 11:04

refactor: clean_html() to improve HTML sanitization

fcb7012

nbalne force-pushed the Prod-4415/cleanhtml branch from eda8fad to fcb7012 Compare November 12, 2025 11:11

