Autoclosing / Non-Autoclosing HTML tags support #205

@vincentkelleher

Hi 👋

While working on Confluence page parsing, we found that markdownify behaves differently depending on how <img> tags are written in the source HTML.

TL;DR: markdownify doesn't handle <img> tags mixed with <img /> tags, which causes some images to be omitted from the Markdown output.

What Specifications Say

Before diving into examples, I checked the specifications to make sure that both <img /> and <img> are valid tags. They are: XHTML/XML uses <img /> and HTML5 uses <img>, as specified in Section 9.5.3 of RFC 7992.

Examples

When images are present in the DOM with both autoclosing and non-autoclosing tags, issues occur.

Base Test

As a reference, we use an example in which every image is an autoclosing <img /> tag; this is the ideal case for markdownify.

from bs4 import BeautifulSoup
from markdownify import markdownify


if __name__ == '__main__':
    content = '''<table>
        <tbody>
            <tr>
                <td>
                    <span><img src="https://placehold.co/600x400/gray/white" alt="(gray)" name=":gray:"/></span>
                </td>
                <td>
                    <p>
                        <img src="https://placehold.co/600x400/orange/white" alt="(orange)" name=":orange:"/> / <img src="https://placehold.co/600x400/blue/white" alt="(blue)" name=":blue:"/>
                    </p>
                </td>
            </tr>
        </tbody>
    </table>'''

    print('BeautifulSoup Output:\n')
    print(BeautifulSoup(content, "html.parser"), end='\n' * 3)

    print('Markdownify Output:\n')
    print(markdownify(content, heading_style="ATX"))

This produces the following expected output:

BeautifulSoup Output:

<table>
<tbody>
<tr>
<td>
<span><img alt="(gray)" name=":gray:" src="https://placehold.co/600x400/gray/white"/></span>
</td>
<td>
<p>
<img alt="(orange)" name=":orange:" src="https://placehold.co/600x400/orange/white"/> / <img alt="(blue)" name=":blue:" src="https://placehold.co/600x400/blue/white"/>
</p>
</td>
</tr>
</tbody>
</table>


Markdownify Output:

|  |  |
| --- | --- |
| (gray) | (orange) / (blue) |

As you can see, every image is converted to its alt attribute value and every alt is present.

Example 1

If we spice things up and change the first image to a non-autoclosing <img> tag:

from bs4 import BeautifulSoup
from markdownify import markdownify


if __name__ == '__main__':
    content = '''<table>
        <tbody>
            <tr>
                <td>
                    <span><img src="https://placehold.co/600x400/gray/white" alt="(gray)" name=":gray:"></span>
                </td>
                <td>
                    <p>
                        <img src="https://placehold.co/600x400/orange/white" alt="(orange)" name=":orange:"/> / <img src="https://placehold.co/600x400/blue/white" alt="(blue)" name=":blue:"/>
                    </p>
                </td>
            </tr>
        </tbody>
    </table>'''

    print('BeautifulSoup Output:\n')
    print(BeautifulSoup(content, "html.parser"), end='\n' * 3)

    print('Markdownify Output:\n')
    print(markdownify(content, heading_style="ATX"))

The (blue) image disappears from the output:

BeautifulSoup Output:

<table>
<tbody>
<tr>
<td>
<span><img alt="(gray)" name=":gray:" src="https://placehold.co/600x400/gray/white"/></span>
</td>
<td>
<p>
<img alt="(orange)" name=":orange:" src="https://placehold.co/600x400/orange/white"> / <img alt="(blue)" name=":blue:" src="https://placehold.co/600x400/blue/white"/>
</img></p>
</td>
</tr>
</tbody>
</table>


Markdownify Output:

|  |  |
| --- | --- |
| (gray) | (orange) |

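Notice that BeautifulSoup serialized the first tag (gray), written as <img>, as autoclosing, but treated the second tag (orange), written as <img />, as non-autoclosing: it now wraps the (blue) image and is only closed by the </img> before </p> 🤔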
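To see that the two spellings take different code paths, here is a minimal sketch using only Python's built-in html.parser, which the "html.parser" backend of BeautifulSoup builds on. The class and function names (TagEventLogger, img_events) are hypothetical and exist only for this illustration. It does not reproduce the nesting itself; it only shows that <img> and <img /> are reported through different callbacks.

```python
from html.parser import HTMLParser


class TagEventLogger(HTMLParser):
    """Record which callback the standard-library parser fires for each <img>."""

    def __init__(self) -> None:
        super().__init__()
        self.events: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # Fired for tags written without a trailing slash, e.g. <img ...>
        if tag == "img":
            self.events.append(("starttag", dict(attrs).get("alt", "")))

    def handle_startendtag(self, tag, attrs):
        # Fired for XHTML-style empty tags, e.g. <img ... />
        if tag == "img":
            self.events.append(("startendtag", dict(attrs).get("alt", "")))


def img_events(html: str) -> list[tuple[str, str]]:
    """Return (callback name, alt text) for every <img> in the markup."""
    logger = TagEventLogger()
    logger.feed(html)
    logger.close()
    return logger.events


if __name__ == "__main__":
    sample = (
        '<span><img alt="(gray)"></span>'
        '<p><img alt="(orange)"/> / <img alt="(blue)"/></p>'
    )
    for callback, alt in img_events(sample):
        print(f"{alt}: {callback}")
```

With the Example 1 layout, (gray) arrives through handle_starttag while (orange) and (blue) arrive through handle_startendtag, so any state a tree builder keeps between those callbacks could plausibly explain why the order of spellings matters.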

Example 2

We can go further and only leave the (orange) image as autoclosing:

from bs4 import BeautifulSoup
from markdownify import markdownify


if __name__ == '__main__':
    content = '''<table>
        <tbody>
            <tr>
                <td>
                    <span><img src="https://placehold.co/600x400/gray/white" alt="(gray)" name=":gray:"></span>
                </td>
                <td>
                    <p>
                        <img src="https://placehold.co/600x400/orange/white" alt="(orange)" name=":orange:"/> / <img src="https://placehold.co/600x400/blue/white" alt="(blue)" name=":blue:">
                    </p>
                </td>
            </tr>
        </tbody>
    </table>'''

    print('BeautifulSoup Output:\n')
    print(BeautifulSoup(content, "html.parser"), end='\n' * 3)

    print('Markdownify Output:\n')
    print(markdownify(content, heading_style="ATX"))

This produces the same result as Example 1:

BeautifulSoup Output:

<table>
<tbody>
<tr>
<td>
<span><img alt="(gray)" name=":gray:" src="https://placehold.co/600x400/gray/white"/></span>
</td>
<td>
<p>
<img alt="(orange)" name=":orange:" src="https://placehold.co/600x400/orange/white"> / <img alt="(blue)" name=":blue:" src="https://placehold.co/600x400/blue/white"/>
</img></p>
</td>
</tr>
</tbody>
</table>


Markdownify Output:

|  |  |
| --- | --- |
| (gray) | (orange) |

Combinations

Other combinations seem to work fine.
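For anyone who wants to re-run every combination, the inputs can be generated mechanically. This is a sketch under a few assumptions: the helper names (img_tag, build_table, all_combinations) are hypothetical, the HTML is compacted onto one line instead of using the indentation from the examples above, and markdownify is imported only if it is installed.

```python
from itertools import product

COLORS = ("gray", "orange", "blue")
# Label used for display -> characters that terminate the tag.
STYLES = {"<img>": ">", "<img />": "/>"}


def img_tag(color: str, closer: str) -> str:
    """Build one image tag ending in either '>' or '/>'."""
    return (
        f'<img src="https://placehold.co/600x400/{color}/white" '
        f'alt="({color})" name=":{color}:"{closer}'
    )


def build_table(closers: tuple[str, ...]) -> str:
    """Build the two-cell test table with one tag spelling per image."""
    gray, orange, blue = (img_tag(c, s) for c, s in zip(COLORS, closers))
    return (
        "<table><tbody><tr>"
        f"<td><span>{gray}</span></td>"
        f"<td><p>{orange} / {blue}</p></td>"
        "</tr></tbody></table>"
    )


def all_combinations() -> dict[tuple[str, ...], str]:
    """Map each (gray, orange, blue) spelling triple to its HTML."""
    return {
        labels: build_table(tuple(STYLES[label] for label in labels))
        for labels in product(STYLES, repeat=3)
    }


if __name__ == "__main__":
    try:
        from markdownify import markdownify
    except ImportError:  # the inputs can still be generated without it
        markdownify = None

    for labels, html in all_combinations().items():
        print(" | ".join(labels))
        if markdownify is not None:
            print(markdownify(html, heading_style="ATX"))
```

Each of the eight entries corresponds to one spelling triple, so the Markdown output can be compared against the expected "(gray) | (orange) / (blue)" row.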

Here is a table of all the cases I've encountered:

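| Gray | Orange | Blue | Result |
| --- | --- | --- | --- |
| <img> | <img> | <img> | |
| <img> | <img> | <img /> | |
| <img> | <img /> | <img> | (blue) is missing |
| <img> | <img /> | <img /> | (blue) is missing |
| <img /> | <img> | <img> | |
| <img /> | <img> | <img /> | |
| <img /> | <img /> | <img> | |
| <img /> | <img /> | <img /> | |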
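Since the combinations in which all three tags share one spelling have no failure listed, one possible workaround until this is fixed is to normalize every image tag to a single spelling before calling markdownify. The sketch below is an assumption on my side, not part of markdownify: normalize_img_tags is a hypothetical helper, and the simple regular expression assumes that no attribute value contains a literal '>' character.

```python
import re

# Matches <img ...> as well as <img ... /> (case-insensitive).
# Assumption: attribute values never contain a literal '>'.
_IMG_TAG = re.compile(r"<img\b([^>]*?)\s*/?>", re.IGNORECASE)


def normalize_img_tags(html: str) -> str:
    """Rewrite every image tag to the autoclosing <img .../> spelling."""
    return _IMG_TAG.sub(r"<img\1/>", html)


if __name__ == "__main__":
    mixed = (
        '<span><img src="a.png" alt="(gray)"></span>'
        '<p><img src="b.png" alt="(orange)"/> / <img src="c.png" alt="(blue)" /></p>'
    )
    print(normalize_img_tags(mixed))
```

The normalized string can then be passed to markdownify(..., heading_style="ATX") as in the examples above. The autoclosing spelling is chosen here because it matches the Base Test.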

Thanks for your help 🙏
