Hi 👋
While working on Confluence page parsing tasks, we found that markdownify behaves differently depending on how `<img>` tags are written in the DOM.
TL;DR: markdownify doesn't handle `<img>` tags mixed with `<img />` tags, which causes some images to be omitted from the Markdown output.
What Specifications Say
Before diving into examples, I checked the relevant specifications to make sure that both `<img />` and `<img>` are valid tags. They are: XHTML/XML uses `<img />` and HTML5 uses `<img>`, as specified in Section 9.5.3 of RFC 7992.
Examples
When the DOM contains images written with both autoclosing (`<img />`) and non-autoclosing (`<img>`) tags, some images can be dropped from the Markdown output.
Base Test
As a reference, we use an example in which every image is an autoclosing `<img />` tag. This is the ideal case for markdownify.
```python
from bs4 import BeautifulSoup
from markdownify import markdownify

if __name__ == '__main__':
    # All three images use the autoclosing <img /> form.
    content = '''<table>
<tbody>
<tr>
<td>
<span><img src="https://placehold.co/600x400/gray/white" alt="(gray)" name=":gray:"/></span>
</td>
<td>
<p>
<img src="https://placehold.co/600x400/orange/white" alt="(orange)" name=":orange:"/> / <img src="https://placehold.co/600x400/blue/white" alt="(blue)" name=":blue:"/>
</p>
</td>
</tr>
</tbody>
</table>'''

    # Print the tree as BeautifulSoup parses it, then the Markdown conversion.
    print('BeautifulSoup Output:\n')
    print(BeautifulSoup(content, "html.parser"), end='\n' * 3)
    print('Markdownify Output:\n')
    print(markdownify(content, heading_style="ATX"))
```

This produces the following expected output:
```text
BeautifulSoup Output:

<table>
<tbody>
<tr>
<td>
<span><img alt="(gray)" name=":gray:" src="https://placehold.co/600x400/gray/white"/></span>
</td>
<td>
<p>
<img alt="(orange)" name=":orange:" src="https://placehold.co/600x400/orange/white"/> / <img alt="(blue)" name=":blue:" src="https://placehold.co/600x400/blue/white"/>
</p>
</td>
</tr>
</tbody>
</table>


Markdownify Output:

| | |
| --- | --- |
| (gray) | (orange) / (blue) |
```
As you can see, every image is converted to its alt attribute value and every alt is present.
Example 1
If we spice things up and change the first image to a non-autoclosing `<img>` tag:
```python
from bs4 import BeautifulSoup
from markdownify import markdownify

if __name__ == '__main__':
    # Only the (gray) image uses the non-autoclosing <img> form.
    content = '''<table>
<tbody>
<tr>
<td>
<span><img src="https://placehold.co/600x400/gray/white" alt="(gray)" name=":gray:"></span>
</td>
<td>
<p>
<img src="https://placehold.co/600x400/orange/white" alt="(orange)" name=":orange:"/> / <img src="https://placehold.co/600x400/blue/white" alt="(blue)" name=":blue:"/>
</p>
</td>
</tr>
</tbody>
</table>'''

    # Print the tree as BeautifulSoup parses it, then the Markdown conversion.
    print('BeautifulSoup Output:\n')
    print(BeautifulSoup(content, "html.parser"), end='\n' * 3)
    print('Markdownify Output:\n')
    print(markdownify(content, heading_style="ATX"))
```

The (blue) image disappears from the output:
```text
BeautifulSoup Output:

<table>
<tbody>
<tr>
<td>
<span><img alt="(gray)" name=":gray:" src="https://placehold.co/600x400/gray/white"/></span>
</td>
<td>
<p>
<img alt="(orange)" name=":orange:" src="https://placehold.co/600x400/orange/white"> / <img alt="(blue)" name=":blue:" src="https://placehold.co/600x400/blue/white"/>
</img></p>
</td>
</tr>
</tbody>
</table>


Markdownify Output:

| | |
| --- | --- |
| (gray) | (orange) |
```
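Notice that BeautifulSoup treated the first `<img>` tag (gray) as autoclosing but the following `<img />` tag (orange) as non-autoclosing 🤔 As a result, the (blue) image is nested inside the (orange) element (note the `</img>` before `</p>`) and is missing from the Markdown output.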
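To illustrate how the two spellings reach a tree builder, here is a minimal sketch that uses Python's standard `html.parser.HTMLParser` directly (the class that BeautifulSoup's `"html.parser"` option builds on), without BeautifulSoup or markdownify. The `ImgEventLogger` class and `img_events` helper are hypothetical names used only for this illustration. It shows that the standard-library parser reports `<img>` and `<img />` through different callbacks; it does not by itself pin down where the nesting comes from.

```python
from html.parser import HTMLParser


class ImgEventLogger(HTMLParser):
    """Hypothetical helper: record which callback each <img> tag triggers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for tags written without a trailing slash, e.g. <img ...>
        if tag == "img":
            self.events.append(("start", dict(attrs).get("alt")))

    def handle_startendtag(self, tag, attrs):
        # Called for tags written with a trailing slash, e.g. <img .../>
        if tag == "img":
            self.events.append(("startend", dict(attrs).get("alt")))


def img_events(html):
    """Return the list of (callback, alt) pairs seen while parsing html."""
    parser = ImgEventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = '<img alt="(gray)"><img alt="(orange)"/> / <img alt="(blue)"/>'
    for event in img_events(sample):
        print(event)
```

Because the two forms arrive as different events, any code that tracks open and closed void elements has to handle both paths consistently.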
Example 2
We can go further and only leave the (orange) image as autoclosing:
```python
from bs4 import BeautifulSoup
from markdownify import markdownify

if __name__ == '__main__':
    # Only the (orange) image uses the autoclosing <img /> form.
    content = '''<table>
<tbody>
<tr>
<td>
<span><img src="https://placehold.co/600x400/gray/white" alt="(gray)" name=":gray:"></span>
</td>
<td>
<p>
<img src="https://placehold.co/600x400/orange/white" alt="(orange)" name=":orange:"/> / <img src="https://placehold.co/600x400/blue/white" alt="(blue)" name=":blue:">
</p>
</td>
</tr>
</tbody>
</table>'''

    # Print the tree as BeautifulSoup parses it, then the Markdown conversion.
    print('BeautifulSoup Output:\n')
    print(BeautifulSoup(content, "html.parser"), end='\n' * 3)
    print('Markdownify Output:\n')
    print(markdownify(content, heading_style="ATX"))
```

This produces the same result as Example 1:
```text
BeautifulSoup Output:

<table>
<tbody>
<tr>
<td>
<span><img alt="(gray)" name=":gray:" src="https://placehold.co/600x400/gray/white"/></span>
</td>
<td>
<p>
<img alt="(orange)" name=":orange:" src="https://placehold.co/600x400/orange/white"> / <img alt="(blue)" name=":blue:" src="https://placehold.co/600x400/blue/white"/>
</img></p>
</td>
</tr>
</tbody>
</table>


Markdownify Output:

| | |
| --- | --- |
| (gray) | (orange) |
```
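The examples above change one tag at a time by hand. As a rough sketch, the same test document can be generated for every mix of the two tag styles; `img_tag`, `build_table`, and `all_variants` are hypothetical helpers written for this illustration and are not part of markdownify or BeautifulSoup.

```python
from itertools import product


def img_tag(color, autoclosing):
    """Build one placeholder image tag in either <img .../> or <img ...> form."""
    closing = "/>" if autoclosing else ">"
    return (f'<img src="https://placehold.co/600x400/{color}/white" '
            f'alt="({color})" name=":{color}:"{closing}')


def build_table(gray, orange, blue):
    """Rebuild the test table; each flag is True for the autoclosing form."""
    return (
        "<table>\n<tbody>\n<tr>\n<td>\n"
        f"<span>{img_tag('gray', gray)}</span>\n"
        "</td>\n<td>\n<p>\n"
        f"{img_tag('orange', orange)} / {img_tag('blue', blue)}\n"
        "</p>\n</td>\n</tr>\n</tbody>\n</table>"
    )


def all_variants():
    """Yield ((gray, orange, blue), html) for all 8 combinations."""
    for flags in product((False, True), repeat=3):
        yield flags, build_table(*flags)


if __name__ == "__main__":
    for flags, html in all_variants():
        # Each html string can be passed to markdownify(html, heading_style="ATX")
        # in the same way as the examples above.
        print(flags, len(html))
```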
Combinations
Other combinations seem to work fine.
Here is a table of all the cases I've encountered:
| Gray | Orange | Blue | Result |
|---|---|---|---|
| `<img>` | `<img>` | `<img>` | ✅ |
| `<img>` | `<img>` | `<img />` | ✅ |
| `<img>` | `<img />` | `<img>` | ❌ (blue) is missing |
| `<img>` | `<img />` | `<img />` | ❌ (blue) is missing |
| `<img />` | `<img>` | `<img>` | ✅ |
| `<img />` | `<img>` | `<img />` | ✅ |
| `<img />` | `<img />` | `<img>` | ✅ |
| `<img />` | `<img />` | `<img />` | ✅ |
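
Since the rows where all three tags use the same syntax convert correctly, one possible workaround is to normalize every image tag to a single form before conversion. The following is a minimal sketch, not part of markdownify or BeautifulSoup: `normalize_img_tags` is a hypothetical helper, and the regular expression assumes that attribute values never contain a literal `>`.

```python
import re

# Matches <img ...> and <img .../>, capturing the attributes without the
# optional trailing slash. Assumption: no attribute value contains ">".
_IMG_TAG = re.compile(r"<img\b([^>]*?)\s*/?>", re.IGNORECASE)


def normalize_img_tags(html):
    """Rewrite every image tag to the autoclosing <img ... /> form."""
    return _IMG_TAG.sub(r"<img\1 />", html)


if __name__ == "__main__":
    mixed = '<img alt="(gray)"><img alt="(orange)"/> / <img alt="(blue)" />'
    print(normalize_img_tags(mixed))
```

The normalized string can then be passed to `markdownify` as in the examples above. This is a stopgap for affected inputs rather than a fix for the underlying parsing behavior.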
Thanks for your help 🙏
