feat/parse_html_embed_objects #2233

My3VM · 2023-12-07T16:21:21Z

I am trying to parse HTML documents that contain embedded images and YouTube videos inside iframes. I can use the partition_html function to get the textual elements, as well as the metadata object containing the <a href> links. However, the image elements and the iframe elements are being missed.

I would like these data points to be made available either as separate elements such as HTMLImage and HTMLIframe, or by adding their URLs to the metadata object's link_urls.

MthwRobinson · 2024-06-13T13:53:29Z

@scanny - What do you think about this? I think I'd rather avoid dynamically linked videos or images in HTML files. For images at least, converting the HTML to PDF could work to extract the images. I don't think we're likely to do anything with iframes.

scanny · 2024-06-13T17:28:17Z

tl;dr: We could potentially capture those links but probably not traverse them to actually capture the image or video bytes.

`<img>`

It has crossed my mind that we could treat <img> as something like a special case of <a> and capture the image URL as metadata. One challenge is that <img> can contain no text, so we'd need to use a placeholder like "image" or maybe the image alt-text when present for the .metadata.link_text field in the document-element.
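To make the idea concrete, here is a minimal sketch using only the Python standard library, not the unstructured codebase. It treats <img> like <a>: the image URL is recorded together with a link text taken from the alt-text when present, or the placeholder "image" otherwise. The names `ImageLinkCollector` and `collect_links` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class ImageLinkCollector(HTMLParser):
    """Hypothetical helper: collect (link_text, url) pairs for <a> and <img>."""

    def __init__(self):
        super().__init__()
        self.links = []          # accumulated (link_text, url) tuples
        self._href = None        # href of the <a> currently open, if any
        self._text_parts = []    # text seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text_parts = []
        elif tag == "img" and attrs.get("src"):
            # <img> has no text of its own: fall back to alt-text,
            # then to the placeholder "image".
            text = attrs.get("alt") or "image"
            self.links.append((text, attrs["src"]))

    def handle_data(self, data):
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text_parts).strip(), self._href))
            self._href = None


def collect_links(html):
    """Return link text/URL pairs without fetching any of the URLs."""
    parser = ImageLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Nothing here follows the URL, so no image bytes are downloaded; only the link itself is captured.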

Traversing the link and downloading the image is something we might consider at some point, possibly in "hi_res" mode. The key concern there would be avoiding malicious content, which is a non-trivial extra engineering effort and probably still a risk no matter what you do to avoid it.

`<iframe>`

An <iframe> is essentially a link to another web-page that then gets fetched by the browser and displayed in the "frame". Very similar to <img> except a whole HTML page.

I agree that recursively fetching <iframe> web pages and processing them to elements is probably not something we're going to want to support anytime soon. Top of mind for me there would also be the risk of malicious content.

We could extract the link as some sort of metadata, but because <iframe> is empty (that HTML-element can contain no content) there would be no text and therefore no unstructured document-Element to attach that metadata to. So that would require some noodling. We'd need to add a "fake" element or something to go down that route.
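As a rough sketch of the "fake" element route, again using only the standard library: each <iframe> with a `src` produces a placeholder element whose text is the `title` attribute when present, or the literal "iframe" otherwise, with the URL attached as metadata. `PlaceholderElement`, `IframeCollector`, `collect_iframes`, and the metadata keys are all hypothetical stand-ins, not existing unstructured classes or fields.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class PlaceholderElement:
    """Hypothetical stand-in for a document element that has no real text."""
    text: str
    metadata: dict = field(default_factory=dict)


class IframeCollector(HTMLParser):
    """Hypothetical helper: emit one placeholder element per <iframe src=...>."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attrs = dict(attrs)
        src = attrs.get("src")
        if not src:
            return
        # No text is available, so use the title attribute or a fixed placeholder.
        text = attrs.get("title") or "iframe"
        # Illustrative metadata keys; the URL is recorded, never fetched.
        self.elements.append(
            PlaceholderElement(text=text, metadata={"link_urls": [src]})
        )


def collect_iframes(html):
    parser = IframeCollector()
    parser.feed(html)
    parser.close()
    return parser.elements
```

The framed page is never requested, which sidesteps the malicious-content concern at the cost of capturing only the link.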

MthwRobinson · 2024-06-13T17:35:40Z

Yeah downloading malicious content from the link was my main concern as well. I like the idea of treating <img> similar to links and pulling out the link. Let's keep this one open and we can consider doing that.

My3VM added the "enhancement" (New feature or request) label Dec 7, 2023

scanny added the html label Dec 19, 2024
