feat/parse_html_embed_objects #2233

Open
My3VM opened this issue Dec 7, 2023 · 3 comments
Labels
enhancement New feature or request html

Comments

@My3VM

My3VM commented Dec 7, 2023

I am trying to parse HTML documents that contain embedded images and YouTube videos inside iframes. Using the partition_html function I can get the textual elements, as well as a metadata object containing the <a href> links. However, the image elements and the iframe elements are missed.

I would like these data points to be made available, either as separate elements such as HTMLImage and HTMLIframe, or by including their URLs in the metadata object's link_urls.

@My3VM My3VM added the enhancement New feature or request label Dec 7, 2023
@MthwRobinson
Contributor

@scanny - What do you think about this? I think I'd rather avoid dynamically linked videos or images in HTML files. For images at least, converting the HTML to PDF could work to extract the images. I don't think we're likely to do anything with iframes.

@scanny
Collaborator

scanny commented Jun 13, 2024

tl;dr: We could potentially capture those links but probably not traverse them to actually capture the image or video bytes.

<img>

It has crossed my mind that we could treat <img> as something like a special case of <a> and capture the image URL as metadata. One challenge is that <img> can contain no text, so we'd need to use a placeholder like "image" or maybe the image alt-text when present for the .metadata.link_text field in the document-element.
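A rough sketch of that idea, using only the Python standard library rather than anything in unstructured itself. The class name `ImgLinkCollector`, the helper `collect_img_links`, and the `"text"`/`"url"` keys are made up for illustration; how this would actually map onto `.metadata.link_texts` and `.metadata.link_urls` is an open question. Nothing is fetched here, only the `src` value is recorded.

```python
from html.parser import HTMLParser


class ImgLinkCollector(HTMLParser):
    """Record the URL of each <img>, with alt-text or a placeholder as its text."""

    def __init__(self, placeholder="image"):
        super().__init__()
        self.placeholder = placeholder  # used when alt is missing or blank
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <img ... /> via handle_startendtag.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # nothing to link to
        alt = (attr_map.get("alt") or "").strip()
        self.links.append({"text": alt or self.placeholder, "url": src})


def collect_img_links(html):
    """Hypothetical helper: return one {"text", "url"} dict per <img src=...>."""
    parser = ImgLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p>Hi <img src="a.png" alt="A cat"> and <img src="b.png"></p>'
    for link in collect_img_links(sample):
        print(link["text"], link["url"])
```

An `<img>` without a `src` is skipped, since there is no URL to capture.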

Traversing the link and downloading the image is something we might consider at some point, possibly in "hi_res" mode. The key concern there would be avoiding malicious content, which is a non-trivial extra engineering effort and probably still a risk no matter what you do to avoid it.

<iframe>

An <iframe> is essentially a link to another web page that then gets fetched by the browser and displayed in the "frame". It is very similar to <img>, except that it embeds a whole HTML page.

I agree that recursively fetching <iframe> web pages and processing them to elements is probably not something we're going to want to support anytime soon. Top of mind for me there would also be the risk of malicious content.

We could extract the link as some sort of metadata, but because <iframe> is empty (that HTML-element can contain no content) there would be no text and therefore no unstructured document-Element to attach that metadata to. So that would require some noodling. We'd need to add a "fake" element or something to go down that route.
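To make the "fake" element idea concrete, here is a minimal sketch, again standard-library only. `PlaceholderElement` and `collect_iframes` are hypothetical names, not existing unstructured classes or functions, and falling back to the `title` attribute for the text is my own assumption rather than something proposed above. The frame URL is captured but never fetched.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class PlaceholderElement:
    """Hypothetical stand-in for a document element that has no real text."""

    text: str
    link_urls: list = field(default_factory=list)


class IframeCollector(HTMLParser):
    """Emit one placeholder element per <iframe> that has a src attribute."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return
        # Assumption: use the title attribute as text when present,
        # otherwise fall back to the literal string "iframe".
        title = (attr_map.get("title") or "").strip()
        self.elements.append(PlaceholderElement(text=title or "iframe", link_urls=[src]))


def collect_iframes(html):
    """Hypothetical helper: return placeholder elements for the iframes in html."""
    parser = IframeCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    sample = '<iframe src="https://www.youtube.com/embed/xyz" title="Demo"></iframe>'
    for element in collect_iframes(sample):
        print(element.text, element.link_urls)
```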

@MthwRobinson
Contributor

Yeah, downloading malicious content from the link was my main concern as well. I like the idea of treating <img> similarly to links and pulling out the URL. Let's keep this one open and we can consider doing that.

@scanny scanny added the html label Dec 19, 2024