Fix HTTPFileSystem isdir downloads the whole file issue #1889


Open · wants to merge 2 commits into base: master

Conversation

amastilovic

Method `_ls_real` tries to download the whole `r.text()` of a link regardless of its content type. Prevent this download in all cases except when the Content-Type header is not set or is set to text/html.
@martindurant (Member) left a comment:


Another possible approach would be to stream-open the URL and download only the first chunk, looking for <html> or <!DOCTYPE html>. That would also give us the opportunity to set an upper limit on the size of the download even for URLs with no type.

So, how about we allow an immediate read only for "text/html", skip all other explicitly given types (as you are doing), and, when there is no header, read by chunks only up to a maximum size?
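A minimal sketch of that idea, assuming an aiohttp session is available; the helper name, the marker regex, and the 64 KiB cap are illustrative choices, not part of this PR:

```python
import re

import aiohttp

# Markers that suggest the body is an HTML page (assumed heuristic).
HTML_MARKER = re.compile(rb"<html|<!doctype\s+html", re.IGNORECASE)
MAX_SNIFF_BYTES = 64 * 1024  # upper bound on the download for untyped URLs


async def looks_like_html(session: aiohttp.ClientSession, url: str) -> bool:
    async with session.get(url) as r:
        ctype = r.headers.get("Content-Type", "")
        if ctype and not ctype.lower().startswith("text/html"):
            return False  # explicitly typed and not HTML: skip the body
        # Untyped (or HTML-typed) response: read at most MAX_SNIFF_BYTES
        # and look for the markers instead of downloading everything.
        head = await r.content.read(MAX_SNIFF_BYTES)
        return bool(HTML_MARKER.search(head))
```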

                    links = [u[2] for u in ex.findall(text)]
            except UnicodeDecodeError:
                links = []  # binary, not HTML
            url_info = await self._info(url, **kwargs)
martindurant (Member):

This is an extra call; it would be better to look at r's headers.
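For context, "look at r's headers" would mean reusing the response already in hand rather than issuing a second request; a sketch, assuming `r` is the aiohttp ClientResponse inside `_ls_real`:

```python
# No extra round trip via self._info(url): the ClientResponse already
# carries the headers from the listing request.
content_type = r.headers.get("Content-Type", "")
# Strip any parameters such as "; charset=utf-8" before comparing.
mime = content_type.split(";", 1)[0].strip().lower()
parse_as_html = mime in ("", "text/html")
```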

                    links = ex2.findall(text) + [u[2] for u in ex.findall(text)]
                else:
                    links = [u[2] for u in ex.findall(text)]
            except UnicodeDecodeError:
martindurant (Member):

According to https://www.w3schools.com/html/html_charset.asp, although UTF-8 and ASCII are by far the most common, other encodings are allowed and still in use.

amastilovic (Author):

That exception block was already there in the original code. While I'm certainly not an expert on Python's string decoding, it seems that UnicodeDecodeError is thrown whenever a string or sequence of bytes can't be decoded according to the given encoding (https://wiki.python.org/moin/UnicodeDecodeError), so the catch seems appropriate.

martindurant (Member):

Using errors='ignore' should be the right thing to do here.
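Since aiohttp's ClientResponse.text() takes an errors argument, that change is a one-liner; a sketch:

```python
# Undecodable bytes are dropped instead of raising UnicodeDecodeError,
# so text pages in non-UTF-8 encodings are not misclassified as binary.
text = await r.text(errors="ignore")
```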

amastilovic (Author):

The stated purpose of catching the exception is to determine whether the content is binary or HTML; would ignoring errors make sense in that case?

martindurant (Member):

The data could be text, with HTML and links, but not UTF-8.
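A small illustration of that case: Latin-1 text containing links is not valid UTF-8, so a strict decode raises while a lenient one keeps the links:

```python
raw = '<a href="datos.csv">año 2024</a>'.encode("latin-1")

raw.decode("utf-8")                   # raises UnicodeDecodeError
raw.decode("utf-8", errors="ignore")  # '<a href="datos.csv">ao 2024</a>'
# The href survives the lenient decode, so the page can still be
# recognized and parsed as an HTML listing.
```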

amastilovic (Author) commented Jul 11, 2025:

> Another possible approach would be to stream-open the URL and download only the first chunk, looking for <html> or <!DOCTYPE html>. That would also give us the opportunity to set an upper limit on the size of the download even for URLs with no type.

While I agree that this approach would be preferred, I unfortunately do not see a way to read only a chunk of bytes using aiohttp.client_reqrep.ClientResponse: https://github.com/aio-libs/aiohttp/blob/v3.12.13/aiohttp/client_reqrep.py#L682

Would it be OK to keep the current approach?

martindurant (Member):

`response.content.iter_chunks()` or `.read()` with a fixed, not-too-large number of bytes.
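Both are existing aiohttp StreamReader methods; for example (the 64 KiB bound is an arbitrary choice):

```python
# Read a bounded prefix of the body...
head = await r.content.read(64 * 1024)

# ...or stream it chunk by chunk and stop early once a decision is made.
async for chunk, _end_of_http_chunk in r.content.iter_chunks():
    if b"<html" in chunk.lower():
        break
```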

amastilovic (Author):

> `response.content.iter_chunks()` or `.read()` with a fixed, not-too-large number of bytes.

OK, that worked. The unit test for isdir() failed, though, because the test-case HTML content simply lists <a href> links with no <html> or <!DOCTYPE html>. It seems that determining whether content is HTML might be slightly more complicated :-) I could use a simple regex like (<([^>]+)>), which just looks for anything between < and >, although even that might fail when the content is plain text with some <a href> links in it.

Either that, or we decide not to get into the business of parsing the content to determine whether it's HTML or not.
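For reference, a sketch of that regex and the failure mode mentioned above (the example strings are illustrative):

```python
import re

# Matches any tag-like token between < and >.
TAG_RE = re.compile(rb"<([^>]+)>")

TAG_RE.search(b'<a href="file.bin">file.bin</a>')     # match: a link listing
TAG_RE.search(b"plain text without any markup")       # None: looks non-HTML
TAG_RE.search(b"prose that mentions <a href> links")  # match: false positive
```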

* Get Content-Type from headers instead of another `.info()` call
* Use `r.text(errors="ignore")`
* Add a `test_isdir` case for when MIME type is present
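Taken together, the resulting flow inside `_ls_real` would look roughly like this sketch (variable names and the branch shape are assumptions, not the merged code):

```python
# Reuse the headers from the response _ls_real already has.
content_type = r.headers.get("Content-Type", "")
mime = content_type.split(";", 1)[0].strip().lower()
if mime and mime != "text/html":
    links = []  # explicitly typed and not HTML: skip downloading the body
else:
    # Missing or HTML Content-Type: decode leniently so non-UTF-8
    # pages still yield their links.
    text = await r.text(errors="ignore")
    links = [u[2] for u in ex.findall(text)]
```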