-
Notifications
You must be signed in to change notification settings - Fork 396
Fix HTTPFileSystem isdir downloads the whole file issue #1889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix HTTPFileSystem isdir downloads the whole file issue #1889
Conversation
Method _ls_real tries to download the whole r.text() of a link regardless of the type of HTML content. Prevent this download in all cases except when Content-Type header is not set, or it is set to text/html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible approach would be to stream-open the URL, and download the first chunk, looking for <html>
or <!DOCTYPE html>
. That would also give us the opportunity of setting an upper limit on the size of the download even for URLs with no type.
So, how about we only allow "text/html" for instant read, skip for all other types when given as you are doing, but for the case of no header, only read by chunks up to a maximum size?
fsspec/implementations/http.py
Outdated
links = [u[2] for u in ex.findall(text)] | ||
except UnicodeDecodeError: | ||
links = [] # binary, not HTML | ||
url_info = await self._info(url, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an extra call, it would be better to look at r
's headers.
links = ex2.findall(text) + [u[2] for u in ex.findall(text)] | ||
else: | ||
links = [u[2] for u in ex.findall(text)] | ||
except UnicodeDecodeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://www.w3schools.com/html/html_charset.asp , although utf8 and ascii are far dominant, other encodings are allowed and still in use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That exception block was already there in the original code. While I'm certainly not an expert on Python's string decoding, it seems like UnicodeDecodeError
is being thrown in any case where a string or sequence of bytes can't be decoded according to the given encoding: https://wiki.python.org/moin/UnicodeDecodeError so the catch seems appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using errors='ignore'
should be the right thing to do here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stated purpose of catching the exception is to determine if the content is binary or HTML, would ignoring errors make sense in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data could be text, with HTML and links, but not UTF8
While I agree that this approach would be preferred, I unfortunately do not see a way to read only a chunk of bytes using Would it be OK to keep the current approach? |
|
OK that worked. The unit test for Either that or we decide not to get into the business of parsing the content to determine if it's HTML or not. |
* Get Content-Type from headers instead of another `.info()` call * Use `r.text(errors="ignore")` * Add a `test_isdir` case for when MIME type is present
Method _ls_real tries to download the whole r.text() of a link regardless of the type of HTML content. Prevent this download in all cases except when Content-Type header is not set, or it is set to text/html