Fix HTTPFileSystem isdir downloads the whole file issue #1889


Open · wants to merge 2 commits into base: master

Conversation

amastilovic

Method `_ls_real` tries to download the whole `r.text()` of a link regardless of its content type. Prevent this download in all cases except when the Content-Type header is not set or is set to text/html.
@martindurant (Member) left a comment:


Another possible approach would be to stream-open the URL and download only the first chunk, looking for <html> or <!DOCTYPE html>. That would also give us the opportunity to set an upper limit on the size of the download even for URLs with no type.

So, how about we allow an immediate read only for "text/html", skip all other explicitly given types (as you are doing), and, when there is no header, read by chunks only up to a maximum size?
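A minimal sketch of that idea, assuming an aiohttp session is available; the helper name, the marker regex, and the 64 KiB cap are illustrative choices, not part of this PR:

```python
import re

import aiohttp

# Markers that suggest the body is an HTML page (assumed heuristic).
HTML_MARKER = re.compile(rb"<html|<!doctype\s+html", re.IGNORECASE)
MAX_SNIFF_BYTES = 64 * 1024  # upper bound on the download for untyped URLs


async def looks_like_html(session: aiohttp.ClientSession, url: str) -> bool:
    async with session.get(url) as r:
        ctype = r.headers.get("Content-Type", "")
        if ctype and not ctype.lower().startswith("text/html"):
            return False  # explicitly typed and not HTML: skip the body
        # Untyped (or HTML-typed) response: read at most MAX_SNIFF_BYTES
        # and look for the markers instead of downloading everything.
        head = await r.content.read(MAX_SNIFF_BYTES)
        return bool(HTML_MARKER.search(head))
```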

                    links = [u[2] for u in ex.findall(text)]
            except UnicodeDecodeError:
                links = []  # binary, not HTML
            url_info = await self._info(url, **kwargs)
martindurant (Member):

This is an extra call; it would be better to look at r's headers.
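For context, "look at r's headers" would mean reusing the response already in hand rather than issuing a second request; a sketch, assuming `r` is the aiohttp ClientResponse inside `_ls_real`:

```python
# No extra round trip via self._info(url): the ClientResponse already
# carries the headers from the listing request.
content_type = r.headers.get("Content-Type", "")
# Strip any parameters such as "; charset=utf-8" before comparing.
mime = content_type.split(";", 1)[0].strip().lower()
parse_as_html = mime in ("", "text/html")
```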

                    links = ex2.findall(text) + [u[2] for u in ex.findall(text)]
                else:
                    links = [u[2] for u in ex.findall(text)]
            except UnicodeDecodeError:
martindurant (Member):

According to https://www.w3schools.com/html/html_charset.asp, although UTF-8 and ASCII are by far the most common, other encodings are allowed and still in use.

amastilovic (Author):

That exception block was already there in the original code. While I'm certainly not an expert on Python's string decoding, it seems that UnicodeDecodeError is thrown whenever a string or sequence of bytes can't be decoded according to the given encoding (https://wiki.python.org/moin/UnicodeDecodeError), so the catch seems appropriate.

martindurant (Member):

Using errors='ignore' should be the right thing to do here.
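Since aiohttp's ClientResponse.text() takes an errors argument, that change is a one-liner; a sketch:

```python
# Undecodable bytes are dropped instead of raising UnicodeDecodeError,
# so text pages in non-UTF-8 encodings are not misclassified as binary.
text = await r.text(errors="ignore")
```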

amastilovic (Author):

The stated purpose of catching the exception is to determine whether the content is binary or HTML; would ignoring errors make sense in that case?

martindurant (Member):

The data could be text, with HTML and links, but not UTF-8.
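A small illustration of that case: Latin-1 text containing links is not valid UTF-8, so a strict decode raises while a lenient one keeps the links:

```python
raw = '<a href="datos.csv">año 2024</a>'.encode("latin-1")

raw.decode("utf-8")                   # raises UnicodeDecodeError
raw.decode("utf-8", errors="ignore")  # '<a href="datos.csv">ao 2024</a>'
# The href survives the lenient decode, so the page can still be
# recognized and parsed as an HTML listing.
```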

amastilovic (Author) commented Jul 11, 2025:

> Another possible approach would be to stream-open the URL and download only the first chunk, looking for <html> or <!DOCTYPE html>. That would also give us the opportunity to set an upper limit on the size of the download even for URLs with no type.

While I agree that this approach would be preferred, I unfortunately do not see a way to read only a chunk of bytes using aiohttp.client_reqrep.ClientResponse: https://github.com/aio-libs/aiohttp/blob/v3.12.13/aiohttp/client_reqrep.py#L682

Would it be OK to keep the current approach?

martindurant (Member):

`response.content.iter_chunks()` or `.read()` with a fixed, not-too-large number of bytes.
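Both are existing aiohttp StreamReader methods; for example (the 64 KiB bound is an arbitrary choice):

```python
# Read a bounded prefix of the body...
head = await r.content.read(64 * 1024)

# ...or stream it chunk by chunk and stop early once a decision is made.
async for chunk, _end_of_http_chunk in r.content.iter_chunks():
    if b"<html" in chunk.lower():
        break
```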

amastilovic (Author):

> `response.content.iter_chunks()` or `.read()` with a fixed, not-too-large number of bytes.

OK, that worked. The unit test for isdir() failed, though, because the test-case HTML content simply lists <a href> links with no <html> or <!DOCTYPE html>. It seems that determining whether content is HTML might be slightly more complicated :-) I could use a simple regex like (<([^>]+)>), which just looks for anything between < and >, although even that might fail when the content is plain text with some <a href> links in it.

Either that, or we decide not to get into the business of parsing the content to determine whether it's HTML or not.
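For reference, a sketch of that regex and the failure mode mentioned above (the example strings are illustrative):

```python
import re

# Matches any tag-like token between < and >.
TAG_RE = re.compile(rb"<([^>]+)>")

TAG_RE.search(b'<a href="file.bin">file.bin</a>')     # match: a link listing
TAG_RE.search(b"plain text without any markup")       # None: looks non-HTML
TAG_RE.search(b"prose that mentions <a href> links")  # match: false positive
```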

* Get Content-Type from headers instead of another `.info()` call
* Use `r.text(errors="ignore")`
* Add a `test_isdir` case for when MIME type is present
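Taken together, the resulting flow inside `_ls_real` would look roughly like this sketch (variable names and the branch shape are assumptions, not the merged code):

```python
# Reuse the headers from the response _ls_real already has.
content_type = r.headers.get("Content-Type", "")
mime = content_type.split(";", 1)[0].strip().lower()
if mime and mime != "text/html":
    links = []  # explicitly typed and not HTML: skip downloading the body
else:
    # Missing or HTML Content-Type: decode leniently so non-UTF-8
    # pages still yield their links.
    text = await r.text(errors="ignore")
    links = [u[2] for u in ex.findall(text)]
```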