Description
When using download_protected_document() (or other file download methods), an AttributeError: 'NoneType' object has no attribute 'group' can occur intermittently if the server returns a Content-Disposition header that doesn't match the expected regex pattern.
Environment
- Library version: 6.2.0 (latest)
- Python version: 3.12
Root Cause
The issue is in model_utils.py at lines 1399-1401, in the deserialize_file() function:

    if content_disposition:
        filename = re.search(r'filename=[\'"]?([^\'"\s]+)[\'"]?',
                             content_disposition).group(1)
        path = os.path.join(os.path.dirname(path), filename)

The code checks if content_disposition is truthy, but does not check if re.search() actually finds a match. When the header exists but doesn't contain filename= in the expected format, re.search() returns None and calling .group(1) on None raises AttributeError.
Scenarios That Can Cause This
- Content-Disposition: attachment (no filename parameter)
- Content-Disposition: attachment; filename*=UTF-8''encoded%20name (RFC 5987 encoding)
- Other non-standard header formats
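
To make the scenarios above concrete, here is a minimal sketch that runs the regex quoted from deserialize_file() against three sample header values. The header strings and the try_extract helper are illustrative assumptions, not taken from the library or from a real server response.

```python
import re

# Pattern copied from the deserialize_file() excerpt quoted in this issue
PATTERN = r'filename=[\'"]?([^\'"\s]+)[\'"]?'

# Sample header values (assumed for illustration)
headers = [
    'attachment; filename="report.pdf"',            # expected format
    'attachment',                                   # no filename parameter
    "attachment; filename*=UTF-8''encoded%20name",  # RFC 5987 encoding
]


def try_extract(header):
    """Mimic the unguarded call; return the filename or the exception name."""
    try:
        return re.search(PATTERN, header).group(1)
    except AttributeError as exc:
        return type(exc).__name__


results = {header: try_extract(header) for header in headers}
for header, outcome in results.items():
    print(f"{header!r} -> {outcome}")
```

Only the first value yields a filename. The other two leave re.search() returning None, so the chained .group(1) call fails.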
Error Message
AttributeError: 'NoneType' object has no attribute 'group'
Suggested Fix
Add a None check before calling .group():

    if content_disposition:
        match = re.search(r'filename=[\'"]?([^\'"\s]+)[\'"]?', content_disposition)
        if match:
            filename = match.group(1)
            path = os.path.join(os.path.dirname(path), filename)

Or optionally, also handle RFC 5987 encoded filenames:

    if content_disposition:
        # Try the standard filename parameter first
        match = re.search(r'filename=[\'"]?([^\'"\s]+)[\'"]?', content_disposition)
        is_encoded = False
        if not match:
            # Fall back to the RFC 5987 encoded form (filename*=)
            match = re.search(r"filename\*=(?:UTF-8''|utf-8'')([^;\s]+)", content_disposition)
            is_encoded = match is not None
        if match:
            filename = match.group(1)
            if is_encoded:
                # Percent-decode only when the RFC 5987 form was the one matched
                from urllib.parse import unquote
                filename = unquote(filename)
            path = os.path.join(os.path.dirname(path), filename)

Workaround
As a temporary workaround, users can pass _preload_content=False to file download methods to bypass the deserialization:

    response = api.download_protected_document(id=document_id, _preload_content=False)
    content = response.read()

This returns the raw urllib3.HTTPResponse object instead of going through deserialize_file().
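
If the raw response still needs to be written to disk, the filename can be derived with a guarded match. The sketch below is a hypothetical helper, not part of the library: save_raw_response and fallback_name are made-up names, and it assumes only that the response exposes .headers.get(...) and .read(), as urllib3.HTTPResponse does.

```python
import os
import re


def save_raw_response(response, directory, fallback_name="download.bin"):
    """Write a raw response body to disk, tolerating odd Content-Disposition values.

    Hypothetical helper: `response` only needs `.headers.get(...)` and `.read()`.
    """
    header = response.headers.get("Content-Disposition") or ""
    match = re.search(r'filename=[\'"]?([^\'"\s]+)[\'"]?', header)
    # Keep only the final path component so the header cannot choose a directory;
    # use the caller-supplied name when no usable filename is present.
    name = (os.path.basename(match.group(1)) if match else "") or fallback_name
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(response.read())
    return path
```

It would be called with the object returned by the workaround above, for example save_raw_response(response, target_dir), where target_dir is an existing directory chosen by the caller.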
Additional Context
This is a common bug pattern: similar issues have been reported and fixed in other libraries that parse Content-Disposition headers (e.g., gdown, requests-toolbelt).
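
As an alternative to a hand-written regex (not something this issue's suggested fix requires), the standard library's email.message.Message can parse the header, including the encoded filename*= form. This is a minimal sketch; filename_from_content_disposition is a hypothetical name, not an existing function in the library.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename from a Content-Disposition value, or None if absent."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() returns None when there is no filename parameter and
    # decodes the charset-prefixed, percent-encoded filename*= form.
    name = msg.get_filename()
    if not name:
        return None
    # Keep only the final path component so the header cannot choose a directory
    return os.path.basename(name) or None
```

A caller would update path only when this returns a non-None value, which gives the same guard as the None check in the suggested fix.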