Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# models.py
from django.db import models

class MyModel(models.Model):
    f = models.FileField()

# views.py
from io import StringIO
from django.http import HttpResponse
import tempfile
from pandas import read_csv
from .models import MyModel

def try_it(path_or_filelike_object):
    try:
        print(read_csv(
            path_or_filelike_object,
            encoding="utf-16le",
            sep="|",
        ))
    except Exception as e:
        print(e)

def failme(request):
    utf_16le_encoded = b'\xff\xfef\x00o\x00o\x00|\x00b\x00a\x00r\x00\r\x00\n\x000\x00|\x001\x00\r\x00\n\x00'
    print(utf_16le_encoded.decode("utf-16le"))

    f = tempfile.NamedTemporaryFile(mode='w+b')
    f.write(utf_16le_encoded)
    f.flush()
    f.seek(0)

    print("\n## Just use path")
    try_it(f.name)

    print("\n## A python file object")
    try_it(f)

    MyModel.objects.all().delete()
    mm = MyModel.objects.create(f="filename")
    with mm.f.open('wb+') as destination:
        destination.write(utf_16le_encoded)

    print("\n## A Django FieldFile")
    try_it(mm.f.open('rb'))

    print("\n## A Django FieldFile, wrapped in StringIO")
    try_it(StringIO(mm.f.open('rb').read().decode("utf-16le")))

    return HttpResponse("look in your console")
What is printed:
foo|bar
0|1
## Just use path
foo bar
0 0 1
## A python file object
foo bar
0 0 1
## A Django FieldFile
'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
## A Django FieldFile, wrapped in StringIO
foo bar
0 0 1
Issue Description
read_csv does not always take the provided encoding into account for file-like objects.
An example is given above for a Django FieldFile (FileField), but I suspect the issue is more general.
A small utf-16le file is fed to read_csv via a path, a regular Python file object, a Django FieldFile, and a StringIO; it fails with an encoding error in the third case only.
This seems similar to #31819, although that bug is reportedly fixed.
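As a workaround, decoding the bytes myself before handing them to read_csv sidesteps the problem, as the StringIO case above already shows. A minimal sketch of that approach (the helper name read_fieldfile_csv is just for illustration, and mm is a MyModel instance as in the example):

from io import StringIO
from pandas import read_csv

def read_fieldfile_csv(field_file, encoding="utf-16le", sep="|"):
    # Decode the raw bytes ourselves and give read_csv a text buffer,
    # so pandas never has to apply the encoding to the handle itself.
    with field_file.open("rb") as fh:
        text = fh.read().decode(encoding)
    return read_csv(StringIO(text), sep=sep)

# df = read_fieldfile_csv(mm.f)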
Expected Behavior
read_csv should respect the provided encoding for every accepted file-like object, just as it does for paths.
Installed Versions
INSTALLED VERSIONS
commit : 66e3805
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.8-76051508-generic
Version : #202112141040163950527821.10~0ede46a SMP Tue Dec 14 22:38:29 U
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None