BUG: read_csv does not take encoding into account for (e.g.) Django FieldFile #45488

Open
@vanschelven

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# models.py
from django.db import models


class MyModel(models.Model):
    f = models.FileField()

# views.py
from io import StringIO
from django.http import HttpResponse

import tempfile
from pandas import read_csv

from .models import MyModel


def try_it(path_or_filelike_object):
    # Parse the input as pipe-separated UTF-16LE; print the frame or the error.
    try:
        print(read_csv(
            path_or_filelike_object,
            encoding="utf-16le",
            sep="|",
        ))
    except Exception as e:
        print(e)


def failme(request):
    utf_16le_encoded = b'\xff\xfef\x00o\x00o\x00|\x00b\x00a\x00r\x00\r\x00\n\x000\x00|\x001\x00\r\x00\n\x00'
    print(utf_16le_encoded.decode("utf-16le"))

    f = tempfile.NamedTemporaryFile(mode='w+b')
    f.write(utf_16le_encoded)
    f.flush()
    f.seek(0)

    print("\n## Just use path")
    try_it(f.name)

    print("\n## A python file object")
    try_it(f)

    MyModel.objects.all().delete()
    mm = MyModel.objects.create(f="filename")
    with mm.f.open('wb+') as destination:
        destination.write(utf_16le_encoded)

    print("\n## A Django FieldFile")
    try_it(mm.f.open('rb'))

    print("\n## A Django FieldFile, wrapped in StringIO")
    try_it(StringIO(mm.f.open('rb').read().decode("utf-16le")))

    return HttpResponse("look in your console")

What is printed:
foo|bar
0|1


## Just use path
   foo  bar
0    0    1

## A python file object
   foo  bar
0    0    1

## A Django FieldFile
'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

## A Django FieldFile, wrapped in StringIO
   foo  bar
0    0    1

Issue Description

read_csv does not always honor the provided encoding for file-like objects.
The example above uses a Django FieldFile (the file object behind a FileField), but I suspect the issue is more general.
A small UTF-16LE file is fed to read_csv in four ways: as a path, as a regular Python file object, as a Django FieldFile, and as decoded text wrapped in StringIO. Only the third case fails, with an encoding error that names the 'utf-8' codec even though encoding="utf-16le" was passed.

This seems similar to #31819, although that bug is reportedly fixed.
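
The error text in the third case is what you get when the sample bytes are decoded as UTF-8 instead of UTF-16LE, which is consistent with the encoding argument being ignored on that code path. The following standard-library sketch shows only that the message matches; it does not show where in pandas the fallback happens. The variable names are mine.

```python
# The exact bytes from the reproducible example (BOM + "foo|bar\r\n0|1\r\n").
sample = (
    b'\xff\xfef\x00o\x00o\x00|\x00b\x00a\x00r\x00\r\x00\n\x00'
    b'0\x00|\x001\x00\r\x00\n\x00'
)

# Decoding as UTF-8 fails on the first BOM byte (0xff).
try:
    sample.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# Decoding with the encoding that was passed to read_csv succeeds.
decoded = sample.decode("utf-16le")

print(utf8_error)
print(repr(decoded))
```

Note that the "utf-16le" codec keeps the BOM as a leading U+FEFF character in the decoded text.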

Expected Behavior

read_csv should apply the provided encoding to every supported input, including binary file-like objects such as a Django FieldFile.
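
Until that is the case, the fourth case in the example (decode first, then wrap in StringIO) can be factored into a small helper. This is only a sketch of that workaround: decode_to_text_buffer and MinimalBinaryFile are hypothetical names, and MinimalBinaryFile is a stand-in for any wrapper that exposes read() returning bytes, not Django's actual class. It reads the whole file into memory, so it is only suitable for small files.

```python
import io


def decode_to_text_buffer(binary_filelike, encoding):
    """Read all bytes from a binary file-like object and return decoded text in a StringIO."""
    raw = binary_filelike.read()
    return io.StringIO(raw.decode(encoding))


class MinimalBinaryFile:
    """Hypothetical stand-in for a wrapper object that only exposes read()."""

    def __init__(self, payload):
        self._buffer = io.BytesIO(payload)

    def read(self, size=-1):
        return self._buffer.read(size)


payload = "foo|bar\r\n0|1\r\n".encode("utf-16le")
text_buffer = decode_to_text_buffer(MinimalBinaryFile(payload), "utf-16le")
text_value = text_buffer.getvalue()
```

The resulting buffer can then be passed to read_csv(text_buffer, sep="|") without an encoding argument, as in the StringIO case of the example.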

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.8-76051508-generic
Version : #202112141040163950527821.10~0ede46a SMP Tue Dec 14 22:38:29 U
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
