Usefulness of error_bad_lines when dtypes are defined #20573

Closed
@lgmoneda

Code Sample, a copy-pastable example if possible

import pandas as pd
from numpy import dtype

# Create the sample data; the second data row contains a stray comma,
# so it has 5 fields instead of 4.
with open('data.csv', 'w') as file:
    file.write("ID,X1,X2,X3\n")
    file.write("0,1,Amigo,3\n")
    file.write("1,1,Inimigo, amor,9\n")
    file.write("2,1,Cowboy,42\n")

dtypes = {"ID": dtype("int64"),
         "X1": dtype("int64"),
         "X2": dtype("O"),
         "X3": dtype("int64")}

print("Load df with no params: ", end="")
try:
    df = pd.read_csv("data.csv")
    print("Success")
except Exception as exc:
    print("Fail:", exc)

print("Load df with error bad lines: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False)
    print("Success")
except Exception as exc:
    print("Fail:", exc)

print("Load df with error bad lines and dtypes: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False, dtype=dtypes)
    print("Success")
except Exception as exc:
    print("Fail:", exc)

Problem description

error_bad_lines=False is useful for dealing with stray commas inside the data, which split a single column into two. But when dtype is also passed, read_csv appears to check the type of each column before skipping the problematic row, so the conversion fails instead of the row being dropped.

I'd argue that the row should be skipped before the dtype is checked, because in a problematic row the values are shifted into the wrong columns and their types no longer match.
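
To make the proposed ordering concrete, here is a minimal standard-library sketch (not pandas internals): rows with the wrong number of fields are dropped first, and only the remaining rows are converted. The `load_skipping_bad_rows` helper and the `CONVERTERS` mapping are hypothetical names used for illustration only.

```python
import csv
import io

# Same sample rows as in the report; the second data row has a stray comma.
RAW = (
    "ID,X1,X2,X3\n"
    "0,1,Amigo,3\n"
    "1,1,Inimigo, amor,9\n"
    "2,1,Cowboy,42\n"
)

# Hypothetical per-column converters standing in for the dtype mapping.
CONVERTERS = {"ID": int, "X1": int, "X2": str, "X3": int}


def load_skipping_bad_rows(text, converters):
    """Drop rows with the wrong field count, then convert the survivors."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows, skipped = [], []
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            # Step 1: reject the malformed row before any type conversion.
            skipped.append(line_no)
            continue
        # Step 2: only well-formed rows reach the converters.
        rows.append({name: converters[name](value)
                     for name, value in zip(header, fields)})
    return rows, skipped


rows, skipped = load_skipping_bad_rows(RAW, CONVERTERS)
print(rows)     # the two well-formed rows, with typed values
print(skipped)  # file line numbers of the rows that were dropped
```

With this ordering the converters never see the shifted values of the malformed row, which is the behavior the report asks for.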

Expected Output

The problematic row should be skipped even when dtype is passed as a parameter.
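
For readers on newer pandas, the same check can be written as follows. This is a sketch that assumes pandas 1.3 or later, where `on_bad_lines="skip"` replaced `error_bad_lines=False` (the older keyword was later removed); it reads the sample from an in-memory string rather than `data.csv`. Whether it reproduces the failure described above depends on the installed version, so treat it as a way to test the expectation rather than as a confirmed fix.

```python
import io

import pandas as pd

# Same sample rows as in the report; the second data row has 5 fields.
RAW = (
    "ID,X1,X2,X3\n"
    "0,1,Amigo,3\n"
    "1,1,Inimigo, amor,9\n"
    "2,1,Cowboy,42\n"
)

dtypes = {"ID": "int64", "X1": "int64", "X2": "object", "X3": "int64"}

# Assumption: pandas >= 1.3, where on_bad_lines is the supported keyword.
df = pd.read_csv(io.StringIO(RAW), on_bad_lines="skip", dtype=dtypes)

print(df)
print(df.dtypes)
```

If the row is skipped before the dtype conversion, the resulting frame should contain only the rows with ID 0 and 2.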

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_BR.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.14.2
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.4
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Update: I've updated my pandas version.

Labels: Error Reporting (incorrect or improved errors from pandas)