Description
Code Sample, a copy-pastable example if possible
import pandas as pd
from numpy import dtype
### Create the sample data
with open('data.csv', 'w+') as file:
    file.write("ID,X1,X2,X3\n")
    file.write("0,1,Amigo,3\n")
    file.write("1,1,Inimigo, amor,9\n")  # stray comma: this row has 5 fields instead of 4
    file.write("2,1,Cowboy,42\n")
dtypes = {"ID": dtype("int64"),
          "X1": dtype("int64"),
          "X2": dtype("O"),
          "X3": dtype("int64")}
print("Load df with no params: ", end="")
try:
df = pd.read_csv("data.csv")
print("Sucess")
except:
print("Fail")
print("Load df with error bad lines: ", end="")
try:
df = pd.read_csv("data.csv", error_bad_lines=False)
print("Sucess")
except:
print("Fail")
print("Load df with error bad lines and dtypes: ", end="")
try:
df = pd.read_csv("data.csv", error_bad_lines=False, dtype=dtypes)
print("Sucess")
except:
print("Fail")
Problem description
error_bad_lines=False is very useful for dealing with stray commas inside the data that split a single column into two. But when dtype is also specified, the parser checks the type of each column before skipping the problematic row, so the declared dtypes no longer match and the read fails.
I'd argue that the row should be skipped before the dtype check, because a problematic row is exactly the kind of row whose values won't match the declared dtypes.
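A possible workaround (just a sketch) is to let error_bad_lines drop the malformed row first and apply the dtypes afterwards with astype, using the dtypes dict from the sample above:

# Workaround sketch: skip the malformed row first, then enforce the column types.
df = pd.read_csv("data.csv", error_bad_lines=False)
df = df.astype(dtypes)
print(df.dtypes)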
Expected Output
read_csv should skip the problematic row even when dtype is passed as a parameter.
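Concretely, with the sample file above I would expect the call with dtype to behave like the dtype-less one, roughly:

df = pd.read_csv("data.csv", error_bad_lines=False, dtype=dtypes)
print(df)
#    ID  X1      X2  X3
# 0   0   1   Amigo   3
# 1   2   1  Cowboy  42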
Output of pd.show_versions()
pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.14.2
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.4
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Update: I have updated my pandas version.