Closed
Description
read_csv gives unexpected behaviour with large files if a column contains both strings and integers. eg
>>> df=DataFrame({'colA':range(500000-1)+['apple', 'pear']+range(500000-1)})
len(set(df.colA))
500001
>>> df.to_csv('testpandas2.txt')
>>> df2=read_csv('testpandas2.txt')
>>> len(set(df2.colA))
762143
>>> pandas.__version__
'0.11.0'
It seems some of the integers are parsed as integers and others as strings.
>>> list(set(df2.colA))[-10:]
['282248', '282249', '282240', '282241', '282242', '15679', '282244', '282245', '282246', '282247']
>>> list(set(df2.colA))[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]