BUG: Unexpected behaviour when reading large text files with mixed datatypes #3866

Closed
@martingoodson

Description

read_csv behaves unexpectedly on large files when a column contains both strings and integers. For example:

>>> from pandas import DataFrame, read_csv
>>> df = DataFrame({'colA': list(range(500000-1)) + ['apple', 'pear'] + list(range(500000-1))})
>>> len(set(df.colA))
500001

>>> df.to_csv('testpandas2.txt')
>>> df2=read_csv('testpandas2.txt')
>>> len(set(df2.colA))
762143

>>> import pandas
>>> pandas.__version__
'0.11.0'

It seems that some of the integers are parsed as integers and others as strings, so the same number can be counted twice (once as an int, once as a str).

>>> list(set(df2.colA))[-10:]
['282248', '282249', '282240', '282241', '282242', '15679', '282244', '282245', '282246', '282247']
>>> list(set(df2.colA))[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
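
To illustrate how a result like this could arise, here is a minimal pure-Python sketch. It assumes the parser infers types one chunk of rows at a time, which is a guess at the mechanism rather than a description of pandas internals. The helper names `infer_chunk` and `parse_column` are hypothetical, and the column is a scaled-down version of the one above.

```python
def infer_chunk(tokens):
    """Convert one chunk: ints if every token parses as int, else keep strings."""
    try:
        return [int(t) for t in tokens]
    except ValueError:
        return list(tokens)


def parse_column(tokens, chunk_size):
    """Parse a column chunk by chunk, inferring the type of each chunk separately."""
    out = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(infer_chunk(tokens[start:start + chunk_size]))
    return out


# Scaled-down column: 0..8, two strings, then 0..8 again (as raw CSV text).
raw = [str(i) for i in range(9)] + ['apple', 'pear'] + [str(i) for i in range(9)]

whole = parse_column(raw, chunk_size=len(raw))  # one pass over the full column
chunked = parse_column(raw, chunk_size=5)       # hypothetical chunk size

print(len(set(whole)))    # 11: every value is kept as a string
print(len(set(chunked)))  # 19: e.g. 4 and '3' coexist as int and str
```

Under this assumption, chunks made up only of digits become integers while chunks containing 'apple' or 'pear' stay as strings, so the number of unique values is inflated. If that is what is happening here, passing an explicit `dtype` for the column to read_csv should give consistent types.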

Labels: Bug, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label)
