read_fwf does not support dtype argument #7141

Closed
@brendon9x

Description

The documentation implies that you can pass a dtype dict to read_fwf, but in practice the option is silently dropped; it looks like dtype is only supported by the C parser. My specific use case is loading Triple-S files, which are fairly prolific in the market research world. The Triple-S standard is basically an XML file that describes a fixed-width file. It's fairly trivial to get this working in pandas in a few lines of code, which is great.
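For illustration, here is a minimal sketch of that kind of load. The field layout, column names, and sample data are hypothetical (they are not taken from a real Triple-S file), and the 1-based inclusive positions are an assumption about how the schema describes columns. Because the dtype argument may be ignored, the sketch checks the resulting dtypes and casts explicitly where they differ.

```python
import io

import pandas as pd

# Hypothetical field layout: (name, 1-based start, inclusive finish).
FIELDS = [("respondent", 1, 4), ("age", 5, 7), ("weight", 8, 12)]

# read_fwf expects 0-based, half-open (start, stop) pairs.
colspecs = [(start - 1, finish) for _, start, finish in FIELDS]
names = [name for name, _, _ in FIELDS]

# The dtypes we want, regardless of what inference decides.
wanted = {"respondent": "int64", "age": "float64", "weight": "float64"}

data = io.StringIO("0001 34 1.25\n0002 51 0.80\n")
df = pd.read_fwf(data, colspecs=colspecs, names=names, header=None, dtype=wanted)

# Do not assume dtype was honoured: compare, then cast any column that differs.
mismatched = {col: typ for col, typ in wanted.items() if str(df[col].dtype) != typ}
if mismatched:
    df = df.astype(mismatched)

print(df.dtypes)
```

The explicit cast is only a workaround: it runs after parsing, so it cannot recover information that inference has already discarded, such as leading zeros in a column that was read as integers.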

The problem arises when these files become really large. I tried chunked conversion to HDF5 using append_to_multiple, but ran into a baffling problem of certain chunks failing on append. Stepping through the code, it looked like the underlying block layouts were different per chunk. This in turn was caused by the fact that column type inference is applied per chunk and the dtypes are ignored. I suspect the difference comes from data being missing in some chunks and not in others.
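A small sketch of the suspected cause, using made-up data rather than a real survey file: the same column is inferred as integer in a chunk with no gaps and as float in a chunk with a blank field. Casting every chunk to one fixed schema before appending is one possible workaround; the `target` mapping below is an assumption chosen for this example.

```python
import io

import pandas as pd

# Four fixed-width rows; the third row has a blank "value" field.
raw = "  1 10\n  2 20\n  3   \n  4 40\n"
colspecs = [(0, 3), (3, 6)]
names = ["id", "value"]

reader = pd.read_fwf(
    io.StringIO(raw), colspecs=colspecs, names=names, header=None, chunksize=2
)
chunks = list(reader)

# Each chunk is inferred on its own, so the dtype of "value" can differ.
inferred = [str(chunk["value"].dtype) for chunk in chunks]
print(inferred)

# Workaround: force one schema on every chunk before it is appended anywhere.
target = {"id": "int64", "value": "float64"}
normalized = [chunk.astype(target) for chunk in chunks]
print([str(chunk["value"].dtype) for chunk in normalized])
```

Float is used for `value` because it is the numeric type that can hold the missing entry; an integer target would fail on the chunk containing the blank.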

The lowest-hanging fruit is to update the docs, and I'm happy to do a PR for this. But it would be awesome if the C parser could be tweaked to read fixed-width files, as this issue would go away and we'd get a huge speed boost. This initially looked hard, but adding a simpler state machine started to look possible if colspecs could be passed in. I could possibly do this as a PR, but would probably need some hand-holding. Finally, if you could point out a simple place to apply the dtype argument in the read_fwf parser, I can give it a go as a second-best PR.
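To make the idea concrete, here is a rough pure-Python sketch of slicing each line by colspecs and applying a per-column type at parse time. It is not pandas code and says nothing about how the pandas parsers are actually structured; `parse_fixed_width` and its parameters are hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


def parse_fixed_width(
    lines: Iterable[str],
    colspecs: List[Tuple[int, int]],
    names: List[str],
    dtypes: Dict[str, Callable[[str], Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per line, casting each field with its declared type."""
    for line in lines:
        row: Dict[str, Any] = {}
        for name, (start, stop) in zip(names, colspecs):
            # Slice the fixed-width field and drop the padding around it.
            raw = line[start:stop].strip()
            # Columns without a declared type are kept as strings.
            caster = dtypes.get(name, str)
            # A blank field is treated as missing rather than cast.
            row[name] = caster(raw) if raw else None
        yield row


rows = list(
    parse_fixed_width(
        ["  1 10\n", "  3   \n"],
        colspecs=[(0, 3), (3, 6)],
        names=["id", "value"],
        dtypes={"id": int, "value": float},
    )
)
print(rows)
```

Because the type is declared up front, every row (and therefore every chunk) gets the same column types no matter which values happen to be missing.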
