Description
The documentation implies that you can supply a dtype dict to read_fwf, but in reality this option is silently dropped, apparently because it is only supported by the C parser. My specific use case is loading Triple-S files, which are fairly common in the market research world. The Triple-S standard is basically an XML file that describes a fixed-width file. It's fairly trivial to get this working in pandas in a few lines of code, which is great.
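For context, here is a minimal sketch of the kind of loading code I mean. The layout list is a hypothetical stand-in for what you would extract from the Triple-S XML, and I'm assuming the positions there are 1-based and inclusive; the column names and sample data are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical variable layout, as might be extracted from a Triple-S XML
# description: (name, start, end), assumed 1-based and inclusive.
layout = [("id", 1, 4), ("age", 5, 7), ("region", 8, 9)]

# Made-up fixed-width data standing in for the real data file.
data = io.StringIO(
    "0001 34NE\n"
    "0002 51SW\n"
)

# read_fwf expects 0-based, half-open (start, end) intervals.
colspecs = [(start - 1, end) for _, start, end in layout]
names = [name for name, _, _ in layout]

df = pd.read_fwf(data, colspecs=colspecs, names=names, header=None)
print(df)
```

Note that the column types here are inferred from the data, which is where the trouble described next comes from.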
The problem arises when these files become really large. I tried chunked conversion to HDF5 using append_to_multiple but ran into a baffling problem of certain chunks failing on append. Stepping through the code, it looked like the underlying block layouts were different per chunk. This in turn was caused by the fact that column type inference is applied per chunk and the dtype argument is ignored. I suspect the inferred types differ because data is missing in some chunks and not in others.
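To illustrate the per-chunk inference, and one possible workaround when dtype is not honoured by read_fwf, here is a rough sketch that coerces every chunk to the same dtypes with astype before it would be appended. The data, column names and target dtypes are invented for the example; the HDF5 append step is left out.

```python
import io
import pandas as pd

# Made-up data: the third row has a blank "score" field.
data = io.StringIO(
    "  1 10\n"
    "  2 20\n"
    "  3   \n"
    "  4 40\n"
)
colspecs = [(0, 3), (3, 6)]
names = ["id", "score"]

# Assumed target schema; float64 is used for "score" so that
# missing values can be represented as NaN in every chunk.
target = {"id": "int64", "score": "float64"}

inferred, coerced = [], []
reader = pd.read_fwf(data, colspecs=colspecs, names=names,
                     header=None, chunksize=2)
for chunk in reader:
    # dtypes as inferred from this chunk alone
    inferred.append(chunk.dtypes.astype(str).to_dict())
    # force a consistent layout before appending to the store
    chunk = chunk.astype(target)
    coerced.append(chunk.dtypes.astype(str).to_dict())

print(inferred)
print(coerced)
```

The inferred list shows what each chunk looked like on its own, while every entry of coerced matches the target schema, which is what the store needs for a consistent block layout.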
The lowest-hanging fruit is to update the docs, and I'm happy to do a PR for this. But it would be awesome if the C parser could be tweaked to allow reading fixed-width files, as this issue would go away and we'd get a huge speed boost. This initially looked hard, but then started to look like adding a simpler state machine might be possible if colspecs could be passed in. I could possibly do this as a PR, but would probably need some hand-holding. Finally, if you could point out a simple place to apply the dtype argument in the read_fwf parser, I can give that a go as a second-best PR.
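To show why I think tokenising could be simple once colspecs are known, here is a pure-Python sketch of the idea. This is not how the C parser works today; split_fixed_width is a hypothetical helper written only for illustration.

```python
def split_fixed_width(line, colspecs):
    """Slice one line into stripped fields using half-open (start, end) spans."""
    return [line[start:end].strip() for start, end in colspecs]


colspecs = [(0, 4), (4, 7), (7, 9)]
lines = ["0001 34NE", "0002 51SW"]

# No quoting or delimiter handling is needed: each field is a fixed slice.
rows = [split_fixed_width(line, colspecs) for line in lines]
print(rows)
```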