[BUGZILLA #16615] Allow multi-character separator in scan() #6002

MichaelChirico · 2020-05-18T18:42:06Z

Overview:

The scan() function, used by the read functions and others, only allows a single byte as the separator (sep). While this is fine for simple datasets, it is problematic for complex datasets, such as those containing lengthy comments, byte dumps, or embedded newlines.

It would be very helpful to be able to specify a multi-character separator.

Steps to Reproduce: (Borrowed from the documentation for base::scan())

sep: by default, scan expects to read ‘white-space’ delimited input fields. Alternatively, sep can be used to specify a character which delimits fields. A field is always delimited by an end-of-line marker unless it is quoted. If specified this should be the empty character string (the default) or NULL or a character string containing just one single-byte character.

Actual Results: Only a single separator character can be used.
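
A minimal reproduction sketch, assuming a two-character separator "||" and an inline string chosen purely for illustration (neither appears in the original report). The error is caught so that the snippet runs to completion; the message is expected to match the one quoted in the Stack Overflow titles linked below ("invalid 'sep' value: must be one byte").

```r
# Illustrative input: three fields delimited by the two-character string "||".
input <- "a||b||c"

# scan() rejects any 'sep' longer than one byte, so wrap the call in
# tryCatch() and keep the condition object instead of aborting.
res <- tryCatch(
  scan(text = input, what = "", sep = "||", quiet = TRUE),
  error = function(e) e
)

is_error <- inherits(res, "error")
print(is_error)
if (is_error) message(conditionMessage(res))
```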

Expected (desired) Results: A separator of more than one character can be used. This is particularly important when parsing text files that contain arbitrary characters, possibly including unmatched quotes of any type. The only way to handle such files is to create a ridiculously long separator that is unlikely to appear in the data, such as sep='|%$!?MYSEPARATOR?!$%|', and work with that. Incidentally, the write functions allow multi-character separators, but their read counterparts do not (because of the scan() limitation).

Build Date & Hardware & Additional Builds and Platforms: All (Base).

Additional Information:

This issue has been challenging for many R users who work with data that has either nonstandard separators or arbitrary character content. In the case of nonstandard separators, most have resorted to pre-processing outside of R, which is a shame. In the case of arbitrary character content, however, external parsing doesn't resolve the issue, since reading the result back into R simply recreates it. Further, strsplit()/readLines() based solutions work until the data contains embedded newlines, at which point readLines() breaks a single record across several lines. The ability to set an arbitrary separator with write() already exists to address this issue. We just need the ability to read arbitrary separators as a counterpart.
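
The asymmetry described above can be sketched as follows. The data frames, temporary file paths, and variable names are hypothetical; write.table() and read.table() stand in for "the write functions" and "their read counterparts". This is only an illustration of the workaround and where it breaks, not a proposed fix.

```r
sep <- "|%$!?MYSEPARATOR?!$%|"

# Hypothetical data with an unmatched quote in one field.
df <- data.frame(id = c("id1", "id2"),
                 text = c("plain text", "has an \"unmatched quote"),
                 stringsAsFactors = FALSE)
path <- tempfile(fileext = ".txt")

# Writing with a multi-character separator is accepted.
write.table(df, path, sep = sep, quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Reading it back with the same separator is rejected.
read_err <- tryCatch(read.table(path, sep = sep), error = function(e) e)

# Workaround: read whole lines, then split each on the literal separator.
fields <- strsplit(readLines(path), sep, fixed = TRUE)
parsed <- do.call(rbind, fields)  # character matrix, one row per record

# The workaround breaks once a field contains an embedded newline:
# readLines() returns two lines for what was written as one record.
df_nl <- data.frame(id = "id3", text = "line one\nline two",
                    stringsAsFactors = FALSE)
path_nl <- tempfile(fileext = ".txt")
write.table(df_nl, path_nl, sep = sep, quote = FALSE,
            row.names = FALSE, col.names = FALSE)
n_lines <- length(readLines(path_nl))
```

Here parsed recovers both records intact, including the unmatched quote, while n_lines is 2 for a file that holds a single record.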

Here are some of the threads that I found while searching for solutions to this issue. Note that none of them address the arbitrary data parsing cases with embedded newlines:

http://stackoverflow.com/questions/18186357/importing-csv-file-with-multiple-character-separator-to-r

http://r.789695.n4.nabble.com/multiple-separators-in-sep-argument-for-read-table-td856567.html

http://stackoverflow.com/questions/7883859/how-to-read-a-text-file-into-gnu-r-with-a-multiple-byte-separator

https://www.mail-archive.com/r-help@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>-project.org/msg18035.html

http://stackoverflow.com/questions/2732397/why-the-field-separator-character-must-be-only-one-byte

http://stackoverflow.com/questions/17223844/one-byte-separator-argument-in-read-table

http://stackoverflow.com/questions/29740992/r-invalid-sep-value-must-be-one-byte