
[BUGZILLA #17546] extend readtable with a hook for column type detection #6720

Open
MichaelChirico opened this issue May 19, 2020 · 13 comments

Comments

@MichaelChirico

I would like an enhancement to read.table that bypasses the built-in type detection.
I think this is best done with a new parameter that points to a function which is called for each column and returns the converted column.
In particular, I want to read .csv files in which most columns are datetime stamps in DD/MM/YYYY HH:MM format. This is not very exotic.
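For illustration, such a per-column hook might look as follows. The colConvert parameter name comes from the patches attached below; it is not an existing read.table argument, so the read.table() call is only a sketch of the proposed API.

```r
## Hypothetical converter for the DD/MM/YYYY HH:MM stamps mentioned above.
parse_stamp <- function(x)
  as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

## Proposed (not existing) usage:
## df <- read.table("timeuse.csv", header = TRUE, sep = ",",
##                  colConvert = parse_stamp)
parse_stamp("28/03/2019 05:09")
```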

I see no real alternatives to this approach, only incomplete fragments:

  • colClasses relies on the base R implementation of as.POSIXct.
  • as.POSIXct can be overridden, but read.table from base R still uses the base R copy.
  • type.convert can be overridden, but read.table still uses the base R copy (see the sketch after this list).
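The masking problem in the last two items can be demonstrated directly: a copy of type.convert defined in the workspace is ignored, because utils::read.table() resolves the name in its own namespace.

```r
## Mask type.convert in the global environment; read.table() never calls it.
type.convert <- function(x, ...) {
  message("custom type.convert called")   # this message is never printed
  utils::type.convert(x, ...)
}

df <- read.table(text = "stamp,value\n28/03/2019 05:09,1\n28/03/2019 05:15,2",
                 header = TRUE, sep = ",")
str(df)            # 'stamp' is still character/factor, not POSIXct
rm(type.convert)   # remove the mask again
```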

The real, stupid alternative is to redefine read.table in my workspace.
I think this is not good practice.


METADATA

  • Bug author - Kurt Van Dijck
  • Creation time - 2019-03-28 05:09:53 UTC
  • Bugzilla link
  • Status - ASSIGNED
  • Alias - None
  • Component - I/O
  • Version - R 3.5.0
  • Hardware - All All
  • Importance - P5 enhancement
  • Assignee - R-core
  • URL -
  • Modification time - 2019-04-08 18:42 UTC
@MichaelChirico

Please attach your patch and then we can iterate on it.


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-03-28 05:15:06 UTC

@MichaelChirico

Created attachment 2414 [details]
readtable: add hook for type conversions per column


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 05:36:36 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2415 [details]
readtable: add test for type conversion hook 'colConvert'

This patch illustrates the use of colConvert.
I applied a test vector based on an existing test.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 05:38:01 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2416 [details]
[v2] readtable add hook

This is v2 of the patch.
It respects 'fail early and loud'.
Since the legacy conversion function has been moved into its own function,
it is available as a fallback for custom conversions, if desired.
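Roughly, the refactoring described here amounts to the following. The function name read.table.colConvert.builtin is taken from the later v3 comment, and the body is a simplification for illustration, not the actual patch code.

```r
## Simplified stand-in for the extracted legacy conversion (illustration only).
read.table.colConvert.builtin <- function(x)
  type.convert(x, as.is = TRUE, na.strings = "NA")

## A custom hook can then fall back to it by name:
my_convert <- function(x) {
  stamps <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (all(is.na(stamps))) read.table.colConvert.builtin(x) else stamps
}
```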


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 14:43:02 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2417 [details]
[v2] test for readtable

The test for the v2 patch.
Please note that the v2 patch is not yet functional; it serves more as an illustration of what I currently plan after the initial comments on v1.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 14:44:54 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2418 [details]
[v3] readtable: add hook for type conversions per column

This v3 version combines code & test in one patch.
It compiles, runs the tests, and is functional.

In my opinion, the API change is correct.
I needed to use 'global variables' to access the legacy parameters
to read.table that control the default conversion from within the separate
read.table.colConvert.builtin function.
This looks strange, but it allows the built-in function to have the same prototype.
Those details are not exported, only the result, which looks clean.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 19:58:05 UTC

INCLUDED PATCH

@MichaelChirico

There are two basic patterns to override fallback:

  1. The override returns a sentinel value (NULL might work in this case) to signal the caller to follow the fallback path. This has the advantage of being simple to implement.

  2. The override delegates to the fallback. This is a very common approach, seen in most object-oriented systems, see NextMethod() and callNextMethod() in R, for example. The advantage is that the delegation is obvious to anyone reading the code, and there is more flexibility since the override completely controls the call to the overridden function.

As you've discovered, the downside to #2 is that there is some state that has to be passed to the delegate. For example, with callNextMethod(), the name of the generic and other information is needed to find the next method (it gets those from the calling frames, which is complicated and not recommended). State should be localized to the extent possible. Global variables add a lot of complexity.

One option would be to pass a closure as a second argument to the coercion function. The coercion function would pass the data (potentially transformed) to the closure to execute the default behavior. The closure would take the place of the index argument, which would no longer be needed.
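In code, the two patterns might look roughly like this; the names and the NULL sentinel are illustrative, not the patch's actual API.

```r
## Pattern 1: sentinel -- returning NULL signals "use the built-in path".
convert_or_null <- function(x) {
  stamps <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (all(is.na(stamps))) NULL else stamps
}

## Pattern 2: delegation -- the hook receives the default conversion as a
## closure and decides whether to call it.
convert_or_delegate <- function(x, default) {
  stamps <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (all(is.na(stamps))) default(x) else stamps
}
```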


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-03-28 20:22:02 UTC

@MichaelChirico

Pattern #2, delegation to the fallback, makes the core code very
readable, and indeed deserves preference.

The downside in this situation is the legacy set of parameters that have
to be carried to the delegate, to remain compatible.
I don't understand all internals in R, but in my C experience, having a
(not so) global variable to work around this is not complicated in a
single-threaded application.

Defining a closure as an argument to the override must be coded in
read.table in order to maintain backward compatibility (I think), so
this makes the builtin default different from the override, or I don't
get the fine details of a closure :-) .

I don't feel like creating the closure in the user workspace is a
trivial thing to do either.

On the other hand, I pass the index argument for the benefit of the
coercion function. I personally have no interest in it yet, but
having it 'promises the sky' in terms of what a user could achieve.

If having the fallback as a closure would work, I'd propose three arguments:
the dataset, the index, and the closure.
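Such a three-argument hook would look roughly like this (illustrative only; the argument names are not from the patch):

```r
## x: the column data, i: the column index, default: a closure running the
## built-in conversion for that column.
convert_hook <- function(x, i, default) {
  if (i <= 2L) return(default(x))   # e.g. leave leading metadata columns alone
  as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
}
```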


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 21:31:54 UTC

@MichaelChirico

Every argument adds complexity to the contract.

I just noticed that the patch only accepts a single function for coercion. Shouldn't the user be able to specify a function for each column? The list could be recycled for convenience, and names on the list could correspond to column names.
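The per-column variant suggested here could be spelled out like this, assuming a hypothetical list-valued colConvert argument:

```r
## Named list of converters, matched to columns by name (hypothetical API).
parse_stamp <- function(x)
  as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
converters <- list(id = identity, start = parse_stamp, stop = parse_stamp)

## Proposed (not existing) usage:
## df <- read.table("timeuse.csv", header = TRUE, sep = ",",
##                  colConvert = converters)
```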


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-03-28 21:51:02 UTC

@MichaelChirico

My colleagues at MOTUS typically try to digest time-use files in .csv with

  • 1 or 2 metadata columns: id, uuid, ...
  • several hundred(!) columns containing either a datetime or a duration.

Defining a per-column conversion list for each of their files is strict but error-prone, and highly subject to copy-paste problems.

Having one function that probes one of two well-defined formats resembles the current behaviour (i.e. distinguishing numbers from factors etc.) much more closely and contributes to usability.
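Such a probing function might look like this, assuming delegation to a default converter and assuming the two formats are DD/MM/YYYY HH:MM stamps and HH:MM durations:

```r
probe_column <- function(x, default) {
  stamp <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (!all(is.na(stamp))) return(stamp)                   # datetime column
  dur <- as.difftime(x, format = "%H:%M", units = "mins")
  if (!all(is.na(dur))) return(dur)                       # duration column
  default(x)                                              # anything else
}
```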

> Every argument adds complexity to the contract.

It felt logical with my C background.
I'm indeed not well aware of that kind of concern in R.
I'll drop it.

BTW, I looked at closures, but I did not immediately get how that would solve my problem of avoiding global variables. Is there a good example that DuckDuckGo missed?
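A minimal example of the closure idea, assuming the goal is to carry read.table's legacy conversion settings without a global variable (this is not code from the patch):

```r
## The wrapper captures the legacy arguments; the returned function can then
## be handed to the hook as its fallback.
make_default_converter <- function(na.strings, dec, numerals) {
  function(x) type.convert(x, na.strings = na.strings, dec = dec,
                           numerals = numerals, as.is = TRUE)
}

default <- make_default_converter(na.strings = "NA", dec = ".",
                                  numerals = "allow.loss")
default(c("1.5", "2", "NA"))   # the captured settings travel with the closure
```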


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-29 08:51:34 UTC

@MichaelChirico

Created attachment 2420 [details]
[v4] readtable: add hook for type conversions per column

V4 of the patch replaces the index argument with a fallback closure.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-29 14:22:37 UTC

INCLUDED PATCH

@MichaelChirico

I'm having some doubts about the use case. Not sure how typical it is to have a large number of identically typed columns to convert. It might be better to write a reusable function (ideally in a package) for your colleagues that parses the relevant file format, converting after reading.
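A sketch of such a reusable "convert after reading" helper, assuming the timestamp columns are exactly those matching the DD/MM/YYYY HH:MM pattern (the name and details are illustrative, not the function that was eventually written):

```r
read_timeuse <- function(file, ...) {
  df <- read.csv(file, stringsAsFactors = FALSE, ...)
  looks_like_stamp <- function(col)
    is.character(col) &&
      all(grepl("^\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}$", col) | is.na(col))
  stamp_cols <- vapply(df, looks_like_stamp, logical(1))
  df[stamp_cols] <- lapply(df[stamp_cols], as.POSIXct,
                           format = "%d/%m/%Y %H:%M", tz = "UTC")
  df
}
```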


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-04-06 01:51:09 UTC

@MichaelChirico

Thanks for taking the time to review it.
It certainly helped in creating such a reusable function that converts after reading.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-04-08 18:42:53 UTC
