
[BUGZILLA #17546] extend readtable with a hook for column type detection #6720

Open
MichaelChirico opened this issue May 19, 2020 · 13 comments

Comments

@MichaelChirico

I would like an enhancement to read.table that bypasses the built-in type detection.
I think this is best done with a new parameter that points to a function which is called for each column and returns the converted column.
In particular, I want to read .csv files in which most columns are datetime stamps in DD/MM/YYYY HH:MM format. This is not very exotic.
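For illustration, such a per-column hook might look as follows. The colConvert parameter name comes from the patches attached below; it is not an existing read.table argument, so the read.table() call is only a sketch of the proposed API.

```r
## Hypothetical converter for the DD/MM/YYYY HH:MM stamps mentioned above.
parse_stamp <- function(x)
  as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

## Proposed (not existing) usage:
## df <- read.table("timeuse.csv", header = TRUE, sep = ",",
##                  colConvert = parse_stamp)
parse_stamp("28/03/2019 05:09")
```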

I see no real alternatives to this approach, only incomplete fragments:

  • colClasses relies on the base R implementation of as.POSIXct.
  • as.POSIXct can be overridden, but read.table from base R still uses the base R copy.
  • type.convert can be overridden, but read.table still uses the base R copy (see the sketch after this list).
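The masking problem in the last two items can be demonstrated directly: a copy of type.convert defined in the workspace is ignored, because utils::read.table() resolves the name in its own namespace.

```r
## Mask type.convert in the global environment; read.table() never calls it.
type.convert <- function(x, ...) {
  message("custom type.convert called")   # this message is never printed
  utils::type.convert(x, ...)
}

df <- read.table(text = "stamp,value\n28/03/2019 05:09,1\n28/03/2019 05:15,2",
                 header = TRUE, sep = ",")
str(df)            # 'stamp' is still character/factor, not POSIXct
rm(type.convert)   # remove the mask again
```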

The real, stupid alternative is to redefine read.table in my workspace.
I think this is not good practice.


METADATA

  • Bug author - Kurt Van Dijck
  • Creation time - 2019-03-28 05:09:53 UTC
  • Bugzilla link
  • Status - ASSIGNED
  • Alias - None
  • Component - I/O
  • Version - R 3.5.0
  • Hardware - All All
  • Importance - P5 enhancement
  • Assignee - R-core
  • URL -
  • Modification time - 2019-04-08 18:42 UTC
@MichaelChirico

Please attach your patch and then we can iterate on it.


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-03-28 05:15:06 UTC

@MichaelChirico

Created attachment 2414 [details]
readtable: add hook for type conversions per column


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 05:36:36 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2415 [details]
readtable: add test for type conversion hook 'colConvert'

This patch illustrates the use of colConvert.
I applied a test vector based on an existing test.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 05:38:01 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2416 [details]
[v2] readtable add hook

This is v2 of the patch.
It respects 'fail early and loud'.
Since the legacy conversion function has been moved into its own function,
it is available as a fallback for custom conversions, if desired.
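Roughly, the refactoring described here amounts to the following. The function name read.table.colConvert.builtin is taken from the later v3 comment, and the body is a simplification for illustration, not the actual patch code.

```r
## Simplified stand-in for the extracted legacy conversion (illustration only).
read.table.colConvert.builtin <- function(x)
  type.convert(x, as.is = TRUE, na.strings = "NA")

## A custom hook can then fall back to it by name:
my_convert <- function(x) {
  stamps <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (all(is.na(stamps))) read.table.colConvert.builtin(x) else stamps
}
```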


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 14:43:02 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2417 [details]
[v2] test for readtable

The test for the v2 patch.
Please note that the v2 patch is not yet functional; it serves more as an illustration of what I currently plan after the initial comments on v1.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 14:44:54 UTC

INCLUDED PATCH

@MichaelChirico

Created attachment 2418 [details]
[v3] readtable: add hook for type conversions per column

This v3 version combines code & test in one patch.
It compiles, runs the tests, and is functional.

In my opinion, the API change is correct.
I needed to use 'global variables' to access the legacy parameters
to read.table that control the default conversion from within the separate
read.table.colConvert.builtin function.
This looks strange, but it allows the built-in function to have the same prototype.
Those details are not exported, only the result, which looks clean.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 19:58:05 UTC

INCLUDED PATCH

@MichaelChirico

There are two basic patterns to override fallback:

  1. The override returns a sentinel value (NULL might work in this case) to signal the caller to follow the fallback path. This has the advantage of being simple to implement.

  2. The override delegates to the fallback. This is a very common approach, seen in most object-oriented systems, see NextMethod() and callNextMethod() in R, for example. The advantage is that the delegation is obvious to anyone reading the code, and there is more flexibility since the override completely controls the call to the overridden function.

As you've discovered, the downside to #2 is that there is some state that has to be passed to the delegate. For example, with callNextMethod(), the name of the generic and other information is needed to find the next method (it gets those from the calling frames, which is complicated and not recommended). State should be localized to the extent possible. Global variables add a lot of complexity.

One option would be to pass a closure as a second argument to the coercion function. The coercion function would pass the data (potentially transformed) to the closure to execute the default behavior. The closure would take the place of the index argument, which would no longer be needed.
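In code, the two patterns might look roughly like this; the names and the NULL sentinel are illustrative, not the patch's actual API.

```r
## Pattern 1: sentinel -- returning NULL signals "use the built-in path".
convert_or_null <- function(x) {
  stamps <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (all(is.na(stamps))) NULL else stamps
}

## Pattern 2: delegation -- the hook receives the default conversion as a
## closure and decides whether to call it.
convert_or_delegate <- function(x, default) {
  stamps <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (all(is.na(stamps))) default(x) else stamps
}
```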


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-03-28 20:22:02 UTC

@MichaelChirico

Pattern #2, delegation to the fallback, makes the core code very
readable, and indeed deserves preference.

The downside in this situation is the legacy set of parameters that have
to be carried to the delegate, to remain compatible.
I don't understand all internals in R, but in my C experience, having a
(not so) global variable to work around this is not complicated in a
single-threaded application.

Defining a closure as an argument to the override must be coded in
read.table in order to maintain backward compatibility (I think), so
this makes the builtin default different from the override, or I don't
get the fine details of a closure :-) .

I don't feel like creating the closure in the user workspace is a
trivial thing to do either.

On the other hand, I pass the index argument for the benefit of the
coercion function. I personally have no interest in it yet, but
having it 'promises the sky' in terms of what a user could achieve.

If having the fallback as a closure would work, I'd propose three arguments:
the dataset, the index, and the closure.
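Such a three-argument hook would look roughly like this (illustrative only; the argument names are not from the patch):

```r
## x: the column data, i: the column index, default: a closure running the
## built-in conversion for that column.
convert_hook <- function(x, i, default) {
  if (i <= 2L) return(default(x))   # e.g. leave leading metadata columns alone
  as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
}
```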


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-28 21:31:54 UTC

@MichaelChirico

Every argument adds complexity to the contract.

I just noticed that the patch only accepts a single function for coercion. Shouldn't the user be able to specify a function for each column? The list could be recycled for convenience, and names on the list could correspond to column names.
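The per-column variant suggested here could be spelled out like this, assuming a hypothetical list-valued colConvert argument:

```r
## Named list of converters, matched to columns by name (hypothetical API).
parse_stamp <- function(x)
  as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
converters <- list(id = identity, start = parse_stamp, stop = parse_stamp)

## Proposed (not existing) usage:
## df <- read.table("timeuse.csv", header = TRUE, sep = ",",
##                  colConvert = converters)
```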


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-03-28 21:51:02 UTC

@MichaelChirico

My colleagues at MOTUS typically try to digest time-use files in .csv with

  • 1 or 2 metadata columns: id, uuid, ...
  • several hundred(!) columns containing either a datetime or a duration.

Defining a per-column conversion list for each of their files is strict but error-prone, and highly subject to copy-paste problems.

Having one function that probes one of two well-defined formats resembles the current behaviour (i.e. distinguishing numbers from factors etc.) much more closely and contributes to usability.
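Such a probing function might look like this, assuming delegation to a default converter and assuming the two formats are DD/MM/YYYY HH:MM stamps and HH:MM durations:

```r
probe_column <- function(x, default) {
  stamp <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
  if (!all(is.na(stamp))) return(stamp)                   # datetime column
  dur <- as.difftime(x, format = "%H:%M", units = "mins")
  if (!all(is.na(dur))) return(dur)                       # duration column
  default(x)                                              # anything else
}
```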

> Every argument adds complexity to the contract.

It felt logical with my C background.
I'm indeed not well aware of that kind of concern in R.
I'll drop it.

BTW, I looked at closures, but I did not immediately get how that would solve my problem of avoiding global variables. Is there a good example that DuckDuckGo missed?
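A minimal example of the closure idea, assuming the goal is to carry read.table's legacy conversion settings without a global variable (this is not code from the patch):

```r
## The wrapper captures the legacy arguments; the returned function can then
## be handed to the hook as its fallback.
make_default_converter <- function(na.strings, dec, numerals) {
  function(x) type.convert(x, na.strings = na.strings, dec = dec,
                           numerals = numerals, as.is = TRUE)
}

default <- make_default_converter(na.strings = "NA", dec = ".",
                                  numerals = "allow.loss")
default(c("1.5", "2", "NA"))   # the captured settings travel with the closure
```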


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-29 08:51:34 UTC

@MichaelChirico

Created attachment 2420 [details]
[v4] readtable: add hook for type conversions per column

V4 of the patch replaces the index argument with a fallback closure.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-03-29 14:22:37 UTC

INCLUDED PATCH

@MichaelChirico

I'm having some doubts about the use case. Not sure how typical it is to have a large number of identically typed columns to convert. It might be better to write a reusable function (ideally in a package) for your colleagues that parses the relevant file format, converting after reading.
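A sketch of such a reusable "convert after reading" helper, assuming the timestamp columns are exactly those matching the DD/MM/YYYY HH:MM pattern (the name and details are illustrative, not the function that was eventually written):

```r
read_timeuse <- function(file, ...) {
  df <- read.csv(file, stringsAsFactors = FALSE, ...)
  looks_like_stamp <- function(col)
    is.character(col) &&
      all(grepl("^\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}$", col) | is.na(col))
  stamp_cols <- vapply(df, looks_like_stamp, logical(1))
  df[stamp_cols] <- lapply(df[stamp_cols], as.POSIXct,
                           format = "%d/%m/%Y %H:%M", tz = "UTC")
  df
}
```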


METADATA

  • Comment author - Michael Lawrence
  • Timestamp - 2019-04-06 01:51:09 UTC

@MichaelChirico

Thanks for taking the time to review it.
It certainly helped in creating such a reusable function that converts after reading.


METADATA

  • Comment author - Kurt Van Dijck
  • Timestamp - 2019-04-08 18:42:53 UTC
