Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readLines with fread #1518

Closed
wants to merge 2 commits into from
Closed

Conversation

dselivanov
Copy link

It is tricky to use fread as replacement for readLines for reading non-tabular raw text data. In fact I realised that I actually can do that only after creating this PR.
Consider following example where I want to read text without splitting it into data.frame/data.table

txt <- fread('a,b\n ab,cd,ce\n abcdef\n hjkli \n')
# Error in fread("a,b\n ab,cd,ce\n abcdef\n hjkli \n") : 
#  Expected sep (',') but new line, EOF (or other non printing character) ends field 0 when detecting types (   first):  abcdef
#In addition: Warning message:
#In fread("a,b\n ab,cd,ce\n abcdef\n hjkli \n") :
#  Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: a,b

Fortunately (and I realised that only after browsing fread.c source code) we can pass sep = "\n".

txt <- fread('a,b\n ab,cd,ce\n abcdef\n hjkli \n', sep = '\n')
#        a,b
#1: ab,cd,ce
#2:   abcdef
#3:    hjkli

Note, that we received "hjkli", not " hjkli ", because strip.white = TRUE by default. This also can be misleading.

And the last - most painful. It is not possible to do trick above on windows, where EOL = '\r\n':

fread("a,b\n ab,cd,ce\n abcdef\n hjkli \n", sep = "\r\n")
#Error in fread("a,b\n ab,cd,ce\n abcdef\n hjkli \n", sep = "\r\n") : 
#  'sep' must be 'auto' or a single character

@arunsrinivasan
Copy link
Member

Hm, Hash Sum mismatch error in travis-ci.. strange.

Looks nice, could you please remove irrelevant (spacing) lines from the PR so that the actual changes are easily visible to have a look at? Thanks.

@dselivanov dselivanov force-pushed the fread_lines branch 3 times, most recently from a50445e to 23fbe0f Compare February 5, 2016 06:51
@dselivanov
Copy link
Author

see updated PR. Do you mind if I will open separate PR for syntax/readability refinements?

@arunsrinivasan
Copy link
Member

@dselivanov I've not yet had time to look at the PR, sorry.

On syntax/readability, I'd suggest starting a new thread/issue so as to develop/write data.table package coding style (has been on my internal to-do for quite some time now :-( ).

@MichaelChirico
Copy link
Member

this is cool. i'd been using sep = "^" for this but always anxious that there will be a ^ hidden in my file one time. good to have a robust alternative.

@dselivanov
Copy link
Author

@arunsrinivasan see #1532.

@arunsrinivasan
Copy link
Member

@dselivanov just had a look. Nice! Could you explain in words about sep = NULL in fread.Rd? Seems to be the only thing that remains.

@MichaelChirico
Copy link
Member

Could also be useful to have some benchmarks to compare performance vis-a-vis readLines... just not sure what the proper medium for putting those benchmarks is. On the wiki somewhere? FAQ?

@dselivanov
Copy link
Author

@arunsrinivasan sure, will update PR soon. Another question - in case of sep = NULL should we still return data.table/data.frame with single column or just plain character?

@jangorecki
Copy link
Member

@dselivanov fread has data.table argument which could be used in that context, when data.table=FALSE then a vector could be returned.

@arunsrinivasan
Copy link
Member

@dselivanov same behaviour as always. Default is data.table. data.table=FALSE returns data.frame (1-col).

On second thought, sep=NULL is quite confusing. Would it be possible to have the same functionality with sep="\n" instead? The logic to detect if eol is \r\n or \n is already there.. so sep="\n" could just use it?

Also, sep="\n" could also be added as the last separator in the auto detection phase.

@dselivanov
Copy link
Author

@arunsrinivasan sure, we can check for "\n". For me sep = NULL looks very natural:

  1. sep = NULL - do not split/parse
  2. sep = NA_character_ or sep = "auto" - don't know separator, try to guess about it.
  3. sep = any_single_character - know separtor in advance, don't look for it.

Another option is to introduce new function, say, fread_lines which will return plain character and will have only a very few arguments.

@arunsrinivasan
Copy link
Member

sep="\n" looks natural to me. Why does sep=NULL look more natural to you? Not sure what you mean by split/parse.. it still has to take care of embedded \ns, isn't it?

@arunsrinivasan
Copy link
Member

Oh I see, sep=NULL reads the entire line as such (is your point). Hm, makes sense.

@arunsrinivasan arunsrinivasan removed this from the v1.9.8 milestone Mar 14, 2016
@arunsrinivasan
Copy link
Member

I'm not yet sure about this. Will come back later.

@MichaelChirico
Copy link
Member

Could you file an issue for this? I'd like to refer to an issue number rather than a PR number.

@dselivanov
Copy link
Author

@MichaelChirico feel free to file issue yourself if you really need it.

@dselivanov
Copy link
Author

Can I do anything else to improve this PR, so it could be merged?

@arunsrinivasan
Copy link
Member

arunsrinivasan commented May 13, 2016

@dselivanov sorry for the delay. I've given it some thought. I still find sep="\n" to be more natural. Of course this implicitly assumes "\r\n" if necessary. Would it be possible to make this change?

To me, sep=NULL seems appropriate for the behaviour where the entire file is read in as a character vector of length=1, which is not a particularly useful feature, but just mentioning it to explain what behaviour comes to my mind when I think of sep=NULL.

@dselivanov
Copy link
Author

dselivanov commented May 13, 2016

@arunsrinivasan ok agree with your point about sep=NULL. I will add sep="\n" in documentation. Internally it will still work with sep=NULL (so auto detection of \r\n will work). Will this be appropriate?

@arunsrinivasan
Copy link
Member

arunsrinivasan commented May 13, 2016

Okay. As long as sep=NULL errors (current behaviour), it's fine.

@dselivanov
Copy link
Author

See updated PR. Sorry for long update, totally forgot about it...

man/fread.Rd Outdated
@@ -15,14 +15,14 @@ skip=0L, select=NULL, drop=NULL, colClasses=NULL,
integer64=getOption("datatable.integer64"), # default: "integer64"
dec=if (sep!=".") "." else ",", col.names,
check.names=FALSE, encoding="unknown", quote="\"",
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
strip.white = ifelse(sep == "\n", FALSE, TRUE), fill=FALSE, blank.lines.skip=FALSE, key=NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ifelse could be more useful for vectorized input, here strip.white = sep!="\n" or strip.white = !identical(sep,"\n") should be sufficient, isn't it? the latter one if we allow 0 length sep argument.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

@jangorecki
Copy link
Member

jangorecki commented Jul 21, 2016

does it wait for anything? assuming feedback was well addressed we could have this feature as it is pretty useful. @mattdowle @arunsrinivasan
edit: will launch check build after rebase due to change in dependencies (chron)

@dselivanov
Copy link
Author

I only kindly ask to merge it before 1.9.8 CRAN release :-).

21 июля 2016 г. 9:44 PM пользователь "Jan Gorecki" notifications@github.com
написал:

does it wait for anything? assuming feedback was well addressed we could
have this feature as it is pretty useful. @mattdowle
https://github.com/mattdowle @arunsrinivasan
https://github.com/arunsrinivasan


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1518 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AE4u3cWggBU9O8HF-rrvkkYa5BRkszUoks5qX74YgaJpZM4HTjIL
.

@arunsrinivasan
Copy link
Member

@dselivanov if you could rebase and resolve the conflicts, I'll take a quick look and merge it in.

@jangorecki
Copy link
Member

BTW. IMO sep="\n"is not a perfect option. If that feature is about to be used as readLines then there is just a single field in each line, so no separation really. \n refers to new line, which isn't another field, but another row. sep should refers to separation between fields in a row.

@arunsrinivasan
Copy link
Member

@jangorecki EOF is always an implicit separator.

@jangorecki
Copy link
Member

Makes more sense, so it is consistent to how strsplit would handle that. Looks good

@dselivanov
Copy link
Author

I will try to resolve conflicts and update PR.

@dselivanov
Copy link
Author

@arunsrinivasan rebased now.

@@ -9108,6 +9108,12 @@ test_values <- c(53L, 53L, 52L, 1L, 52L, 1L, 1L,

test(1702, isoweek(test_cases), test_values)

#
lines <- fread("a,b\n ab,cd,ce\n abcdef\n hjkli \n", sep = "\n", header = T)[[1]]
test(1591.1, lines, c(" ab,cd,ce", " abcdef", " hjkli ") )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test numbers can be adjusted to avoid duplicated id of tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. I have amended PR.

strip.white = !identical(sep,"\n"),
fill = FALSE, blank.lines.skip = FALSE, key = NULL,
showProgress = getOption("datatable.showProgress"),
data.table = getOption("datatable.fread.datatable")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is more readable now 👍

src/fread.c Outdated
@@ -798,13 +802,16 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr
// Auto detect separator, number of fields, and location of first row
// ********************************************************************************************
const char *seps;
if (isNull(separg)) {
if(!isNull(separg)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally, this could have one extra space after if

Merge remote-tracking branch 'upstream/master' into fread_lines

# Conflicts:
#	R/fread.R
#	inst/tests/tests.Rraw
@dselivanov
Copy link
Author

dselivanov commented Sep 25, 2016

merge upstream to my PR branch.
@jangorecki I replaced your recent addition withfile argument to file=NULL. It is not a good practice to have arguments without default value after arguments with default values.

@codecov-io
Copy link

Current coverage is 90.66% (diff: 100%)

Merging #1518 into master will increase coverage by 0.03%

@@             master      #1518   diff @@
==========================================
  Files            58         58          
  Lines         10697      10702     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           9695       9703     +8   
+ Misses         1002        999     -3   
  Partials          0          0          

Powered by Codecov. Last update 5959d67...42a02a9

@dselivanov
Copy link
Author

@arunsrinivasan @jangorecki anything else I can do?

@arunsrinivasan
Copy link
Member

Sorry, took a break. Seems fine but might be hard for this release as Matt seems to have almost wrapped up checking against dependent packages. Tagging @mattdowle just in case.

@dselivanov
Copy link
Author

I don't think this can be easily merged - logic of fread moved forward a lot - closing.

@dselivanov dselivanov closed this Sep 24, 2017
mattdowle added a commit that referenced this pull request Oct 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants