na.strings is too literal when column is quoted on file #2586

HughParsonage · 2018-01-23T22:21:37Z

# Min reprex

na_string <- "x y z"
out <- fread('A,B,C\nD,"x y z",E', na.strings = na_string, verbose = TRUE, sep = ",")
anyNA(out$B)
#> [1] FALSE
### yet
na_string %chin% out$B
#> [1] TRUE

Using na_string <- '"x y z"' will get the right answer, but that was a bit difficult to deduce.

Output of verbose output:

Input contains a \n or is "". Taking this to be text input (not a filename)
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<"x y z">>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  `input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<A,B,C>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ','
  sep=','  with 2 lines of 3 fields using quote rule 0
  Detected 3 columns on line 1. This line is either column names or first data row. Line starts as: <<A,B,C>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 3
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 1 because (17 bytes from row 1 to eof) / (2 * 17 jump0size) == 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (bool8) on row 2
  Type codes (jump 000)    : A1A  Quote rule 0
  =====
  Sampled 2 rows (handled \n inside quoted fields) at 1 jump points
  Bytes from first data row on line 1 to the end of last row: 17
  Line length: mean=8.50 sd=3.54 min=6 max=11
  Estimated number of rows: 17 / 8.50 = 2
  Initial alloc = 2 rows (2 + 0%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  All rows were sampled since file is small so we know nrow=2 exactly
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : A1A
[10] Allocate memory for the datatable
  Allocating 3 column slots (3 - 0 dropped) with 2 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=11
Read 1 rows x 3 columns from 17 bytes file in 00:00.008 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         2 : string    'A'
=============================
   0.002s ( 19%) Memory map 0.000GB file
   0.005s ( 62%) sep=',' ncol=3 and header detection
   0.001s (  6%) Column type detection using 2 sample rows
   0.000s (  6%) Allocation of 2 rows x 3 cols (0.000GB) of which 1 ( 50%) rows used
   0.000s (  6%) Reading 1 chunks of 1.000MB (123361 rows) using 1 threads
   =    0.000s (  0%) Finding first non-embedded \n after each jump
   +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s (  6%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.008s        Total

data.table version:

data.table 1.10.5 IN DEVELOPMENT built 2018-01-23 00:48:08 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com

# Output of sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hutils_0.9.0.9    data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.3  magrittr_1.5    tools_3.4.3     yaml_2.1.16     fastmatch_1.1-0 knitr_1.18     
[7] reprex_0.1.1

The text was updated successfully, but these errors were encountered:

mattdowle · 2018-03-03T03:03:21Z

Please check that documentation edit makes sense. If so, maybe quotes should be disallowed from inside na.strings values. You shouldn't really be able to do what you did above! Is there a real use-case that needs quoted NA values to be recognized? Maybe if the separator is in the NA value or something crazy like that? (If sep is present in na.strings= that should be disallowed too, thinking about it.)

HughParsonage · 2018-03-03T03:29:49Z

Yes, this came about from a real use-case: example was just meant to be minimal.

For example, in the attached file, Age not report was intended to be a missing value:

library(magrittr)
na <- "Age not report"
fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% anyNA(age_5y)
#> [1] FALSE

fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% any(age_5y == na)
#> [1] TRUE

Age-not-report-small.txt

HughParsonage · 2018-03-03T04:04:27Z

Strangely, though readr::read_csv does get it right, it takes a long time to read in:

system.time(readr::read_csv("Age-not-report-small.txt", na = "Age not report"))
Parsed with column specification:
cols(
  age_5y = col_character()
)
   user  system elapsed 
   1.15    0.00    1.15

mattdowle · 2018-03-03T05:54:12Z

Ok thanks. I think I get it now then. So it's when the source system uses some special string in a string column to represent NA, and then quotes every field when writing the csv. Hence the quotes around the special string.

MichaelChirico · 2018-05-02T02:45:25Z

Just came across the same:

fread('a,b\n1,"-"', na.strings = '-')
#    a b
# 1: 1 -

# compared to
fread('a,b\n1,"-"', na.strings = '"-"')
#    a  b
# 1: 1 NA

IMO they should return the same output -- I don't think I should need to keep track of the quoting rule used in my file unless absolutely necessary. Here, fread correctly determines that field is surrounded by " and understands the contents of the field to be between "", so na.strings should apply to the content of the field.

@mattdowle exactly right, and I think it's quite common for CSV writers to just quote everything by default. My use case has - / - as the NA string...

st-pasha · 2018-05-02T16:23:27Z

@MichaelChirico My understanding is that current behavior is not an omission: it was implemented specifically with the intention to disambiguate NA strings versus "NA" strings. For example, in a file like this

key,value
1,"orange"
2,""
3,
4,"apple"

the second row is an empty string, while 3rd row is an NA string.

Similarly, in this file (which must be parsed with na_strings="NA"):

index,state
93456,"CA"
72001,"NA"
14829,NA

the second row has 2-character string "NA", whereas the 3rd row contains a missing (NA) string.

Now of course your use case is just as valid as the ones presented here. I'm just pointing out that the current behavior is not a clear-cut "bug": it is the way it is by design.

The real question then is whether the current design is a good one, or where do we go from here?

The simplest approach is not to do anything: declare the current behavior correct and close the issue;
Another approach would be to treat both "NA" and NA as an NA-string, as the OP suggests;
Another possibility is to treat "NA" as an NA-string only if there are no unquoted NAs in the same column, however if there are then "NA" becomes a 2-character string;
Or perhaps we treat both "NA" and NA as 2-character strings (not NAs) regardless of the na_strings setting -- this way a character column produced by fread will never contain an NA value.
any other?

The important use-case to consider is that of an empty string (which is the default na_strings setting). In a file like this:

A,B,C
1,foo,2
3,,4
,bar,5

should we consider the column B to contain ["foo", "", "bar"] or ["foo", NA, "bar"]. In my view the first is a more reasonable interpretation, but of course opinions may vary...

HughParsonage · 2018-05-02T17:01:41Z

I agree that there is not a straightforward fix. However, my view is that when every entry in a column is quoted, the na.strings argument does not need to be. For example

A,"B"
1,"y",
2,"x",
3,"NA"

The last value should be regarded as missing if na.strings = "NA". So this fits in with your third bullet point (though I'm actually advocating for a more narrow change: only if all the values are quoted should `"NA" be treated as an NA-string).

I do think that if na.strings contains a string s then s %in% v should be FALSE for all v, and if the parser cannot honour this, there should be a warning. For me at least, fread is far better than a text editor to view files, and so I use it to modify na.strings as required, using the values as parsed. In the file that motivated this issue, the string "Age not report" was undocumented and occurred about 70 million rows down and 120 columns across, so the only plausible way I was ever going to detect it was to use fread and unique on the column. But since fread had stripped away the quotes, I saw Age not report rather than "Age not report".

MichaelChirico · 2018-05-02T17:46:37Z

Agree w Hugh here. we shouldn't close the door to meticulous handling of more ambiguous cases but my feeling is cases like mine will be far more common. Especially in the special set of cases when a column is fully properly quoted for a given field on _every_ row, where i think it's easy to agree about the right behavior

…

On Thu, May 3, 2018, 1:01 AM HughParsonage ***@***.***> wrote: I agree that there is not a straightforward fix. However, my view is that when every entry in a column is quoted, the na.strings argument does not need to be. For example A,"B" 1,"y", 2,"x", 3,"NA" The last value should be regarded as missing if na.strings = "NA". So this fits in with your third bullet point (though I'm actually advocating for a more narrow change: only if *all* the values are quoted should `"NA" be treated as an NA-string). I do think that if na.strings contains a string s then s %in% v should be FALSE for all v, and if the parser cannot honour this, there should be a warning. For me at least, fread is far better than a text editor to view files, and so I use it to modify na.strings as required, using the values as parsed. In the file that motivated this issue, the string "Age not report" was undocumented and occurred about 70 million rows down and 120 columns across, so the only plausible way I was ever going to detect it was to use fread and unique on the column. But since fread had stripped away the quotes, I saw Age not report rather than "Age not report". — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2586 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHQQdWcpmZfG0f5_hYA1FyBynZX86hg5ks5tueZ4gaJpZM4RqaXA> .

st-pasha · 2018-05-02T17:46:48Z

So suppose we have the following 4 test files:

file1.txt:     file2.txt:      file3.txt:      file4.txt:

i,value        i,"value"       i,value         i,value
1,foo          1,"foo"         1,"foo"         1,"foo"
2,bar          2,"bar"         2,"bar"         2,"bar"
3,             3,""            3,              3,""
4,baz          4,"baz"         4,"baz"         4,baz

What do you think is the ideal way to read each of these files with each of the following commands?

fread("fileN.txt", na.strings="")
fread("fileN.txt", na.strings="baz")
fread("fileN.txt")

Once we have an agreement on what the "right" way is, we can try to figure out what logic can implement it.

MichaelChirico · 2018-05-03T03:05:22Z

Assuming default na.strings is "NA". First pass I give:

File	`na.strings=""`	`na.strings="baz"`	`na.strings="NA"`
`file1.txt`	`c('foo','bar',NA,'baz')`	`c('foo','bar','',NA)`	`c('foo','bar','','baz')`
`file2.txt`	`c('foo','bar',NA,'baz')`	`c('foo','bar','',NA)`	`c('foo','bar','','baz')`
`file3.txt`	`c('foo','bar','','baz')`	`c('foo','bar','',NA)`	`c('foo','bar','','baz')`
`file4.txt`	`c('foo','bar',NA,'baz')`	`c('foo','bar','',NA)`	`c('foo','bar','','baz')`

file3 is the one that gives me pause. I gave this answer since na.strings = NA_character_ can allow user to set this as NA if they want it to be. I don't give much weight to whether header row is/isn't quoted so file2/file4 are the same to me.

st-pasha · 2018-05-07T20:36:46Z

After some more thought, I'm inclined to side with @HughParsonage here: if the user says na.strings="something", then there is a clear expectation that all occurrences of "something" will be replaced with NAs, regardless of whether they were quoted or not. In other words, there must be no "something" values in the output.

This includes file3.txt with na.strings="" too: it should be parsed exactly as file1, file2 or file4. (Easier to implement/document too!).

An altogether separate issue is what the default value of na.strings should be. Currently it's "NA", but there is a note in NEWS.md that this will be changed to "" in the future. I believe that the default parse of file1, file2 and file4 should be c('foo','bar','','baz'), whereas for file3 it could be c('foo','bar',NA,'baz') if we really want to (see #2217 for example). However this would require that the default value of na.strings was not "" but some __auto__ value that indicates the need for special handling. Note also #2100 which asks to make na.strings even more automatic.

MichaelChirico · 2018-05-08T03:22:45Z

That sounds good to me. I think changing the default to "" will probably break a lot of code, maybe better to jump right to __auto__-like behavior, depending on the implementation difficulty.

jaapwalhout · 2018-05-08T08:19:41Z

Relevant question on StackOverflow (I think):

Behaviour of fread for quoted character columns in version 1.11.0

The current behavior in 1.11.0 leads apparently to confusion. If it's not too much effort, it might be worth considering to include a fix in 1.11.2?

st-pasha · 2018-05-08T17:26:31Z

Note: there are also reports that the current workaround (providing NA strings with quotes) is not reliable either:

fread('
  c1,  c2,  c3,    c4
  a,   b,   c,     d
  nan, inf, "inf", "nan"
', na.strings = c('inf', 'nan', '"inf"', '"nan"'))

#        c1     c2     c3     c4
#    <char> <char> <char> <char>
# 1:      a      b      c      d
# 2:   <NA>   <NA>    inf    nan

ray-p144 · 2018-05-10T00:11:07Z

Something else to consider is that with the python library pandas, writing to a csv file with the quoting=csv.QUOTE_NONNUMERIC will always write "" in the field for missing values (even for numeric columns).

I don't believe the python/pandas way is "correct" but data.table 1.10.4-3 is able to handle it by specifying na.strings = c('', '""'), while data.table 1.11.2 is not. For small files, manually looping over the character columns and replacing "" with NA isn't too much of a hassle, but for people who switch between python and R a lot, being able to read in files written with pandas would be very helpful.

So recreating this table:

library(data.table)
df <- data.table(i = c(1.0, NA, 3.0, 4.0), value = c("foo", "bar", NA, "baz"))
df
#     i value
# 1:  1   foo
# 2: NA   bar
# 3:  3  <NA>
# 4:  4   baz
str(df)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : num  1 NA 3 4
#  $ value: chr  "foo" "bar" NA "baz"
#  - attr(*, ".internal.selfref")=<externalptr>

Using the following python code:

import pandas as pd
import csv
df = pd.DataFrame({'i': [1.0, None, 3.0, 4.0], 'value': ['foo', 'bar', None, 'baz']})
print(df)
#      i value
# 0  1.0   foo
# 1  NaN   bar
# 2  3.0  None
# 3  4.0   baz

And writing it to a txt file with:

df.to_csv('file.txt', index=False, quoting=csv.QUOTE_NONNUMERIC)

Results in this file:

"i","value"
1.0,"foo"
"","bar"
3.0,""
4.0,"baz"

With data.table 1.10.4-3 I am able to read it back in to match the original R table using na.strings = c('', '""'):

library(data.table)
df <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('', '""'))
df
#     i value
# 1:  1   foo
# 2: NA   bar
# 3:  3    NA
# 4:  4   baz
str(df)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : num  1 NA 3 4
#  $ value: chr  "foo" "bar" NA "baz"
#  - attr(*, ".internal.selfref")=<externalptr>

While using na.strings = c('') with data.table 1.10.4-3 results in the column being read in as a character:

library(data.table)
df <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c(''))
df
#      i value
# 1: 1.0   foo
# 2:  NA   bar
# 3: 3.0    NA
# 4: 4.0   baz
str(df)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : chr  "1.0" NA "3.0" "4.0"
#  $ value: chr  "foo" "bar" NA "baz"
#  - attr(*, ".internal.selfref")=<externalptr>

With data.table 1.11.2, there isn't a way to read the file in to match the original table. It always treats i as numeric and leaves the empty string in value:

library(data.table)
df1 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c(''))
df2 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('', '""'))
df3 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('""'))
identical(df1, df2)
# TRUE
identical(df1, df3)
# TRUE
df1
#     i value
# 1:  1   foo
# 2: NA   bar
# 3:  3      
# 4:  4   baz
str(df1)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : num  1 NA 3 4
#  $ value: chr  "foo" "bar" "" "baz"
#  - attr(*, ".internal.selfref")=<externalptr>

Brian-D-Hobbs · 2019-04-08T15:22:04Z

I am continuing to have problems with quoted NA strings in version 1.12.2 where the same quoted NA string ("-" in my case) is differentially treated depending on whether the field is read as integer, numeric, or character.

Consider the following example:

dt1 <- fread('"A","B","C"\n"1","b","1.2"\n"-","-","-"', na.strings="-")
dt1
#    A B   C
# 1: 1 b 1.2
# 2: - -   -
dt2 <- fread('"A","B","C"\n"1","b","1.2"\n"-","-","-"', na.strings='"-"')
dt2
#     A B   C
# 1:  1 b 1.2
# 2: NA -  NA
sapply(dt2, class)
#           A           B           C
#   "integer" "character"   "numeric"

Is this expected behavior?

If so, how can I circumvent this behavior to have all "-" recognized as NA irrespective of the class of the field?

To give an idea of the context of this issue, I am working with a lab manifest that has a mix of sample IDs, comments, and measurements which are all reported in a fully quoted csv. Running sed or another command line tool is not an option to replace the "-" prior to fread since some of the sample IDs contain the "-" character.

Dheinny · 2020-07-04T19:38:44Z

Yes, this came about from a real use-case: example was just meant to be minimal.

For example, in the attached file, Age not report was intended to be a missing value:
library(magrittr)
na <- "Age not report"
fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% anyNA(age_5y)
#> [1] FALSE

fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% any(age_5y == na)
#> [1] TRUE
Age-not-report-small.txt

Hey, @HughParsonage , did you get any solution for this case?
Thanks in advance

eliocamp · 2023-12-08T15:24:06Z

Came to report this same issue in a real use case. Here "no se midió" should be the NA value, but because columns are quoted it doesn't work as expected. read.csv does work.

url <- "https://ciam.ambiente.gob.ar/dt_csv.php?dt_id=372"
data.table::fread(url, sep = ";", na.strings = "no se midió") |> 
  _$escher_coli_nmp_100ml |> 
  head()
#> [1] "no se midió" "no se midió" "no se midió" "no se midió" "no se midió"
#> [6] "no se midió"

data.table::fread(url, sep = ";", na.strings = '"no se midió"') |> 
  _$escher_coli_nmp_100ml |> 
  head()
#> [1] NA NA NA NA NA NA


read.csv(url, sep = ";", na.strings =  "no se midió") |> 
  _$escher_coli_nmp_100ml |> 
  head()
#> [1] NA NA NA NA NA NA

^{Created on 2023-12-08 with reprex v2.0.2}

HughParsonage added bug fread labels Jan 23, 2018

mattdowle added this to the v1.10.6 milestone Mar 3, 2018

mattdowle mentioned this issue Mar 3, 2018

na.strings getOption() added so ,, can be read as NA by default in future #2652

Merged

2 tasks

mattdowle added a commit that referenced this issue Mar 3, 2018

Updated ?fread to resolve #2586

d58f73c

mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018

Rdatatable deleted a comment from chusqui May 8, 2018

st-pasha mentioned this issue May 8, 2018

Master task for fread bugs / proposals #2247

Closed

mattdowle modified the milestones: 1.11.6, 1.11.4 May 12, 2018

mattdowle modified the milestones: 1.11.4, 1.11.6 May 24, 2018

jangorecki modified the milestones: 1.12.0, 1.11.6 Jun 6, 2018

mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018

mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 11, 2019

ghuls mentioned this issue Oct 20, 2021

CsvReader .with_null_values requires passing double-quotes as part of String pola-rs/polars#1564

Closed

ben-schwen mentioned this issue Jun 18, 2024

fread() parameter na.strings does not work as expected #6186

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

na.strings is too literal when column is quoted on file #2586

na.strings is too literal when column is quoted on file #2586

HughParsonage commented Jan 23, 2018 •

edited by st-pasha

Loading

mattdowle commented Mar 3, 2018 •

edited

Loading

HughParsonage commented Mar 3, 2018 •

edited

Loading

HughParsonage commented Mar 3, 2018 •

edited

Loading

mattdowle commented Mar 3, 2018

MichaelChirico commented May 2, 2018 •

edited

Loading

st-pasha commented May 2, 2018

HughParsonage commented May 2, 2018

MichaelChirico commented May 2, 2018 via email

st-pasha commented May 2, 2018

MichaelChirico commented May 3, 2018 •

edited

Loading

st-pasha commented May 7, 2018 •

edited

Loading

MichaelChirico commented May 8, 2018

jaapwalhout commented May 8, 2018 •

edited

Loading

st-pasha commented May 8, 2018

ray-p144 commented May 10, 2018 •

edited

Loading

Brian-D-Hobbs commented Apr 8, 2019

Dheinny commented Jul 4, 2020

eliocamp commented Dec 8, 2023

na.strings is too literal when column is quoted on file #2586

na.strings is too literal when column is quoted on file #2586

Comments

HughParsonage commented Jan 23, 2018 • edited by st-pasha Loading

mattdowle commented Mar 3, 2018 • edited Loading

HughParsonage commented Mar 3, 2018 • edited Loading

HughParsonage commented Mar 3, 2018 • edited Loading

mattdowle commented Mar 3, 2018

MichaelChirico commented May 2, 2018 • edited Loading

st-pasha commented May 2, 2018

HughParsonage commented May 2, 2018

MichaelChirico commented May 2, 2018 via email

st-pasha commented May 2, 2018

MichaelChirico commented May 3, 2018 • edited Loading

st-pasha commented May 7, 2018 • edited Loading

MichaelChirico commented May 8, 2018

jaapwalhout commented May 8, 2018 • edited Loading

st-pasha commented May 8, 2018

ray-p144 commented May 10, 2018 • edited Loading

Brian-D-Hobbs commented Apr 8, 2019

Dheinny commented Jul 4, 2020

eliocamp commented Dec 8, 2023

HughParsonage commented Jan 23, 2018 •

edited by st-pasha

Loading

mattdowle commented Mar 3, 2018 •

edited

Loading

HughParsonage commented Mar 3, 2018 •

edited

Loading

HughParsonage commented Mar 3, 2018 •

edited

Loading

MichaelChirico commented May 2, 2018 •

edited

Loading

MichaelChirico commented May 3, 2018 •

edited

Loading

st-pasha commented May 7, 2018 •

edited

Loading

jaapwalhout commented May 8, 2018 •

edited

Loading

ray-p144 commented May 10, 2018 •

edited

Loading