Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

na.strings is too literal when column is quoted on file #2586

Open
HughParsonage opened this issue Jan 23, 2018 · 18 comments
Open

na.strings is too literal when column is quoted on file #2586

HughParsonage opened this issue Jan 23, 2018 · 18 comments

Comments

@HughParsonage
Copy link
Member

HughParsonage commented Jan 23, 2018

# Min reprex

na_string <- "x y z"
out <- fread('A,B,C\nD,"x y z",E', na.strings = na_string, verbose = TRUE, sep = ",")
anyNA(out$B)
#> [1] FALSE
### yet
na_string %chin% out$B
#> [1] TRUE

Using na_string <- '"x y z"' will get the right answer, but that was a bit difficult to deduce.

Output of verbose output:
Input contains a \n or is "". Taking this to be text input (not a filename)
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<"x y z">>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  `input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<A,B,C>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ','
  sep=','  with 2 lines of 3 fields using quote rule 0
  Detected 3 columns on line 1. This line is either column names or first data row. Line starts as: <<A,B,C>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 3
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 1 because (17 bytes from row 1 to eof) / (2 * 17 jump0size) == 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (bool8) on row 2
  Type codes (jump 000)    : A1A  Quote rule 0
  =====
  Sampled 2 rows (handled \n inside quoted fields) at 1 jump points
  Bytes from first data row on line 1 to the end of last row: 17
  Line length: mean=8.50 sd=3.54 min=6 max=11
  Estimated number of rows: 17 / 8.50 = 2
  Initial alloc = 2 rows (2 + 0%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  All rows were sampled since file is small so we know nrow=2 exactly
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : A1A
[10] Allocate memory for the datatable
  Allocating 3 column slots (3 - 0 dropped) with 2 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=11
Read 1 rows x 3 columns from 17 bytes file in 00:00.008 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '1'
         2 : string    'A'
=============================
   0.002s ( 19%) Memory map 0.000GB file
   0.005s ( 62%) sep=',' ncol=3 and header detection
   0.001s (  6%) Column type detection using 2 sample rows
   0.000s (  6%) Allocation of 2 rows x 3 cols (0.000GB) of which 1 ( 50%) rows used
   0.000s (  6%) Reading 1 chunks of 1.000MB (123361 rows) using 1 threads
   =    0.000s (  0%) Finding first non-embedded \n after each jump
   +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s (  6%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.008s        Total

data.table version:

data.table 1.10.5 IN DEVELOPMENT built 2018-01-23 00:48:08 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com

# Output of sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hutils_0.9.0.9    data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.3  magrittr_1.5    tools_3.4.3     yaml_2.1.16     fastmatch_1.1-0 knitr_1.18     
[7] reprex_0.1.1 
@mattdowle
Copy link
Member

mattdowle commented Mar 3, 2018

Please check that documentation edit makes sense. If so, maybe quotes should be disallowed from inside na.strings values. You shouldn't really be able to do what you did above! Is there a real use-case that needs quoted NA values to be recognized? Maybe if the separator is in the NA value or something crazy like that? (If sep is present in na.strings= that should be disallowed too, thinking about it.)

@HughParsonage
Copy link
Member Author

HughParsonage commented Mar 3, 2018

Yes, this came about from a real use-case: example was just meant to be minimal.

For example, in the attached file, Age not report was intended to be a missing value:

library(magrittr)
na <- "Age not report"
fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% anyNA(age_5y)
#> [1] FALSE

fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% any(age_5y == na)
#> [1] TRUE

Age-not-report-small.txt

@HughParsonage
Copy link
Member Author

HughParsonage commented Mar 3, 2018

Strangely, though readr::read_csv does get it right, it takes a long time to read in:

system.time(readr::read_csv("Age-not-report-small.txt", na = "Age not report"))
Parsed with column specification:
cols(
  age_5y = col_character()
)
   user  system elapsed 
   1.15    0.00    1.15 

@mattdowle
Copy link
Member

Ok thanks. I think I get it now then. So it's when the source system uses some special string in a string column to represent NA, and then quotes every field when writing the csv. Hence the quotes around the special string.

@mattdowle mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018
@MichaelChirico
Copy link
Member

MichaelChirico commented May 2, 2018

Just came across the same:

fread('a,b\n1,"-"', na.strings = '-')
#    a b
# 1: 1 -

# compared to
fread('a,b\n1,"-"', na.strings = '"-"')
#    a  b
# 1: 1 NA

IMO they should return the same output -- I don't think I should need to keep track of the quoting rule used in my file unless absolutely necessary. Here, fread correctly determines that field is surrounded by " and understands the contents of the field to be between "", so na.strings should apply to the content of the field.

@mattdowle exactly right, and I think it's quite common for CSV writers to just quote everything by default. My use case has - / - as the NA string...

@st-pasha
Copy link
Contributor

st-pasha commented May 2, 2018

@MichaelChirico My understanding is that current behavior is not an omission: it was implemented specifically with the intention to disambiguate NA strings versus "NA" strings. For example, in a file like this

key,value
1,"orange"
2,""
3,
4,"apple"

the second row is an empty string, while 3rd row is an NA string.

Similarly, in this file (which must be parsed with na_strings="NA"):

index,state
93456,"CA"
72001,"NA"
14829,NA

the second row has 2-character string "NA", whereas the 3rd row contains a missing (NA) string.


Now of course your use case is just as valid as the ones presented here. I'm just pointing out that the current behavior is not a clear-cut "bug": it is the way it is by design.

The real question then is whether the current design is a good one, or where do we go from here?

  • The simplest approach is not to do anything: declare the current behavior correct and close the issue;
  • Another approach would be to treat both "NA" and NA as an NA-string, as the OP suggests;
  • Another possibility is to treat "NA" as an NA-string only if there are no unquoted NAs in the same column, however if there are then "NA" becomes a 2-character string;
  • Or perhaps we treat both "NA" and NA as 2-character strings (not NAs) regardless of the na_strings setting -- this way a character column produced by fread will never contain an NA value.
  • any other?

The important use-case to consider is that of an empty string (which is the default na_strings setting). In a file like this:

A,B,C
1,foo,2
3,,4
,bar,5

should we consider the column B to contain ["foo", "", "bar"] or ["foo", NA, "bar"]. In my view the first is a more reasonable interpretation, but of course opinions may vary...

@HughParsonage
Copy link
Member Author

I agree that there is not a straightforward fix. However, my view is that when every entry in a column is quoted, the na.strings argument does not need to be. For example

A,"B"
1,"y",
2,"x",
3,"NA"

The last value should be regarded as missing if na.strings = "NA". So this fits in with your third bullet point (though I'm actually advocating for a more narrow change: only if all the values are quoted should `"NA" be treated as an NA-string).

I do think that if na.strings contains a string s then s %in% v should be FALSE for all v, and if the parser cannot honour this, there should be a warning. For me at least, fread is far better than a text editor to view files, and so I use it to modify na.strings as required, using the values as parsed. In the file that motivated this issue, the string "Age not report" was undocumented and occurred about 70 million rows down and 120 columns across, so the only plausible way I was ever going to detect it was to use fread and unique on the column. But since fread had stripped away the quotes, I saw Age not report rather than "Age not report".

@MichaelChirico
Copy link
Member

MichaelChirico commented May 2, 2018 via email

@st-pasha
Copy link
Contributor

st-pasha commented May 2, 2018

So suppose we have the following 4 test files:

file1.txt:     file2.txt:      file3.txt:      file4.txt:

i,value        i,"value"       i,value         i,value
1,foo          1,"foo"         1,"foo"         1,"foo"
2,bar          2,"bar"         2,"bar"         2,"bar"
3,             3,""            3,              3,""
4,baz          4,"baz"         4,"baz"         4,baz

What do you think is the ideal way to read each of these files with each of the following commands?

fread("fileN.txt", na.strings="")
fread("fileN.txt", na.strings="baz")
fread("fileN.txt")

Once we have an agreement on what the "right" way is, we can try to figure out what logic can implement it.

@MichaelChirico
Copy link
Member

MichaelChirico commented May 3, 2018

Assuming default na.strings is "NA". First pass I give:

File na.strings="" na.strings="baz" na.strings="NA"
file1.txt c('foo','bar',NA,'baz') c('foo','bar','',NA) c('foo','bar','','baz')
file2.txt c('foo','bar',NA,'baz') c('foo','bar','',NA) c('foo','bar','','baz')
file3.txt c('foo','bar','','baz') c('foo','bar','',NA) c('foo','bar','','baz')
file4.txt c('foo','bar',NA,'baz') c('foo','bar','',NA) c('foo','bar','','baz')

file3 is the one that gives me pause. I gave this answer since na.strings = NA_character_ can allow user to set this as NA if they want it to be. I don't give much weight to whether header row is/isn't quoted so file2/file4 are the same to me.

@st-pasha
Copy link
Contributor

st-pasha commented May 7, 2018

After some more thought, I'm inclined to side with @HughParsonage here: if the user says na.strings="something", then there is a clear expectation that all occurrences of "something" will be replaced with NAs, regardless of whether they were quoted or not. In other words, there must be no "something" values in the output.

This includes file3.txt with na.strings="" too: it should be parsed exactly as file1, file2 or file4. (Easier to implement/document too!).

An altogether separate issue is what the default value of na.strings should be. Currently it's "NA", but there is a note in NEWS.md that this will be changed to "" in the future. I believe that the default parse of file1, file2 and file4 should be c('foo','bar','','baz'), whereas for file3 it could be c('foo','bar',NA,'baz') if we really want to (see #2217 for example). However this would require that the default value of na.strings was not "" but some __auto__ value that indicates the need for special handling. Note also #2100 which asks to make na.strings even more automatic.

@MichaelChirico
Copy link
Member

That sounds good to me. I think changing the default to "" will probably break a lot of code, maybe better to jump right to __auto__-like behavior, depending on the implementation difficulty.

@jaapwalhout
Copy link

jaapwalhout commented May 8, 2018

Relevant question on StackOverflow (I think):

Behaviour of fread for quoted character columns in version 1.11.0

The current behavior in 1.11.0 leads apparently to confusion. If it's not too much effort, it might be worth considering to include a fix in 1.11.2?

@st-pasha
Copy link
Contributor

st-pasha commented May 8, 2018

Note: there are also reports that the current workaround (providing NA strings with quotes) is not reliable either:

fread('
  c1,  c2,  c3,    c4
  a,   b,   c,     d
  nan, inf, "inf", "nan"
', na.strings = c('inf', 'nan', '"inf"', '"nan"'))

#        c1     c2     c3     c4
#    <char> <char> <char> <char>
# 1:      a      b      c      d
# 2:   <NA>   <NA>    inf    nan

@ray-p144
Copy link

ray-p144 commented May 10, 2018

Something else to consider is that with the python library pandas, writing to a csv file with the quoting=csv.QUOTE_NONNUMERIC will always write "" in the field for missing values (even for numeric columns).

I don't believe the python/pandas way is "correct" but data.table 1.10.4-3 is able to handle it by specifying na.strings = c('', '""'), while data.table 1.11.2 is not. For small files, manually looping over the character columns and replacing "" with NA isn't too much of a hassle, but for people who switch between python and R a lot, being able to read in files written with pandas would be very helpful.

So recreating this table:

library(data.table)
df <- data.table(i = c(1.0, NA, 3.0, 4.0), value = c("foo", "bar", NA, "baz"))
df
#     i value
# 1:  1   foo
# 2: NA   bar
# 3:  3  <NA>
# 4:  4   baz
str(df)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : num  1 NA 3 4
#  $ value: chr  "foo" "bar" NA "baz"
#  - attr(*, ".internal.selfref")=<externalptr>

Using the following python code:

import pandas as pd
import csv
df = pd.DataFrame({'i': [1.0, None, 3.0, 4.0], 'value': ['foo', 'bar', None, 'baz']})
print(df)
#      i value
# 0  1.0   foo
# 1  NaN   bar
# 2  3.0  None
# 3  4.0   baz

And writing it to a txt file with:

df.to_csv('file.txt', index=False, quoting=csv.QUOTE_NONNUMERIC)

Results in this file:

"i","value"
1.0,"foo"
"","bar"
3.0,""
4.0,"baz"

With data.table 1.10.4-3 I am able to read it back in to match the original R table using na.strings = c('', '""'):

library(data.table)
df <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('', '""'))
df
#     i value
# 1:  1   foo
# 2: NA   bar
# 3:  3    NA
# 4:  4   baz
str(df)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : num  1 NA 3 4
#  $ value: chr  "foo" "bar" NA "baz"
#  - attr(*, ".internal.selfref")=<externalptr> 

While using na.strings = c('') with data.table 1.10.4-3 results in the column being read in as a character:

library(data.table)
df <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c(''))
df
#      i value
# 1: 1.0   foo
# 2:  NA   bar
# 3: 3.0    NA
# 4: 4.0   baz
str(df)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : chr  "1.0" NA "3.0" "4.0"
#  $ value: chr  "foo" "bar" NA "baz"
#  - attr(*, ".internal.selfref")=<externalptr> 

With data.table 1.11.2, there isn't a way to read the file in to match the original table. It always treats i as numeric and leaves the empty string in value:

library(data.table)
df1 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c(''))
df2 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('', '""'))
df3 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('""'))
identical(df1, df2)
# TRUE
identical(df1, df3)
# TRUE
df1
#     i value
# 1:  1   foo
# 2: NA   bar
# 3:  3      
# 4:  4   baz
str(df1)
# Classes ‘data.table’ and 'data.frame':	4 obs. of  2 variables:
#  $ i    : num  1 NA 3 4
#  $ value: chr  "foo" "bar" "" "baz"
#  - attr(*, ".internal.selfref")=<externalptr> 

@mattdowle mattdowle modified the milestones: 1.11.6, 1.11.4 May 12, 2018
@mattdowle mattdowle modified the milestones: 1.11.4, 1.11.6 May 24, 2018
@jangorecki jangorecki modified the milestones: 1.12.0, 1.11.6 Jun 6, 2018
@mattdowle mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 11, 2019
@Brian-D-Hobbs
Copy link

I am continuing to have problems with quoted NA strings in version 1.12.2 where the same quoted NA string ("-" in my case) is differentially treated depending on whether the field is read as integer, numeric, or character.

Consider the following example:

dt1 <- fread('"A","B","C"\n"1","b","1.2"\n"-","-","-"', na.strings="-")
dt1
#    A B   C
# 1: 1 b 1.2
# 2: - -   -
dt2 <- fread('"A","B","C"\n"1","b","1.2"\n"-","-","-"', na.strings='"-"')
dt2
#     A B   C
# 1:  1 b 1.2
# 2: NA -  NA
sapply(dt2, class)
#           A           B           C
#   "integer" "character"   "numeric"

Is this expected behavior?

If so, how can I circumvent this behavior to have all "-" recognized as NA irrespective of the class of the field?

To give an idea of the context of this issue, I am working with a lab manifest that has a mix of sample IDs, comments, and measurements which are all reported in a fully quoted csv. Running sed or another command line tool is not an option to replace the "-" prior to fread since some of the sample IDs contain the "-" character.

@Dheinny
Copy link

Dheinny commented Jul 4, 2020

Yes, this came about from a real use-case: example was just meant to be minimal.

For example, in the attached file, Age not report was intended to be a missing value:

library(magrittr)
na <- "Age not report"
fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% anyNA(age_5y)
#> [1] FALSE

fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% any(age_5y == na)
#> [1] TRUE

Age-not-report-small.txt

Hey, @HughParsonage , did you get any solution for this case?
Thanks in advance

@eliocamp
Copy link
Contributor

eliocamp commented Dec 8, 2023

Came to report this same issue in a real use case. Here "no se midió" should be the NA value, but because columns are quoted it doesn't work as expected. read.csv does work.

url <- "https://ciam.ambiente.gob.ar/dt_csv.php?dt_id=372"
data.table::fread(url, sep = ";", na.strings = "no se midió") |> 
  _$escher_coli_nmp_100ml |> 
  head()
#> [1] "no se midió" "no se midió" "no se midió" "no se midió" "no se midió"
#> [6] "no se midió"

data.table::fread(url, sep = ";", na.strings = '"no se midió"') |> 
  _$escher_coli_nmp_100ml |> 
  head()
#> [1] NA NA NA NA NA NA


read.csv(url, sep = ";", na.strings =  "no se midió") |> 
  _$escher_coli_nmp_100ml |> 
  head()
#> [1] NA NA NA NA NA NA

Created on 2023-12-08 with reprex v2.0.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants