na.strings is too literal when column is quoted on file #2586
Comments
Please check that documentation edit makes sense. If so, maybe quotes should be disallowed from inside |
Yes, this came about from a real use-case: the example was just meant to be minimal. For example, in the attached file:
library(data.table)
library(magrittr)
na <- "Age not report"
# No value is recognised as missing, because the field is quoted in the file
fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% anyNA(age_5y)
#> [1] FALSE
# The NA string survives parsing as an ordinary character value
fread("Age-not-report-small.txt", na.strings = na, sep = ",") %$% any(age_5y == na)
#> [1] TRUE |
Strangely, though
|
Ok thanks. I think I get it now then. So it's when the source system uses some special string in a string column to represent NA, and then quotes every field when writing the csv. Hence the quotes around the special string. |
Just came across the same:
IMO they should return the same output -- I don't think I should need to keep track of the quoting rule used in my file unless absolutely necessary. Here, @mattdowle is exactly right, and I think it's quite common for CSV writers to just quote everything by default. My use case has |
@MichaelChirico My understanding is that current behavior is not an omission: it was implemented specifically with the intention to disambiguate NA strings versus "NA" strings. For example, in a file like this
the second row is an empty string, while 3rd row is an NA string. Similarly, in this file (which must be parsed with
the second row has 2-character string
Now of course your use case is just as valid as the ones presented here. I'm just pointing out that the current behavior is not a clear-cut "bug": it is the way it is by design. The real question then is whether the current design is a good one, or where do we go from here?
The important use-case to consider is that of an empty string (which is the default
should we consider the column |
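The design rationale in this comment can be modelled in a few lines. This is only an illustrative Python sketch, not fread's actual implementation; `parse_field_literal` and the sample tokens are hypothetical. It compares each raw field, quotes included, against the NA strings, so a quoted field is never missing unless its quoted form is itself listed.

```python
def parse_field_literal(raw, na_strings):
    """Model of literal NA matching: compare the raw token, quotes and all."""
    if raw in na_strings:
        return None  # treated as a missing value
    # Surrounding quotes are stripped only after the NA check.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    return raw


# Hypothetical raw fields of one column: a quoted empty string,
# a bare NA, and a quoted NA.
tokens = ['""', 'NA', '"NA"']
parsed = [parse_field_literal(t, na_strings={"NA"}) for t in tokens]
print(parsed)
```

Under this model, the quoted `"NA"` comes through as the two-character string, which is the disambiguation described above, and it also explains why listing the quoted form in the NA strings acts as a workaround.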
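I agree that there is not a straightforward fix. However, my view is that when every entry in a column is quoted, the na.strings argument does not need to be. For example
A,"B"
1,"y",
2,"x",
3,"NA"
The last value should be regarded as missing if na.strings = "NA". So this fits in with your third bullet point (though I'm actually advocating for a more narrow change: only if *all* the values are quoted should "NA" be treated as an NA-string).
I do think that if na.strings contains a string s then s %in% v should be FALSE for all v, and if the parser cannot honour this, there should be a warning. For me at least, fread is far better than a text editor to view files, and so I use it to modify na.strings as required, using the values as parsed. In the file that motivated this issue, the string "Age not report" was undocumented and occurred about 70 million rows down and 120 columns across, so the only plausible way I was ever going to detect it was to use fread and unique on the column. But since fread had stripped away the quotes, I saw Age not report rather than "Age not report". |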
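The narrower rule proposed in this comment (treat a quoted NA string as missing only when every value in the column is quoted) could look roughly like the following. It is a sketch under stated assumptions, not a proposal for fread's internals: `is_quoted` and `parse_column` are hypothetical helpers, and the input is taken to be the raw data fields of one column, excluding the header.

```python
def is_quoted(raw):
    """True if the raw field is wrapped in double quotes."""
    return len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"'


def parse_column(raw_fields, na_strings):
    """Unquoted fields always honour na_strings; quoted fields honour
    them only when every field in the column is quoted."""
    all_quoted = all(is_quoted(f) for f in raw_fields)
    parsed = []
    for raw in raw_fields:
        if is_quoted(raw):
            value = raw[1:-1]
            parsed.append(None if all_quoted and value in na_strings else value)
        else:
            parsed.append(None if raw in na_strings else raw)
    return parsed


# Column B from the example in this comment: every field is quoted.
fully_quoted = parse_column(['"y"', '"x"', '"NA"'], {"NA"})
# A mixed column keeps the quoted "NA" as a literal string.
mixed = parse_column(['y', 'NA', '"NA"'], {"NA"})
print(fully_quoted, mixed)
```

The mixed column preserves the existing disambiguation between NA and "NA", while the fully quoted column behaves the way a user of a quote-everything CSV writer would expect.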
Agree with Hugh here. We shouldn't close the door to meticulous handling of
more ambiguous cases, but my feeling is cases like mine will be far more
common.
Especially in the special set of cases when a column is fully properly
quoted for a given field on _every_ row, where I think it's easy to agree
about the right behavior
On Thu, May 3, 2018, 1:01 AM HughParsonage wrote:
I agree that there is not a straightforward fix. However, my view is that
when every entry in a column is quoted, the na.strings argument does not
need to be. For example
A,"B"
1,"y",
2,"x",
3,"NA"
The last value should be regarded as missing if na.strings = "NA". So
this fits in with your third bullet point (though I'm actually advocating
for a more narrow change: only if *all* the values are quoted should
`"NA"` be treated as an NA-string).
I do think that if na.strings contains a string s then s %in% v should be
FALSE for all v, and if the parser cannot honour this, there should be a
warning. For me at least, fread is far better than a text editor to view
files, and so I use it to modify na.strings as required, using the values
as parsed. In the file that motivated this issue, the string "Age not
report" was undocumented and occurred about 70 million rows down and 120
columns across, so the only plausible way I was ever going to detect it was
to use fread and unique on the column. But since fread had stripped away
the quotes, I saw Age not report rather than "Age not report".
|
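The invariant suggested in the quoted comment (no NA string should survive in any parsed column, and the parser should warn if one does) can be checked after parsing. The following is a minimal sketch, assuming parsed data is available as a mapping from column name to a list of values with `None` for missing; `warn_on_surviving_na_strings` and the sample data are hypothetical.

```python
import warnings


def warn_on_surviving_na_strings(columns, na_strings):
    """Return (column, na_string) pairs where an NA string is still
    present as an ordinary value, emitting a warning for each."""
    survivors = []
    for name, values in columns.items():
        present = {v for v in values if isinstance(v, str)}
        for s in na_strings:
            if s in present:
                survivors.append((name, s))
                warnings.warn(
                    f"column {name!r} still contains NA string {s!r} after parsing"
                )
    return survivors


# Hypothetical parsed result in which the quoted NA string was not recognised.
parsed = {"age_5y": ["0-4", "Age not report"], "sex": ["F", None]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    survivors = warn_on_surviving_na_strings(parsed, ["Age not report"])
print(survivors, len(caught))
```

A check like this would have flagged the "Age not report" case without the user having to inspect unique values 70 million rows into the file.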
So suppose we have the following 4 test files:
What do you think is the ideal way to read each of these files with each of the following commands?
Once we have an agreement on what the "right" way is, we can try to figure out what logic can implement it. |
Assuming default
|
After some more thought, I'm inclined to side with @HughParsonage here: if the user says This includes file3.txt with An altogether separate issue is what the default value of |
That sounds good to me. I think changing the default to |
Relevant question on StackOverflow (I think): "Behaviour of fread for quoted character columns in version 1.11.0". The current behavior in 1.11.0 apparently leads to confusion. If it's not too much effort, it might be worth considering including a fix in 1.11.2? |
Note: there are also reports that the current workaround (providing NA strings with quotes) is not reliable either:
fread('
c1, c2, c3, c4
a, b, c, d
nan, inf, "inf", "nan"
', na.strings = c('inf', 'nan', '"inf"', '"nan"'))
# c1 c2 c3 c4
# <char> <char> <char> <char>
# 1: a b c d
# 2: <NA> <NA> inf nan |
Something else to consider is that with the I don't believe the So recreating this table:
library(data.table)
df <- data.table(i = c(1.0, NA, 3.0, 4.0), value = c("foo", "bar", NA, "baz"))
df
# i value
# 1: 1 foo
# 2: NA bar
# 3: 3 <NA>
# 4: 4 baz
str(df)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
# $ i : num 1 NA 3 4
# $ value: chr "foo" "bar" NA "baz"
# - attr(*, ".internal.selfref")=<externalptr>
Using the following python code:
import pandas as pd
import csv
df = pd.DataFrame({'i': [1.0, None, 3.0, 4.0], 'value': ['foo', 'bar', None, 'baz']})
print(df)
# i value
# 0 1.0 foo
# 1 NaN bar
# 2 3.0 None
# 3 4.0 baz
And writing it to a txt file with:
df.to_csv('file.txt', index=False, quoting=csv.QUOTE_NONNUMERIC)
Results in this file:
With
library(data.table)
df <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('', '""'))
df
# i value
# 1: 1 foo
# 2: NA bar
# 3: 3 NA
# 4: 4 baz
str(df)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
# $ i : num 1 NA 3 4
# $ value: chr "foo" "bar" NA "baz"
# - attr(*, ".internal.selfref")=<externalptr>
While using
library(data.table)
df <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c(''))
df
# i value
# 1: 1.0 foo
# 2: NA bar
# 3: 3.0 NA
# 4: 4.0 baz
str(df)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
# $ i : chr "1.0" NA "3.0" "4.0"
# $ value: chr "foo" "bar" NA "baz"
# - attr(*, ".internal.selfref")=<externalptr>
With
library(data.table)
df1 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c(''))
df2 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('', '""'))
df3 <- fread('"i","value"\n1.0,"foo"\n"","bar"\n3.0,""\n4.0,"baz"', na.strings = c('""'))
identical(df1, df2)
# TRUE
identical(df1, df3)
# TRUE
df1
# i value
# 1: 1 foo
# 2: NA bar
# 3: 3
# 4: 4 baz
str(df1)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
# $ i : num 1 NA 3 4
# $ value: chr "foo" "bar" "" "baz"
# - attr(*, ".internal.selfref")=<externalptr> |
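The ambiguity in the file discussed in this comment is not specific to pandas. As a minimal sketch using only Python's standard `csv` module (the `rows` data mirrors the table above; the variable names are illustrative), writing missing values with `QUOTE_NONNUMERIC` emits `""` for them, and reading the text back yields empty strings, so the file itself no longer distinguishes a missing value from an empty string.

```python
import csv
import io

# Same table as above, with None standing in for the missing values.
rows = [["i", "value"], [1.0, "foo"], [None, "bar"], [3.0, None], [4.0, "baz"]]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerows(rows)
text = buf.getvalue()
print(text)

# Reading back with the same quoting rule: unquoted fields become floats,
# quoted fields (including the quoted empty ones) stay strings.
read_back = list(csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONNUMERIC))
print(read_back)
```

Any reader of such a file has to decide by convention whether `""` means missing or empty, which is why `na.strings` handling of quoted empty fields matters here.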
I am continuing to have problems with quoted NA strings in version 1.12.2 where the same quoted NA string ("-" in my case) is differentially treated depending on whether the field is read as integer, numeric, or character. Consider the following example:
Is this expected behavior? If so, how can I circumvent this behavior to have all "-" recognized as NA irrespective of the class of the field? To give an idea of the context of this issue, I am working with a lab manifest that has a mix of sample IDs, comments, and measurements which are all reported in a fully quoted csv. Running sed or another command line tool is not an option to replace the "-" prior to fread since some of the sample IDs contain the "-" character. |
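One possible workaround for the situation described in this comment, where a plain text substitution would corrupt IDs containing "-", is to rewrite the file at the field level before reading it. The sketch below is an assumption-laden illustration rather than anything suggested in the thread: `replace_exact_fields` and the sample data are hypothetical, and whether the downstream reader treats the resulting empty unquoted fields as missing depends on its own settings.

```python
import csv
import io


def replace_exact_fields(text, na_strings, replacement=""):
    """Rewrite CSV text, replacing only fields that are exactly equal
    to one of na_strings; substrings inside other fields are untouched."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")  # default minimal quoting
    for row in reader:
        writer.writerow(
            [replacement if field in na_strings else field for field in row]
        )
    return out.getvalue()


# Hypothetical fully quoted manifest: IDs contain "-" but are not NA.
sample = '"id","count","note"\n"A-12","-","ok"\n"B-7","3","-"\n'
cleaned = replace_exact_fields(sample, {"-"})
print(cleaned)
```

Because the comparison happens on parsed fields, `A-12` and `B-7` are left intact while the standalone `-` fields are blanked.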
Hey, @HughParsonage, did you find a solution for this case? |
Came to report this same issue in a real use case. Here "no se midió" should be the NA value, but because columns are quoted it doesn't work as expected.
url <- "https://ciam.ambiente.gob.ar/dt_csv.php?dt_id=372"
data.table::fread(url, sep = ";", na.strings = "no se midió") |>
_$escher_coli_nmp_100ml |>
head()
#> [1] "no se midió" "no se midió" "no se midió" "no se midió" "no se midió"
#> [6] "no se midió"
data.table::fread(url, sep = ";", na.strings = '"no se midió"') |>
_$escher_coli_nmp_100ml |>
head()
#> [1] NA NA NA NA NA NA
read.csv(url, sep = ";", na.strings = "no se midió") |>
_$escher_coli_nmp_100ml |>
head()
#> [1] NA NA NA NA NA NA
Created on 2023-12-08 with reprex v2.0.2 |
# Min reprex
Using
na_string <- '"x y z"'
will get the right answer, but that was a bit difficult to deduce.
Output of verbose output:
data.table version:
# Output of sessionInfo()