Assignment operator := no longer accepts non-ASCII text values whose Encoding attribute is "unknown" #7648

@BASS-JN

This small bug (present since 1.18.0, obtained from the Docker image rocker/tidyverse-4.5.2, and still in 1.18.99 downloaded from CRAN as a tarball) is that I cannot use := to assign to a factor column a value that contains non-ASCII characters but whose Encoding attribute has been set to 'unknown'. It raises the error "as.character.factor(x) : ill-formed factor". It works if I leave the "UTF-8" attribute in place.

This did not happen in data.table 1.16.4 or 1.17.xx: a value marked "unknown" was accepted (with the same version of forcats: 1.0.1).

But why would anyone do this odd thing (erasing the Encoding attribute)? Because of fst!

As I am French, I regularly handle factors whose levels contain non-ASCII characters, typically 'Été' ("summer").

These values are usually handled in the native encoding (UTF-8 on Linux but latin-1 on Windows): assigned from source code, read from xlsx files using openxlsx, etc. Other values that are 100% ASCII are left as "unknown".

That means that, for a factor like c('Été','Hiver'), the levels carry a mix of markings: "UTF-8" for levels with non-ASCII characters, "unknown" otherwise.
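To make this concrete, here is a minimal base R sketch that shows the per-level markings. It assumes a UTF-8 session, where a string written with \u escapes is marked "UTF-8"; the variable names are only illustrative.

```r
# Build a factor with one non-ASCII level and one pure-ASCII level.
f <- factor(c("\u00C9t\u00E9", "Hiver"))
lev <- levels(f)

# Name each encoding mark by its level, so the display does not
# depend on the locale-specific sort order of the levels.
marks <- setNames(Encoding(lev), lev)
print(marks)
```

The pure-ASCII level is always reported as "unknown"; the mark of the accented level depends on how the string entered the session.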

The problem comes with fst: for speed reasons, it does not handle the Encoding attribute. So each time I write a data.table containing accents (with Encoding == UTF-8) using write_fst, reading it back with read_fst leaves every level marked "unknown", whether it is 100% ASCII or not.

Then comes the next problem: if I assign the same value but with the "UTF-8" Encoding attribute, it actually creates a duplicated level (without complaining). An error is raised only a few steps later, on other operations.

My solution is to systematically convert every imported text (file reading, source code), from the very beginning, to the native encoding of the platform (enc2native or an equivalent from the stringi package) to avoid any collision, and then set all Encoding attributes to 'unknown'.
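A minimal sketch of that normalisation step, in base R only. The helper names normalize_unknown and normalize_factor are hypothetical (they are not part of data.table, fst or stringi), and this is one possible way to write it rather than the exact code I use.

```r
# Hypothetical helper: convert to the native encoding, then drop the mark.
normalize_unknown <- function(x) {
  stopifnot(is.character(x))
  x <- enc2native(x)        # re-encode to the platform's native encoding
  Encoding(x) <- "unknown"  # remove any "UTF-8" / "latin1" marking
  x
}

# Hypothetical helper: apply the same normalisation to factor levels.
normalize_factor <- function(f) {
  stopifnot(is.factor(f))
  levels(f) <- normalize_unknown(levels(f))
  f
}

f <- normalize_factor(factor(c("\u00C9t\u00E9", "Hiver")))
print(Encoding(levels(f)))
```

After this step every level reports "unknown", which matches what read_fst returns, so values normalised the same way should compare equal to the stored levels.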

So, would it be possible to allow (again) assigning to factor columns non-ASCII values whose Encoding is marked "unknown"?

Thanks a lot for this wonderful work!
Jean-Noël BASS
F-78120 RAMBOUILLET (France)

# Minimal reproducible example; please be sure to set verbose=TRUE where possible!

library(magrittr); library(data.table); library(forcats)
testDT_EncodFactors <- function(){
  v <- '\u00C9t\u00E9' %>% set_names(., Encoding(.)) # "Été" marked as UTF-8
  # `Encoding<-` must be backtick-quoted to be called as a function
  v1 <- `Encoding<-`(v, 'unknown') %>% set_names(., Encoding(.)) # "Été" marked as unknown
  cat('Normal encoding\n'); print(v)
  cat('After setting encoding attribute to unknown\n'); print(v1)
  cat('A data.table with only ASCII levels\n')
  df <- data.table(Saison = factor(c('Hiver','Printemps','Automne')))
  print(df)
  cat('Assignment of a non-ASCII value to a factor, marked as UTF-8\n')
  df[Saison == 'Automne', Saison := v]
  print(df)
  print(levels(df$Saison) %>% set_names(., Encoding(.))) # levels: mix of unknown and UTF-8

  cat('Setting Encoding attribute to unknown in factor column\n')
  df %<>% .[, Saison := fct_relabel(Saison, function(x){`Encoding<-`(x, 'unknown')})]
  print(df)
  print(levels(df$Saison) %>% set_names(., Encoding(.))) # levels: all marked unknown

  cat('Assignment of a non-ASCII value marked as unknown Encoding, consistent with the levels already present: the bug shows if print raises an error\n')
  df[Saison == 'Printemps', Saison := v1]
  print(df)
}
testDT_EncodFactors()

# Output of sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0

locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.5.2 tools_4.5.2
