This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

Tokens with "pads" can't be converted to dfm #46

Closed
chainsawriot opened this issue Nov 22, 2023 · 1 comment

chainsawriot commented Nov 22, 2023

While working on #27:

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- tokens(c("a b c", "A B C D")) |>
    tokens_remove("b", padding = TRUE)
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" ""  "c"
#> 
#> text2 :
#> [1] "A" ""  "C" "D"
toks %>% tokens_proximity("a") %>% dfm()
#> Error in Matrix::sparseMatrix(j = index, p = cumsum(c(1L, lengths(x))) - : 'i' and 'j' must be positive

Created on 2023-11-22 with reprex v2.0.2

@chainsawriot chainsawriot added the bug Something isn't working label Nov 22, 2023
@chainsawriot chainsawriot self-assigned this Nov 22, 2023

chainsawriot commented Nov 22, 2023

Pads are assigned the token ID zero, which makes Matrix::sparseMatrix() fail because its indices are 1-based by default. In quanteda, pads are handled by the C++ code.

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- tokens(c("a b c", "A B C D")) |>
    tokens_remove("b", padding = TRUE)
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" ""  "c"
#> 
#> text2 :
#> [1] "A" ""  "C" "D"
toks %>% tokens_proximity("a") -> temp
unlist(unclass(temp), use.names = FALSE)
#> [1] 1 0 2 1 0 2 3
attr(temp, "types")
#> [1] "a" "c" "d"

Created on 2023-11-22 with reprex v2.0.2
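
The failure can be isolated from quanteda. The following is a minimal sketch, assuming the dfm is built with a call shaped like the one in the error message; `index` and `doc_lengths` are hypothetical names holding the values printed above, and the cell values are set to 1 purely for illustration.

```r
library(Matrix)

# Token IDs copied from the output above; 0 marks a pad.
index <- c(1L, 0L, 2L, 1L, 0L, 2L, 3L)
doc_lengths <- c(3L, 4L)  # text1 has 3 tokens, text2 has 4

# Same argument shape as in the error message: column indices `j` plus
# 0-based row pointers `p`. The zero in `j` is not a valid 1-based index.
res <- tryCatch(
    sparseMatrix(
        j = index,
        p = cumsum(c(1L, doc_lengths)) - 1L,
        x = rep(1, length(index)),
        dims = c(2L, 3L)
    ),
    error = function(e) e
)

# Check whether the call failed.
inherits(res, "error")
```

The call is expected to fail for the same reason as the `dfm()` call in the report: a zero column index is out of range when `index1 = TRUE`, which is the default.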

A hacky solution is to check whether unlist(unclass(temp), use.names = FALSE) contains any zeros; if it does, add 1 to every value and prepend "" to attr(temp, "types") so that the pad gets a column of its own. Unlike how dfm.tokens_xptr() handles remove_padding, we remove the pad column from the resulting dfm.
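
The workaround described above can be sketched with Matrix alone. This is an illustration under stated assumptions, not the code from the referenced commit: `ids`, `types`, `has_pad`, and `m` are hypothetical names, the token IDs are copied from the reprex, and every cell value is set to 1 instead of a proximity-based weight.

```r
library(Matrix)

# Values copied from the reprex: integer token IDs (0 = pad) and the types.
ids <- list(text1 = c(1L, 0L, 2L), text2 = c(1L, 0L, 2L, 3L))
types <- c("a", "c", "d")

index <- unlist(ids, use.names = FALSE)
has_pad <- any(index == 0L)
if (has_pad) {
    index <- index + 1L    # pads move from 0 to 1; real types shift up by one
    types <- c("", types)  # "" becomes the name of the first column
}

# Row-wise construction: `p` holds the 0-based document boundaries.
m <- sparseMatrix(
    j = index,
    p = cumsum(c(1L, unname(lengths(ids)))) - 1L,
    x = rep(1, length(index)),  # placeholder values, not proximity weights
    dims = c(length(ids), length(types)),
    dimnames = list(names(ids), types)
)

# Drop the pad column so that "" does not appear as a feature.
if (has_pad) {
    m <- m[, -1, drop = FALSE]
}
m
```

Shifting every ID by one keeps the IDs and `types` aligned, so the pad column is always the first one and can be dropped by position.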

chainsawriot added a commit that referenced this issue Nov 22, 2023