Invalid Font Weight/Illegal character #75

Open
mfreyrie opened this issue Feb 27, 2020 · 1 comment
Comments

@mfreyrie

Hi,
I've been struggling to import multiple PDFs. I need to create a corpus for the tm package, but I keep getting the same error when I use pdftools to extract the text. Importing a single PDF works, however.
This is what I do:

library(tm)
library(pdftools)

files <- list.files(pattern = "pdf$")
opinions <- lapply(files, pdf_text)

This is what I get:

PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
[...]
PDF error (218): Illegal character <2f> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.

My sessionInfo() output:


> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 
[2] LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] broom_0.5.2     tm_0.7-7        NLP_0.2-0      
 [4] pdftools_2.3    tidytext_0.2.2  forcats_0.4.0  
 [7] stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3    
[10] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
[13] ggplot2_3.2.1   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] qpdf_1.1          tidyselect_0.2.5  slam_0.1-46      
 [4] haven_2.2.0       lattice_0.20-38   colorspace_1.4-1 
 [7] vctrs_0.2.0       generics_0.0.2    SnowballC_0.6.0  
[10] rlang_0.4.2       pillar_1.4.3      glue_1.3.1       
[13] withr_2.1.2       DBI_1.1.0         dbplyr_1.4.2     
[16] modelr_0.1.6      readxl_1.3.1      lifecycle_0.1.0  
[19] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
[22] rvest_0.3.5       parallel_3.6.1    tokenizers_0.2.1 
[25] Rcpp_1.0.3        scales_1.1.0      backports_1.1.5  
[28] jsonlite_1.6      fs_1.3.1          askpass_1.1      
[31] hms_0.5.3         stringi_1.4.3     grid_3.6.1       
[34] cli_2.0.1         tools_3.6.1       magrittr_1.5     
[37] lazyeval_0.2.2    janeaustenr_0.1.5 crayon_1.3.4     
[40] pkgconfig_2.0.3   zeallot_0.1.0     Matrix_1.2-17    
[43] xml2_1.2.2        reprex_0.3.0      lubridate_1.7.4  
[46] assertthat_0.2.1  httr_1.4.1        rstudioapi_0.11  
[49] R6_2.4.1          nlme_3.1-140      compiler_3.6.1   

This is an example of the PDFs I'm using. The entire batch fails, and the files come from different sources.
12.pdf

@jeroen
Member

jeroen commented Mar 15, 2020

The example PDF you posted works fine for me; I don't get an error:

txt <- pdf_text("~/Downloads/12.pdf")
cat(txt)

Are you sure you aren't accidentally feeding non-pdf files? What does your files variable contain?
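
One way to check this is sketched below. It tests whether each file starts with the "%PDF-" signature that a PDF file is expected to begin with. The helper name has_pdf_header is hypothetical and not part of pdftools, and the stricter "\\.pdf$" pattern is a suggestion rather than something from the original code: "pdf$" matches any file name that merely ends in "pdf". Some readers tolerate a few junk bytes before the signature, so a file that fails this check is suspicious rather than certainly broken.

```r
# Hypothetical helper (not part of pdftools): TRUE if the file begins with
# the "%PDF-" signature, FALSE otherwise (including files shorter than 5 bytes).
has_pdf_header <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  bytes <- readBin(con, what = "raw", n = 5L)
  identical(bytes, charToRaw("%PDF-"))
}

# Anchor the pattern on the extension; "pdf$" would also match a name like "notapdf".
files <- list.files(pattern = "\\.pdf$", ignore.case = TRUE)

# One logical value per file, named by file name.
looks_like_pdf <- vapply(files, has_pdf_header, logical(1))

# Files that do not start with a PDF signature and deserve a closer look.
print(files[!looks_like_pdf])
```

Files that pass the check can then go to lapply(files[looks_like_pdf], pdf_text). If the failure persists, wrapping each pdf_text call in tryCatch should show which file triggers it.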
