Invalid Font Weight/Illegal character #75

mfreyrie · 2020-02-27T08:17:56Z

Hi,
I've been struggling to import multiple PDFs. I need to create a corpus, but I keep getting the same error when I use pdftools to extract the text for the tm package. Importing a single PDF works, however.
This is what I do:

library(tm)
library(pdftools)

files <- list.files(pattern = "pdf$")
opinions <- lapply(files, pdf_text)

This is what I get:

PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
[...]
PDF error (218): Illegal character <2f> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.
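
The output above does not say which file triggered the failure, and the final `Error in poppler_pdf_text(...)` line stops `lapply()` at the first bad file. A minimal sketch for isolating the offending files is to catch the error per file instead of letting it abort the whole batch. `read_each` and its `reader` argument are hypothetical names introduced here for illustration; they are not part of pdftools or tm.

```r
# Hypothetical helper (not part of pdftools): apply a reader to every path
# and keep the error object for any file that fails, instead of stopping.
read_each <- function(paths, reader = pdftools::pdf_text) {
  results <- lapply(paths, function(p) {
    tryCatch(reader(p), error = function(e) e)
  })
  names(results) <- paths
  results
}

# Usage, assuming pdftools is installed and `files` is defined as above:
# opinions <- read_each(files)
# failed <- names(opinions)[vapply(opinions, inherits, logical(1), what = "error")]
# failed  # the files that could not be parsed
```

Passing the reader as an argument keeps the helper independent of pdftools, so the same wrapper can be tried with a different extraction function.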

My session info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 
[2] LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] broom_0.5.2     tm_0.7-7        NLP_0.2-0      
 [4] pdftools_2.3    tidytext_0.2.2  forcats_0.4.0  
 [7] stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3    
[10] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
[13] ggplot2_3.2.1   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] qpdf_1.1          tidyselect_0.2.5  slam_0.1-46      
 [4] haven_2.2.0       lattice_0.20-38   colorspace_1.4-1 
 [7] vctrs_0.2.0       generics_0.0.2    SnowballC_0.6.0  
[10] rlang_0.4.2       pillar_1.4.3      glue_1.3.1       
[13] withr_2.1.2       DBI_1.1.0         dbplyr_1.4.2     
[16] modelr_0.1.6      readxl_1.3.1      lifecycle_0.1.0  
[19] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
[22] rvest_0.3.5       parallel_3.6.1    tokenizers_0.2.1 
[25] Rcpp_1.0.3        scales_1.1.0      backports_1.1.5  
[28] jsonlite_1.6      fs_1.3.1          askpass_1.1      
[31] hms_0.5.3         stringi_1.4.3     grid_3.6.1       
[34] cli_2.0.1         tools_3.6.1       magrittr_1.5     
[37] lazyeval_0.2.2    janeaustenr_0.1.5 crayon_1.3.4     
[40] pkgconfig_2.0.3   zeallot_0.1.0     Matrix_1.2-17    
[43] xml2_1.2.2        reprex_0.3.0      lubridate_1.7.4  
[46] assertthat_0.2.1  httr_1.4.1        rstudioapi_0.11  
[49] R6_2.4.1          nlme_3.1-140      compiler_3.6.1

This is an example of the PDFs I'm using. The entire batch fails, including PDFs from different sources.
12.pdf


jeroen · 2020-03-15T00:22:22Z

The example PDF you posted works fine for me; I don't get an error:

txt <- pdf_text("~/Downloads/12.pdf")
cat(txt)

Are you sure you aren't accidentally feeding non-pdf files? What does your files variable contain?
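
One way to check this, as a rough sketch using only base R: the pattern `"pdf$"` matches any file name that ends in "pdf", so it is worth confirming that each matched file actually begins with the PDF header bytes `%PDF-`. `looks_like_pdf` is a hypothetical helper name, and the stricter `"\\.pdf$"` pattern is a suggestion rather than something from the original post. Some readers tolerate a few bytes before the header, so a `FALSE` here is a hint, not proof, that a file is unusable.

```r
# Hypothetical helper: TRUE if the file starts with the PDF magic bytes "%PDF-".
looks_like_pdf <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Match only names ending in ".pdf" (case-insensitive), then test each header.
files <- list.files(pattern = "\\.pdf$", ignore.case = TRUE)
is_pdf <- vapply(files, looks_like_pdf, logical(1))

files[!is_pdf]  # files that match the name pattern but lack a PDF header
```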
