-
Notifications
You must be signed in to change notification settings - Fork 3
/
10-table.Rmd
46 lines (31 loc) · 965 Bytes
/
10-table.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
```{r echo=FALSE}
knitr::opts_chunk$set(
comment = "#>",
warning = FALSE,
message = FALSE
)
```
# Summarize articles on disk {#on-disk}
The `ft_table()` function makes it easy to create a data.frame of the text of PDF, plain text, and XML files, together with DOIs/IDs for each article. It's similar to the `readtext::readtext()` function, but is much more specific to just this package.
With the output of `ft_table()` you can go directly into a text-mining package like `quanteda`.
## Usage {#ft_table-usage}
```{r}
library(fulltext)
```
Use `ft_table()` to pull out text from all articles.
```{r}
ft_table()
```
You can pull out just text from XML files
```{r}
ft_table(type = "xml")
```
You can pull out just text from PDF files
```{r}
ft_table(type = "pdf")
```
You can pull out XML but not extract the text. So you'll get XML strings
that you can parse yourself with xpath/css selectors/etc.
```{r}
ft_table(xml_extract_text = FALSE)
```