---
title: "01_sample_data"
author: "Jae Yeon Kim"
date: "6/29/2020"
output: html_document
---
# Import libs and files
## Libs
```{r}
pacman::p_load(data.table, # for fast data import
               tidyverse,  # for tidyverse
               here)       # for reproducibility
```
## Files
```{r}
cle <- data.table::fread(here("raw_data", "clean_language_en.tsv"))
```
# Sample Tweet IDs
## Create a stratifying variable
```{r}
# Strip the hyphens from V2, then drop the last two characters to get a month key
cle$month <- cle$V2 %>%
  str_replace_all("-", "") %>%
  str_replace_all(".{2}$", "")
```
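The two replacements above can be checked on a few values. This is a minimal sketch, assuming `V2` holds dates written as `YYYY-MM-DD` (the source does not state the format); `dates` and `month_key` are hypothetical names used only for illustration.

```r
library(stringr)

# Hypothetical dates; the YYYY-MM-DD format of cle$V2 is an assumption
dates <- c("2020-03-22", "2020-03-31", "2020-04-01")

# Step 1: remove the hyphens ("2020-03-22" becomes "20200322")
no_hyphens <- str_replace_all(dates, "-", "")

# Step 2: drop the last two characters, leaving year and month
month_key <- str_replace_all(no_hyphens, ".{2}$", "")

month_key
# "202003" "202003" "202004"
```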
## Sample
```{r}
# For reproducibility
set.seed(1234)
# Random sampling stratified by month (up to 1,000,000 rows per month)
sampled <- cle %>%
  group_by(month) %>%
  slice_sample(n = 1000000,
               replace = FALSE)
```
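How the stratified draw behaves when a month has fewer rows than `n` can be seen on toy data. This is a sketch under stated assumptions: `toy` and `toy_sampled` are hypothetical, and the truncation shown is the behavior of `slice_sample()` in dplyr 1.0.0 and later when `replace = FALSE`.

```r
library(dplyr)

# Hypothetical toy data: 5 rows in one month, 2 rows in another
toy <- tibble(
  id = 1:7,
  month = c(rep("202003", 5), rep("202004", 2))
)

set.seed(1234)

# Ask for 3 rows per month without replacement
toy_sampled <- toy %>%
  group_by(month) %>%
  slice_sample(n = 3, replace = FALSE) %>%
  ungroup()

# Months with fewer than 3 rows are returned whole rather than raising an error:
# 202003 contributes 3 rows, 202004 contributes its 2 rows
count(toy_sampled, month)
```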
# Export
```{r}
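# dir.create("../processed_data")
# Full data. Note that [-1, ] drops the first row of `sampled`.
# sep = "\t" is set because fwrite() defaults to commas and the files are named .tsv.
fwrite(sampled[-1, ], here("processed_data", "sampled.tsv"), sep = "\t")
# Only Tweet IDs (first column). This file will be used for hydrating.
fwrite(sampled[-1, 1], here("processed_data", "sampled1.tsv"), sep = "\t")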
```
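`fwrite()` uses a comma as its default separator, so a tab has to be requested explicitly when the output is meant to be a `.tsv` file. A minimal sketch on a hypothetical table written to a temporary file (`toy`, `path`, `lines`, and `back` are illustrative names, not part of the project):

```r
library(data.table)

# Hypothetical two-row table standing in for the sampled data
toy <- data.table(V1 = c("101", "102"),
                  V2 = c("2020-03-22", "2020-04-01"))

# Write to a temporary file with an explicit tab separator
path <- tempfile(fileext = ".tsv")
fwrite(toy, path, sep = "\t")

# The header line now contains a tab between the column names
lines <- readLines(path)

# Reading the file back recovers both rows and both columns
back <- fread(path)
dim(back)
```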