forked from uqlibrary/technology-training
-
Notifications
You must be signed in to change notification settings - Fork 0
/
intro_to_programming.Rmd
508 lines (341 loc) · 13.4 KB
/
intro_to_programming.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
---
title: "Introduction to scientific programming"
author: "Stéphane Guillou"
date: "6 December 2018"
output: github_document
always_allow_html: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Welcome!
This workshop assumes no prior technical knowledge and focuses on:
* using a language interactively in a Command Line Interface
* introducing concepts and tools central to scientific programming
* working with source code efficiently
* automating tasks and enhancing reproducibility with small programs
This seminar will focus on the R language for a hands-on practice and extensive examples, but other languages will be presented, highlighting the differences between options.
The best way to learn is to practise, so we will quickly get into coding by all typing and executing the same commands. This is called "live-coding".
## Installation
Today, we will use R and RStudio. They are available on the library's training computers, but if you want to use your own laptop, please follow the instructions here: https://gitlab.com/stragu/CDS/blob/master/R/Installation.md
## Introduction
Let's start by adding our names and some info about us in our collaborative pad.
See how we can use some code to format our document? This is called **Markdown**, which is a concise language designed to easily create HTML documents. We are already coding!
### Why programming?
Programming allows you to use a programming language to build programs and **execute tasks**. There are many programs around, both Open Source and proprietary, that researchers can use to execute tasks by using a graphical user interface – or **GUI** – and clicking on buttons. You might however find that the programs you have been recommended do not help you with the specific task that you want to achieve, or at least do not do it efficiently.
Programming allows you to **automate** tasks by, for example, using loops to repeat a task many times, and creating functions to avoid writing the same code many times over.
Another added benefit to programming is that you can **share** your code with a team or with peers, and include it in a publication, in order to increase **reproducibility**.
Finally, because of the text-based nature of your program, you can use tools to keep track of **version history** and of **authorship** when collaborating with others.
### What language?
There are many languages you can use. They differ with their syntax, what they were designed for, and the size of the community that is actively using them.
In scientific computing, **R** and **Python** are two very popular languages. R was designed as a language for statistical analysis to be used interactively, whereas Python is more generalist.
### GUI vs CLI
A **GUI** gives the user visual elements to interact with the program, for example a button to click on in order to do a task.
A command-line interface, or **CLI**, takes an input (a command), executes it and sometimes prints some output.
Languages like R and Python can be used interactively in a command-line interface, which allows us to experiment and quickly get some results. However, if we need to create more complex series of steps for our data processing, it becomes more comfortable to work with **scripts**.
### IDE
An integrated development environment, or **IDE**, is a program that allows to write code more comfortably. Code is only text, and can therefore always be written in the most simple text editor, but an IDE makes it easier to work with code, test it, execute it, and sometimes find help and visualise output.
RStudio is an IDE primarily designed to work on R code.
## Programming with R
### Getting to know R
Let's open RStudio and start using the R language interactively.
The most simple thing we can do with R is using it as a calculator, with the help of **binary operators**.
For example, try executing the following commands:
R can be used like a calculator. Try the following commands:
```
3 * 4
10 / 2
11^6
```
### Objects
We can store data by creating objects, and assigning values to them with the **assignement operator** `<-`:
```
num1 <- 42
num2 <- x / 9
str1 <- "Hello world!"
```
You can use the shortcut <kbd>Alt</kbd> + <kbd>-</kbd> to type the assignement operator quicker.
> Different objects have different classes and will be handled differently.
### Using functions
An R function looks like this:
```
<functionname>(<argument(s)>)
```
In the console, we write a command and then evaluate it by pressing Enter.
For example, try running the following command:
```
sqrt(x = num1)
```
#### Finding help
There are two main ways to find help about a function inside RStudio:
1. the shortcut command: `?functionname`
1. the keyboard shortcut: press <kbd>F1</kbd> with your cursor in a function name
**Exercise 1** - Use the help pages to find out what these functions do:
* `c()`
* `sum()`
* `rm()`
* `length()`
* `list.files()`
`c()` concatenates the arguments into a vector.
`sum()` returns the sum of several numbers.
`seq()` creates a sequence of numbers from the parameters you give it.
`rm()` removes an object from your environment.
`length()` returns the length of a vector.
`list.files()` lists all the files in a directory.
Let's do some more complex operations by combining two functions:
`ls()` lists the objects in the current R environment.
For example, try running the `ls()` function after executing the command `num3 <- 42`.
You can remove *all* the objects in the environment by using `ls()` as the value for the `list` argument:
```
rm(list = ls())
```
> Arguments can be named, or can be automatically matched if used in order.
### Packages
Base R contains quite a few functions, and you can already build most things from scratch.
However, other people have already developped many functions for specific uses, and made them available to all via **packages**.
We can install packages with a command. For example, to install the package `praise`:
```{r eval=FALSE}
install.packages("praise")
```
We now need to **load** the package to access its functions:
```{r}
library(praise) # load the package
praise() # use a function
```
### Custom functions
We can create our own **custom function**, so we can reuse code easily:
```{r}
greeting <- function(name) {
paste0(name, " is my name. Hello World! :)")
}
```
> Use <kbd>Shift</kbd> + <kbd>Enter</kbd> to go to the line without executing the command.
If I try running my function without a value for the argument `name`, I will get an error.
I can include a default value if I want to:
```{r}
greeting <- function(name = "Jane Doe") {
paste0(name, " is my name. Hello World! :)")
}
```
### Loops
If you want to execute something many times, you can use a **loop**:
```{r eval=FALSE}
for (i in 1:100) {
print(":D")
}
```
### Logical operators, booleans and if statements
A **logical operator** returns a **boolean**: TRUE or FALSE.
Common ones are `==` for equality, `!=` for inequality, `>` for greater than, `<` for smaller than, `>=` for greater or equal, and `<=` for smaller or equal.
Try those examples:
```{r}
1 == 1
1 != 1
3 > 4
7 <= c(7, 8, 6)
```
With these tools, we can check for conditions, for example in an **`if` statement**:
```{r}
number = rnorm(1)
if (number > 0) {
print("Positive")
} else {
print("Negative")
}
```
### Indexing
**Indexing** allows use to access a part of an object.
```{r}
fact <- c("People", "are", "alright")
fact[1]
fact[3] <- "hopeless"
fact[4] <- "right"
fact[nchar(fact) < 7]
```
> Indexing in R starts at 1. Most other languages start at 0.
## An example project
Let's put what we just learned to use with a concrete example!
### Creating a project
File > New Project > New Directory > New Project
### Creating a script
Working from a **script** is a lot more comfortable, and allows us to create a succession of steps to reproduce our process – in other words, a little program.
The RStudio source pane helps us writing code thanks to:
* handy shortcuts
* syntax highlighting
* automatic indentation
* feedback on code errors
* commenting
After building a script, we can execute in one click of a button.
You can create a script from the File > New File > R Script.
### Our data
We are going to work on a dataset of ten books from Project Gutenberg. We want to get some numbers on them, and to search for the frequency of some terms. But we want to make sure that we **automate the process**, as we are likely to do that on many more books.
The data archive is located at https://cloudstor.aarnet.edu.au/plus/s/rqOEQQCjcwBv36x/download and we can use the following commands to download and unzip it:
```{r}
download.file(url = "https://cloudstor.aarnet.edu.au/plus/s/O4XHRi4VqiaAmMK/download",
destfile = "data.zip")
unzip("data.zip")
```
### Import a book
```{r}
Anthem <- readLines("data/AynRand_Anthem.txt")
```
### Clean up the data
`stringr` is a useful library to work with strings.
```{r}
library(stringr)
```
> You can find a handy cheatsheet for this package here: https://github.com/rstudio/cheatsheets/raw/master/strings.pdf
The books all have a header and a footer that we want to remove.
```{r}
start <- str_which(Anthem, "START OF TH") + 1
end <- str_which(Anthem, "END OF TH") - 1
Anthem_sub <- Anthem[start:end]
```
We also want to convert everything to lowercase:
```{r}
Anthem_lower <- str_to_lower(Anthem_sub)
```
And remove the empty lines:
```{r}
Anthem_clean <- Anthem_lower[Anthem_lower != ""]
```
### Analyse the book
We want to find how characters the book contains:
```{r results="hide"}
nchar(Anthem_clean)
sum(nchar(Anthem_clean)) # base R
sum(str_length(Anthem_clean)) # stringr
```
> There are often many ways to do one thing.
We also want to find specific terms, and count them:
```{r results="hide"}
# find terms visually
str_view(Anthem_clean, "the")
# count specific terms
str_count(Anthem_clean, "love")
sum(str_count(Anthem_clean, "love"))
```
How many words does the book contain?
```{r results="hide"}
str_count(Anthem_clean, "\\w+")
sum(str_count(Anthem_clean, "\\w+"))
```
> **Regular expressions**, or **RegEx**, can be used to detect any pattern and create complex queries.
And how many lines?
```{r}
length(Anthem_clean)
```
### Creating functions
We want to create functions so we can reuse the code easily, on any book.
We can separate our process in two steps:
1. Importing and cleaning the data
2. Analysing the data
Let's start with the function to import and clean up a book:
```{r}
import_clean <- function(file) {
book <- readLines(file)
start <- str_which(book, "START OF TH") + 1
end <- str_which(book, "END OF TH") - 1
book <- book[start:end]
book <- str_to_lower(book)
}
```
#### Challenge N {.tabset}
##### Instructions
There's one more step missing: removing the empty strings. Can you add it to the function?
##### Solution
```{r}
import_clean <- function(file) {
book <- readLines(file)
start <- str_which(book, "START OF TH") + 1
end <- str_which(book, "END OF TH") - 1
book <- book[start:end]
book <- str_to_lower(book)
book <- book[book != ""]
}
```
####
Let's test our function:
```{r}
JaneEyre <- import_clean("data/CharlotteBronte_JaneEyre.txt")
```
Now, let's create the function that analyses the data. It needs to output a list that contains our stats:
```{r}
book_stats <- function(book) {
characters <- sum(str_length(book))
words <- sum(str_count(book, "\\w+"))
lines <- length(book)
results <- list(char_count = characters,
word_count = words,
line_count = lines)
results
}
```
#### Challenge N {.tabset}
##### Instructions
We've got the characters, the words and the lines. How would you add the search term data to your function?
##### Solution
```{r}
book_stats <- function(book, search_term) {
characters <- sum(str_length(book))
words <- sum(str_count(book, "\\w+"))
lines <- length(book)
terms <- sum(str_count(book, search_term))
results <- list(char_count = characters,
word_count = words,
line_count = lines,
term_count = terms)
results
}
```
####
Let's test our second function on the book we just imported:
```{r}
book_stats(JaneEyre, "love")
```
### Loop through all files
We will now use our functions on all our files, and store the data into a dataframe. But to save us some typing, we will use a loop.
First, let's create an empty dataframe:
```{r}
data <- data.frame()
```
Then, let build our `for` loop. We want to add an extra variable in there: the name of the file the data comes from.
```{r}
for (filename in list.files("data")) {
book <- import_clean(paste0("data/", filename))
results <- book_stats(book, "love")
results <- append(list(file = filename), results)
data <- rbind(data, results, stringsAsFactors = FALSE)
}
```
After executing our loop, we can visualise our dataset with:
```{r eval=FALSE}
View(data)
```
### Create new variables
`dplyr` is a useful package to manipulate data.
```{r results="hide"}
library(dplyr)
data <- data %>%
mutate(word_length = char_count / word_count,
love_score = term_count / word_count) %>%
arrange(love_score)
View(data)
```
### Visualise
`ggplot2` is a useful package to visualise data.
```{r}
library(ggplot2)
ggplot(data,
aes(x = reorder(file, love_score),
y = love_score)) +
geom_point() +
coord_flip() +
ylim(c(0, NA))
```
## Experiment with code!
## Other tools
### Python: another language
### Git: revision control and collaboration
### Bash: control a computer
## What next?