<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Data Carpentry contributors" />
<title>Intro to R</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="libs/navigation-1.1/tabsets.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; background-color: #f8f8f8; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
pre, code { background-color: #f8f8f8; }
code > span.kw { color: #204a87; font-weight: bold; }
code > span.dt { color: #204a87; }
code > span.dv { color: #0000cf; }
code > span.bn { color: #0000cf; }
code > span.fl { color: #0000cf; }
code > span.ch { color: #4e9a06; }
code > span.st { color: #4e9a06; }
code > span.co { color: #8f5902; font-style: italic; }
code > span.ot { color: #8f5902; }
code > span.al { color: #ef2929; }
code > span.fu { color: #000000; }
code > span.er { font-weight: bold; }
</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">Intro to R</h1>
<h4 class="author"><em>Data Carpentry contributors</em></h4>
</div>
<div id="TOC">
<ul>
<li><a href="#basics-of-r">Basics of R</a></li>
<li><a href="#presentation-of-rstudio">Presentation of RStudio</a></li>
<li><a href="#r-as-a-calculator">R as a calculator</a><ul>
<li><a href="#getting-help">Getting help</a></li>
</ul></li>
<li><a href="#need-for-scripts">Need for Scripts</a></li>
<li><a href="#creating-an-r-project">Creating an R Project</a></li>
<li><a href="#interacting-with-r">Interacting with R</a><ul>
<li><a href="#commenting">Commenting</a></li>
</ul></li>
<li><a href="#assignment-operator">Assignment operator</a></li>
<li><a href="#object-names">Object names</a><ul>
<li><a href="#using-named-arguments">Using named arguments</a></li>
<li><a href="#challenge">Challenge</a></li>
</ul></li>
<li><a href="#plunge-straight-into-the-survey-data">Plunge straight into the Survey Data</a><ul>
<li><a href="#objects-in-your-workspace">Objects in your workspace</a></li>
</ul></li>
<li><a href="#data-frames">Data frames</a><ul>
<li><a href="#challenge-1">Challenge</a></li>
<li><a href="#another-summary">Another summary</a></li>
</ul></li>
<li><a href="#inspecting-data-frames">Inspecting data frames</a></li>
<li><a href="#maybe-download-the-file-first">Maybe download the file first</a></li>
<li><a href="#indexing-sequences-and-subsetting">Indexing, Sequences, and Subsetting</a><ul>
<li><a href="#slices">Slices</a></li>
<li><a href="#challenge-2">Challenge</a></li>
</ul></li>
<li><a href="#missing-data">Missing data</a><ul>
<li><a href="#treating-blanks-as-missing">Treating blanks as missing</a></li>
</ul></li>
<li><a href="#factors">Factors</a><ul>
<li><a href="#converting-factors">Converting factors</a></li>
<li><a href="#challenge-3">Challenge</a></li>
<li><a href="#stringsasfactors">stringsAsFactors</a></li>
</ul></li>
</ul>
</div>
<hr />
<blockquote>
<h2>Learning Objectives</h2>
<ul>
<li>load external data (CSV files) in memory using the survey table (<code>surveys.csv</code>) as an example</li>
<li>explore the structure and the content of a data frame in R</li>
<li>understand what factors are and how to manipulate them</li>
<li>understand the concept of a <code>data.frame</code></li>
<li>use sequences</li>
<li>know how to access any element of a <code>data.frame</code></li>
</ul>
</blockquote>
<hr />
<div id="basics-of-r" class="section level2">
<h2>Basics of R</h2>
<p>R is a versatile, open source programming/scripting language that’s useful for both statistics and data science. It was inspired by the programming language S.</p>
<ul>
<li>Free/Libre/Open Source Software under the GPL.</li>
<li>Comparable, and often superior, to commercial alternatives. R has over 7,000 user-contributed packages at this time. It’s widely used in both academia and industry.</li>
<li>Available on all platforms.</li>
<li>Not just for statistics, but also general purpose programming.</li>
<li>For people who have experience in programming: R is both an object-oriented and a so-called <a href="http://adv-r.had.co.nz/Functional-programming.html">functional language</a>.</li>
<li>Large and growing community of peers.</li>
</ul>
</div>
<div id="presentation-of-rstudio" class="section level2">
<h2>Presentation of RStudio</h2>
<p>Let’s start by learning about RStudio, the Integrated Development Environment (IDE) that we will use to write code, navigate the files found on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages) that we will not cover during the workshop.</p>
<p>RStudio is divided into 4 “Panes”: the editor for your scripts and documents (top-left), the R console (bottom-left), your environment/history (top-right), and your files/plots/packages/help/viewer (bottom-right). The placement of these panes and their content can be customized.</p>
</div>
<div id="r-as-a-calculator" class="section level2">
<h2>R as a calculator</h2>
<p>The lower-left pane (the R “console”) is where you can interact with R directly. The <code>></code> sign is the R “prompt”. It indicates that R is waiting for you to type something.</p>
<p>You can type <kbd><code>Ctrl</code></kbd> + <kbd><code>Shift</code></kbd> + <kbd><code>2</code></kbd> to focus just on the R console pane. Use <kbd><code>Ctrl</code></kbd> + <kbd><code>Shift</code></kbd> + <kbd><code>0</code></kbd> to get back to the four panes. I use this when teaching but not otherwise.</p>
<p>Let’s start by subtracting a couple of numbers.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">2016</span> -<span class="st"> </span><span class="dv">1969</span></code></pre>
<pre><code>## [1] 47</code></pre>
<p>R does the calculation and prints the result, and then you get the <code>></code> prompt again. We won’t discuss what that number might mean. (The <code>[1]</code> in the results is a bit weird; you can ignore that for now.)</p>
<p>You can use R as a calculator in this way.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">4</span>*<span class="dv">6</span>
<span class="dv">4</span>/<span class="dv">6</span>
<span class="dv">4</span>^<span class="dv">6</span>
<span class="kw">log</span>(<span class="dv">4</span>)
<span class="kw">log10</span>(<span class="dv">4</span>)</code></pre>
<p>Here <code>log</code> is a <em>function</em> that gives you the natural logarithm. Many of your calculations in R will be done through functions. The value in the parentheses is called the function “argument”. <code>log10</code> is another function, which calculates the log base 10. There can be more than one argument, and some of them may have default values. <code>log</code> has a second argument <code>base</code> whose default (the value if left unspecified) is <em>e</em>.</p>
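<p>As a small sketch of a function call with more than one argument, the lines below pass <code>base</code> explicitly; the particular numbers are only illustrative.</p>
<pre class="sourceCode r"><code class="sourceCode r">log(8, base = 2)      # log base 2 of 8, which is 3
log(100, base = 10)   # same result as log10(100)
log(exp(1))           # base left at its default, e, so this is 1</code></pre>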
<div id="getting-help" class="section level3">
<h3>Getting help</h3>
<p>If you type <code>log</code> and pause for a moment, you’ll get a pop-up with information about the function.</p>
<p>Alternatively, you could type</p>
<pre class="sourceCode r"><code class="sourceCode r">?log</code></pre>
<p>and the documentation for the function will show up in the lower-right pane. These are often a bit <em>too</em> detailed, and so they take some practice to read. I generally focus on Usage and Arguments, and then on Examples at the bottom. We’ll talk more about getting help later.</p>
</div>
</div>
<div id="need-for-scripts" class="section level2">
<h2>Need for Scripts</h2>
<p>We can go along, typing directly into the R console. But there won’t be an easy way to keep track of what we’ve done.</p>
<p>It’s best to write R “scripts” (files with R code), and work from them. And when we start creating scripts, we need to worry about how we organize the scripts and data for a project.</p>
<p>And so let’s pause for a moment and talk about file organization.</p>
</div>
<div id="creating-an-r-project" class="section level2">
<h2>Creating an R Project</h2>
<p>It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the <strong>working directory</strong>. All of the scripts within this folder can then use <em>relative paths</em> to files that indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work.</p>
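<p>A minimal sketch of a relative path, assuming the project folder contains a <code>CleanData</code> subfolder (the one used later in this lesson); <code>rel_path</code> is just an illustrative name.</p>
<pre class="sourceCode r"><code class="sourceCode r">getwd()                # the current working directory (differs from computer to computer)
rel_path <- file.path("CleanData", "portal_data_joined.csv")
rel_path               # "CleanData/portal_data_joined.csv", relative to the working directory
file.exists(rel_path)  # TRUE only if the file is already in place</code></pre>
<p>Because the path never mentions where the project folder itself lives, the same line keeps working after the folder is moved or shared.</p>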
<p>RStudio provides a helpful set of tools to do this through its “Projects” interface, which can create a working directory for you (or use an existing one) and also remembers its location (allowing you to quickly navigate to it) and optionally preserves custom settings and open files to make it easier to resume work after a break. Below, we will go through the steps for creating an RProject for this tutorial.</p>
<p>We assume you’ve already got a <code>DataCarpentry</code> folder that you’re going to use.</p>
<ul>
<li>Start RStudio (presentation of RStudio -below- should happen here)</li>
<li>Under the <code>File</code> menu, click on <code>New project</code>, choose <code>Existing directory</code>, then click the “<code>Browse</code>” button and find your <code>DataCarpentry</code> folder.</li>
<li>Click on “Create project”</li>
<li>Create a new R script (File > New File > R script) and save it in your <code>Code</code> subfolder (e.g. <code>~/DataCarpentry/Code</code>)</li>
</ul>
</div>
<div id="interacting-with-r" class="section level2">
<h2>Interacting with R</h2>
<p>While you can type R commands directly at the <code>></code> prompt in the R console, I recommend typing your commands into a script, which you’ll save for later reference, and then executing the commands from there.</p>
<p>Start by typing the following into the R script in the top-left pane.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># R intro</span>
<span class="dv">2016</span> -<span class="st"> </span><span class="dv">1969</span></code></pre>
<p>Save the file by clicking the computer disk icon, or by typing <kbd><code>Ctrl</code></kbd> + <kbd><code>S</code></kbd>.</p>
<p>Now place the cursor on the line with <code>2016 - 1969</code> and type <kbd><code>Ctrl</code></kbd> + <kbd><code>Enter</code></kbd>. The command will be copied to the R console and executed, and then the cursor will move to the next line.</p>
<p>You can also highlight a bunch of code and execute the block all at once with <kbd><code>Ctrl</code></kbd> + <kbd><code>Enter</code></kbd>.</p>
<div id="commenting" class="section level3">
<h3>Commenting</h3>
<p>Use <code>#</code> signs to comment. Anything to the right of a <code>#</code> is ignored by R, meaning it won’t be executed. Comments are a great way to describe what your code does within the code itself, so comment liberally in your R scripts.</p>
</div>
</div>
<div id="assignment-operator" class="section level2">
<h2>Assignment operator</h2>
<p>We can assign names to numbers and other objects with the assignment operator, <code><-</code>. For example:</p>
<pre class="sourceCode r"><code class="sourceCode r">age <-<span class="st"> </span><span class="dv">2016-1969</span></code></pre>
<p>Type that into your script, and use <kbd><code>Ctrl</code></kbd> + <kbd><code>Enter</code></kbd> to paste it to the console.</p>
<p>You can also use <code>=</code> as assignment, but that symbol can have other meanings, and so I recommend sticking with the <code><-</code> combination.</p>
<p>In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> will write <code><-</code> in a single keystroke.</p>
<p>If you’ve assigned a number to an object, you can then use it in further calculations:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sqrt</span>(age)</code></pre>
<pre><code>## [1] 6.855655</code></pre>
<p>You can assign <em>that</em> to something, and then use it.</p>
<pre class="sourceCode r"><code class="sourceCode r">sqrt_age <-<span class="st"> </span><span class="kw">sqrt</span>(age)
<span class="kw">round</span>(sqrt_age)
<span class="kw">round</span>(sqrt_age, <span class="dv">2</span>)</code></pre>
</div>
<div id="object-names" class="section level2">
<h2>Object names</h2>
<p>Objects can be given any name such as <code>x</code>, <code>current_temperature</code>, or <code>subject_id</code>. You want your object names to be explicit and not too long.</p>
<p>They cannot start with a number (<code>2x</code> is not valid, but <code>x2</code> is).</p>
<p>R is case sensitive (e.g., <code>weight_kg</code> is different from <code>Weight_kg</code>). There are some names that cannot be used because they are the names of fundamental functions in R (e.g., <code>if</code>, <code>else</code>, <code>for</code>, see <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html">here</a> for a complete list).</p>
<p>In general, even if it’s allowed, it’s best to not use other function names (e.g., <code>c</code>, <code>T</code>, <code>mean</code>, <code>data</code>, <code>df</code>, <code>weights</code>). If in doubt, check the help to see if the name is already in use.</p>
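<p>To see why reusing such names can bite, here is a small sketch with <code>T</code>, which is an ordinary object (a shorthand for <code>TRUE</code>) rather than a reserved word.</p>
<pre class="sourceCode r"><code class="sourceCode r">T <- 0        # allowed, but now T no longer means TRUE
isTRUE(T)     # FALSE
rm(T)         # remove our object; R's built-in T is found again
isTRUE(T)     # TRUE</code></pre>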
<p>It’s also best to avoid dots (<code>.</code>) within a variable name as in <code>my.dataset</code>. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it’s best to avoid them. It is also recommended to use nouns for variable names, and verbs for function names. It’s important to be consistent in the styling of your code (where you put spaces, how you name variables, etc.). In R, two popular style guides are <a href="http://adv-r.had.co.nz/Style.html">Hadley Wickham’s</a> and <a href="https://google.github.io/styleguide/Rguide.xml">Google’s</a>.</p>
<div id="using-named-arguments" class="section level3">
<h3>Using named arguments</h3>
<p>For the <code>round()</code> function the 1st argument is the number to round and the 2nd is how many digits to round to. There are some functions that take many arguments and you can imagine that it might get confusing trying to keep them in order. In that case, it is often better to explicitly name the arguments. You can find the argument names using the help methods we discussed earlier.</p>
<p>Here is an example using named arguments with round. When the arguments are named the order doesn’t matter. You might enter the first few important arguments positionally, and later ones by naming them.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">round</span>(<span class="dt">x =</span> sqrt_age, <span class="dt">digits =</span> <span class="dv">2</span>)
<span class="kw">round</span>(<span class="dt">digits =</span> <span class="dv">2</span>, <span class="dt">x =</span> sqrt_age)
<span class="kw">round</span>(sqrt_age, <span class="dt">digits =</span> <span class="dv">2</span>)</code></pre>
<p>In the coming materials we may enter named arguments, as needed.</p>
</div>
<div id="challenge" class="section level3">
<h3>Challenge</h3>
<p>What is the value of <code>y</code> after doing the following?</p>
<pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="dv">50</span>
y <-<span class="st"> </span>x *<span class="st"> </span><span class="dv">2</span>
x <-<span class="st"> </span><span class="dv">80</span></code></pre>
<!-- end challenge -->
<p>Objects don’t get linked to each other, so if you change one, it won’t affect the values of any others.</p>
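<p>The same idea in a small sketch, with different (purely illustrative) names:</p>
<pre class="sourceCode r"><code class="sourceCode r">a <- 1
b <- a + 1   # b is computed once, from the current value of a
a <- 10      # changing a afterwards...
b            # ...leaves b at 2</code></pre>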
</div>
</div>
<div id="plunge-straight-into-the-survey-data" class="section level2">
<h2>Plunge straight into the Survey Data</h2>
<p>We will continue to look at the species and weight of animals caught in plots in a study area in Arizona over time. The dataset is stored as a CSV file: each row holds information for a single animal, and the columns represent:</p>
<table>
<thead>
<tr class="header">
<th align="left">Column</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">record_id</td>
<td align="left">Unique id for the observation</td>
</tr>
<tr class="even">
<td align="left">month</td>
<td align="left">month of observation</td>
</tr>
<tr class="odd">
<td align="left">day</td>
<td align="left">day of observation</td>
</tr>
<tr class="even">
<td align="left">year</td>
<td align="left">year of observation</td>
</tr>
<tr class="odd">
<td align="left">plot_id</td>
<td align="left">ID of a particular plot</td>
</tr>
<tr class="even">
<td align="left">species_id</td>
<td align="left">2-letter code</td>
</tr>
<tr class="odd">
<td align="left">sex</td>
<td align="left">sex of animal (“M”, “F”)</td>
</tr>
<tr class="even">
<td align="left">hindfoot_length</td>
<td align="left">length of the hindfoot in mm</td>
</tr>
<tr class="odd">
<td align="left">weight</td>
<td align="left">weight of the animal in grams</td>
</tr>
<tr class="even">
<td align="left">genus</td>
<td align="left">genus of animal</td>
</tr>
<tr class="odd">
<td align="left">species</td>
<td align="left">species of animal</td>
</tr>
<tr class="even">
<td align="left">taxa</td>
<td align="left">e.g. Rodent, Reptile, Bird, Rabbit</td>
</tr>
<tr class="odd">
<td align="left">plot_type</td>
<td align="left">type of plot</td>
</tr>
</tbody>
</table>
<p>The data are available at <a href="http://kbroman.org/datacarp/portal_data_joined.csv" class="uri">http://kbroman.org/datacarp/portal_data_joined.csv</a>.</p>
<p>We can read that data straight from the web, as follows</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"http://kbroman.org/datacarp/portal_data_joined.csv"</span>)</code></pre>
<div id="objects-in-your-workspace" class="section level3">
<h3>Objects in your workspace</h3>
<p>The objects you create get added to your “workspace”. You can list the current objects with <code>ls()</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ls</span>()</code></pre>
<p>RStudio also shows the objects in the Environment panel.</p>
</div>
</div>
<div id="data-frames" class="section level2">
<h2>Data frames</h2>
<p>The data are stored in what’s called a “data frame”. It’s a big rectangle, with rows being observations and columns being variables. The different columns can be different types (numeric, character, etc.), but they’re all the same length.</p>
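<p>To make the idea concrete, here is a sketch that builds a tiny data frame by hand. The column names mirror the survey table, but the values are made up for illustration, and <code>toy</code> is a hypothetical name.</p>
<pre class="sourceCode r"><code class="sourceCode r">toy <- data.frame(
  record_id  = 1:3,                  # integer column
  species_id = c("NL", "DM", "PF"),  # text column
  weight     = c(40, 48, 7)          # numeric column
)
toy   # three rows (observations), three columns (variables)</code></pre>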
<p>Use <code>head()</code> to view the first few rows.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(surveys)</code></pre>
<p>Use <code>tail()</code> to view the last few rows.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">tail</span>(surveys)</code></pre>
<p>Use <code>str()</code> to look at the structure of the data.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str</span>(surveys)</code></pre>
<p>This shows that there are 34786 rows and 13 columns, and then for each column, it gives information about the data type and shows the first few values.</p>
<p>For these data, the columns have two types: integer, or “Factor”. Factor columns are text with a discrete set of possible values. We’ll come back to this in a bit.</p>
<div id="challenge-1" class="section level3">
<h3>Challenge</h3>
<p>Study the output of <code>str(surveys)</code>. How are the missing values being treated?</p>
<!-- end challenge -->
</div>
<div id="another-summary" class="section level3">
<h3>Another summary</h3>
<p>Another useful function is <code>summary()</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(surveys)</code></pre>
<p>For the numeric data, you get six statistics (min, 25th percentile, median, mean, 75th percentile, and max). For the factors, you get a table with counts for the most-frequent “levels”, and then for the rest.</p>
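<p>As a quick sketch on a small, made-up numeric vector (not the survey data), the six statistics can be read off directly:</p>
<pre class="sourceCode r"><code class="sourceCode r">toy_weights <- c(40, 48, 7, 29, 46)
summary(toy_weights)
# Min. 7, 1st Qu. 29, Median 40, Mean 34, 3rd Qu. 46, Max. 48</code></pre>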
</div>
</div>
<div id="inspecting-data-frames" class="section level2">
<h2>Inspecting data frames</h2>
<p>We already saw how the functions <code>head()</code> and <code>str()</code> can be useful to check the content and the structure of a <code>data.frame</code>. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.</p>
<ul>
<li>Size:
<ul>
<li><code>dim()</code> - returns a vector with the number of rows in the first element, and the number of columns as the second element (the <strong>dim</strong>ensions of the object)</li>
<li><code>nrow()</code> - returns the number of rows</li>
<li><code>ncol()</code> - returns the number of columns</li>
</ul></li>
<li>Content:
<ul>
<li><code>head()</code> - shows the first 6 rows</li>
<li><code>tail()</code> - shows the last 6 rows</li>
</ul></li>
<li>Names:
<ul>
<li><code>names()</code> - returns the column names (synonym of <code>colnames()</code> for <code>data.frame</code> objects)</li>
<li><code>rownames()</code> - returns the row names</li>
</ul></li>
<li>Summary:
<ul>
<li><code>str()</code> - structure of the object and information about the class, length and content of each column</li>
<li><code>summary()</code> - summary statistics for each column</li>
</ul></li>
</ul>
<p>Note: most of these functions are “generic”, they can be used on other types of objects besides <code>data.frame</code>.</p>
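<p>A short sketch of these functions on a small, made-up data frame (the name <code>toy</code> and its values are illustrative):</p>
<pre class="sourceCode r"><code class="sourceCode r">toy <- data.frame(plot_id = c(2, 2, 3, 7), weight = c(40, 48, 7, 29))
dim(toy)          # 4 2: rows first, then columns
nrow(toy)         # 4
ncol(toy)         # 2
names(toy)        # "plot_id" "weight"
identical(names(toy), colnames(toy))   # TRUE for a data frame
head(toy, n = 2)  # the n argument changes how many rows are shown</code></pre>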
</div>
<div id="maybe-download-the-file-first" class="section level2">
<h2>Maybe download the file first</h2>
<p>It probably is best to first download the file, so that we have a local copy, and then read it in.</p>
<p>We could first use <code>download.file()</code> to download the data into the <code>CleanData/</code> subdirectory:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">download.file</span>(<span class="st">"http://kbroman.org/datacarp/portal_data_joined.csv"</span>,
<span class="st">"CleanData/portal_data_joined.csv"</span>)</code></pre>
<p>We could then use <code>read.csv()</code> to load the data into R, from the downloaded file:</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">'CleanData/portal_data_joined.csv'</span>)</code></pre>
</div>
<div id="indexing-sequences-and-subsetting" class="section level2">
<h2>Indexing, Sequences, and Subsetting</h2>
<p>We can pull out parts of a data frame using square brackets. We need to provide two values: row and column, with a comma between them.</p>
<p>For example, to get the element in the 1st row, 1st column:</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys[<span class="dv">1</span>,<span class="dv">1</span>]</code></pre>
<p>To get the element in the 2nd row, 7th column:</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys[<span class="dv">2</span>,<span class="dv">7</span>]</code></pre>
<p>To get the entire 2nd row, leave the column part blank:</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys[<span class="dv">2</span>,]</code></pre>
<p>And to get the entire 7th column, leave the row part blank:</p>
<pre class="sourceCode r"><code class="sourceCode r">sex <-<span class="st"> </span>surveys[,<span class="dv">7</span>]</code></pre>
<p>You can also refer to columns by name, in multiple ways.</p>
<pre class="sourceCode r"><code class="sourceCode r">sex <-<span class="st"> </span>surveys[, <span class="st">"sex"</span>]
sex <-<span class="st"> </span>surveys$sex
sex <-<span class="st"> </span>surveys[[<span class="st">"sex"</span>]]</code></pre>
<p>When we pull out a single column, the result is a “vector”. We can again use square brackets to pull out individual values, this time providing a single number.</p>
<pre class="sourceCode r"><code class="sourceCode r">sex[<span class="dv">1</span>]
sex[<span class="dv">10000</span>]</code></pre>
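<p>The different ways of pulling out a column all return the same vector. A sketch on a made-up data frame (<code>toy</code> and the <code>by_*</code> names are illustrative):</p>
<pre class="sourceCode r"><code class="sourceCode r">toy <- data.frame(record_id = 1:4, sex = c("M", "F", "F", "M"))
by_position <- toy[, 2]        # column by number
by_name     <- toy[, "sex"]    # column by name
by_dollar   <- toy$sex         # column with $
by_double   <- toy[["sex"]]    # column with double brackets
identical(by_position, by_name)   # TRUE
identical(by_name, by_dollar)     # TRUE
identical(by_dollar, by_double)   # TRUE
by_dollar[3]                      # the third value, F</code></pre>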
<div id="slices" class="section level3">
<h3>Slices</h3>
<p>You can pull out larger slices from the vector by providing vectors of indices. You can use the <code>c()</code> function to create a vector.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">5</span>)
sex[<span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">5</span>)]</code></pre>
<p>To pull out larger slices, it’s helpful to have ways of creating sequences of numbers.</p>
<p>First, the operator <code>:</code> gives you a sequence of consecutive values.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="dv">1</span>:<span class="dv">10</span>
<span class="dv">10</span>:<span class="dv">1</span>
<span class="dv">5</span>:<span class="dv">8</span></code></pre>
<p><code>seq</code> is more flexible.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">seq</span>(<span class="dv">1</span>, <span class="dv">10</span>, <span class="dt">by=</span><span class="dv">2</span>)
<span class="kw">seq</span>(<span class="dv">5</span>, <span class="dv">10</span>, <span class="dt">length.out=</span><span class="dv">3</span>)
<span class="kw">seq</span>(<span class="dv">50</span>, <span class="dt">by=</span><span class="dv">5</span>, <span class="dt">length.out=</span><span class="dv">10</span>)
<span class="kw">seq</span>(<span class="dv">1</span>, <span class="dv">8</span>, <span class="dt">by=</span><span class="dv">3</span>) <span class="co"># sequence stops to stay below upper limit</span>
<span class="kw">seq</span>(<span class="dv">10</span>, <span class="dv">2</span>, <span class="dt">by=</span>-<span class="dv">2</span>) <span class="co"># can also go backwards</span></code></pre>
<p>To get slices of our data frame, we can include a vector for the row or column indexes (or both)</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys[<span class="dv">1</span>:<span class="dv">3</span>, <span class="dv">7</span>] <span class="co"># first three elements in the 7th column</span>
surveys[<span class="dv">1</span>, <span class="dv">1</span>:<span class="dv">3</span>] <span class="co"># first three columns in the first row</span>
surveys[<span class="dv">2</span>:<span class="dv">4</span>, <span class="dv">6</span>:<span class="dv">7</span>] <span class="co"># rows 2-4, columns 6-7</span></code></pre>
</div>
<div id="challenge-2" class="section level3">
<h3>Challenge</h3>
<p>The function <code>nrow()</code> on a <code>data.frame</code> returns the number of rows.</p>
<p>Use <code>nrow()</code>, in conjunction with <code>seq()</code>, to create a new <code>data.frame</code> called <code>surveys_by_10</code> that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …).</p>
<!-- end challenge -->
</div>
</div>
<div id="missing-data" class="section level2">
<h2>Missing data</h2>
<p>As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as <code>NA</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r">heights <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">2</span>, <span class="dv">4</span>, <span class="dv">4</span>, <span class="ot">NA</span>, <span class="dv">6</span>)</code></pre>
<p>When doing operations on numbers, most functions will return <code>NA</code> if the data you are working with include missing values. It is a safer behavior as otherwise you may overlook that you are dealing with missing data. You can add the argument <code>na.rm=TRUE</code> to calculate the result while ignoring the missing values.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mean</span>(heights)
<span class="kw">max</span>(heights)
<span class="kw">mean</span>(heights, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)
<span class="kw">max</span>(heights, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)</code></pre>
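<p>As a rough illustration of the behavior described above, here is the same <code>heights</code> vector (redefined so the snippet runs on its own) with the result of each call noted in a comment. The final line, which counts the missing values, is an extra added here for illustration.</p>
<pre class="sourceCode r"><code class="sourceCode r">heights <- c(2, 4, 4, NA, 6)

mean(heights)                 # NA: the missing value propagates
max(heights)                  # NA
mean(heights, na.rm = TRUE)   # 4, i.e. (2 + 4 + 4 + 6) / 4
max(heights, na.rm = TRUE)    # 6

## count how many values are missing
sum(is.na(heights))           # 1</code></pre>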
<p>If your data include missing values, you may want to become familiar with the functions <code>is.na()</code> and <code>na.omit()</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r">## Extract those elements which are not missing values.
heights[!<span class="kw">is.na</span>(heights)]
## na.omit() returns the same values (plus an attribute recording what was removed)
<span class="kw">na.omit</span>(heights)</code></pre>
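<p>A small sketch of how these two functions relate, again using the <code>heights</code> vector. The names <code>kept</code> and <code>omitted</code> are made up for this example.</p>
<pre class="sourceCode r"><code class="sourceCode r">heights <- c(2, 4, 4, NA, 6)

is.na(heights)               # FALSE FALSE FALSE  TRUE FALSE
which(is.na(heights))        # 4: the position of the missing value

kept <- heights[!is.na(heights)]
kept                         # 2 4 4 6

omitted <- na.omit(heights)
as.vector(omitted)           # 2 4 4 6: the same values as 'kept'
attr(omitted, "na.action")   # records that element 4 was dropped</code></pre>
<p>Because <code>na.omit()</code> attaches that extra attribute, its result prints with some additional lines; <code>as.vector()</code> strips them if you only want the values.</p>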
<div id="treating-blanks-as-missing" class="section level3">
<h3>Treating blanks as missing</h3>
<p>Another aside: it’s probably best to treat those blanks as missing (<code>NA</code>). To do that, use the argument <code>na.strings</code> when reading the data. <code>na.strings</code> can be a vector of multiple character strings. A missing-value code must never also occur as a valid value, because every occurrence will be converted to the missing data code <code>NA</code>. Note also that the default for <code>na.strings</code> is <code>"NA"</code>, which will cause problems if <code>"NA"</code> is a valid value in your data (e.g., as an abbreviation for <code>"North America"</code>).</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys_noblanks <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"CleanData/portal_data_joined.csv"</span>, <span class="dt">na.strings=</span><span class="st">""</span>)</code></pre>
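<p>To illustrate passing several missing-value codes at once, here is a minimal sketch that reads a tiny made-up table from a string rather than from the survey file. The column names, the values, and the use of <code>-999</code> as a missing-value code are all assumptions for the example, not part of the survey data.</p>
<pre class="sourceCode r"><code class="sourceCode r">## a made-up three-row table, with a blank, an "NA" and a -999
csv_text <- "record,species,weight\n1,DM,40\n2,,NA\n3,PF,-999"

## treat blanks, "NA" and "-999" all as missing
toy <- read.csv(text = csv_text, na.strings = c("", "NA", "-999"))

is.na(toy$species)   # FALSE  TRUE FALSE
is.na(toy$weight)    # FALSE  TRUE  TRUE</code></pre>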
</div>
</div>
<div id="factors" class="section level2">
<h2>Factors</h2>
<p>Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.</p>
<p>Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.</p>
<p>Once created, factors can only contain a pre-defined set of values, known as <em>levels</em>. By default, R always sorts <em>levels</em> in alphabetical order. For instance, if you use <code>factor()</code> to create a factor with 2 levels:</p>
<pre class="sourceCode r"><code class="sourceCode r">sex <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="st">"male"</span>, <span class="st">"female"</span>, <span class="st">"female"</span>, <span class="st">"male"</span>))</code></pre>
<p>R will assign <code>1</code> to the level <code>"female"</code> and <code>2</code> to the level <code>"male"</code> (because <code>f</code> comes before <code>m</code>, even though the first element in this vector is <code>"male"</code>). You can check this by using the function <code>levels()</code>, and check the number of levels using <code>nlevels()</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">levels</span>(sex)
<span class="kw">nlevels</span>(sex)</code></pre>
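<p>To see the integer codes mentioned above, you can convert the factor with <code>as.integer()</code>. This sketch redefines <code>sex</code> so that it runs on its own:</p>
<pre class="sourceCode r"><code class="sourceCode r">sex <- factor(c("male", "female", "female", "male"))

levels(sex)         # "female" "male"
nlevels(sex)        # 2
as.integer(sex)     # 2 1 1 2: the code stored for each element
as.character(sex)   # "male" "female" "female" "male": the labels</code></pre>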
<p>Sometimes the order of the levels does not matter; other times you might want to specify a particular order.</p>
<pre class="sourceCode r"><code class="sourceCode r">food <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"high"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>, <span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>))
<span class="kw">levels</span>(food)
food <-<span class="st"> </span><span class="kw">factor</span>(food, <span class="dt">levels=</span><span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>))
<span class="kw">levels</span>(food)</code></pre>
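<p>Specifying <code>levels</code> changes the order in which the levels are listed, but the factor is still unordered. If you also want comparisons between levels to be meaningful, one option is the argument <code>ordered=TRUE</code>. A minimal sketch (the name <code>food_ord</code> is made up for this example):</p>
<pre class="sourceCode r"><code class="sourceCode r">food_ord <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

is.ordered(food_ord)    # TRUE
food_ord >= "medium"    # FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
max(food_ord)           # high</code></pre>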
<div id="converting-factors" class="section level3">
<h3>Converting factors</h3>
<p>If you need to convert a factor to a character vector, you use <code>as.character(x)</code>.</p>
<p>Converting factors where the levels appear as numbers (such as concentration levels) to a numeric vector is a little trickier. One method is to convert the factor to a character vector and then to numbers. Compare:</p>
<pre class="sourceCode r"><code class="sourceCode r">f <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="dv">1</span>, <span class="dv">5</span>, <span class="dv">10</span>, <span class="dv">2</span>))
<span class="kw">as.numeric</span>(f) ## wrong! and there is no warning...
<span class="kw">as.numeric</span>(<span class="kw">as.character</span>(f)) ## works...</code></pre>
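<p>The reason the first call goes wrong is that <code>as.numeric()</code> returns the integer codes of the factor rather than its labels. The sketch below shows the levels, the codes, and the two-step conversion, along with an alternative form that converts the levels once and then indexes them by the factor:</p>
<pre class="sourceCode r"><code class="sourceCode r">f <- factor(c(1, 5, 10, 2))

levels(f)                      # "1" "2" "5" "10"
as.numeric(f)                  # 1 3 4 2: positions in levels(f), not the values
as.numeric(as.character(f))    # 1 5 10 2
as.numeric(levels(f))[f]       # 1 5 10 2: same result, converting each level once</code></pre>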
</div>
<div id="challenge-3" class="section level3">
<h3>Challenge</h3>
<p>The function <code>table()</code> tabulates observations.</p>
<pre class="sourceCode r"><code class="sourceCode r">expt <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"treat1"</span>, <span class="st">"treat2"</span>, <span class="st">"treat1"</span>, <span class="st">"treat3"</span>, <span class="st">"treat1"</span>,
<span class="st">"control"</span>, <span class="st">"treat1"</span>, <span class="st">"treat2"</span>, <span class="st">"treat3"</span>)
expt <-<span class="st"> </span><span class="kw">factor</span>(expt)
<span class="kw">table</span>(expt)</code></pre>
<ul>
<li>In which order are the treatments listed?</li>
<li>How can you recreate this table with “<code>control</code>” listed last instead of first?</li>
</ul>
<!-- end challenge -->
<!---
```r
## Answers
##
## * The treatments are listed in alphabetical order because they are factors.
## * By redefining the order of the levels
expt <- c("treat1", "treat2", "treat1", "treat3", "treat1",
"control", "treat1", "treat2", "treat3")
expt <- factor(expt, levels=c("treat1", "treat2", "treat3", "control"))
table(expt)
```
--->
</div>
<div id="stringsasfactors" class="section level3">
<h3>stringsAsFactors</h3>
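<p>In versions of R before 4.0.0, when you read in data with <code>read.csv()</code>, columns with text are turned into factors by default. (From R 4.0.0 onward, the default is to keep them as character vectors.)</p>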
<p>You can avoid this with the argument <code>stringsAsFactors=FALSE</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r">surveys_chr <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"CleanData/portal_data_joined.csv"</span>, <span class="dt">stringsAsFactors=</span><span class="ot">FALSE</span>)</code></pre>
<p>Then when you look at the result of <code>str()</code>, you’ll see that the columns that were previously factors are now <code>chr</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str</span>(surveys_chr)</code></pre>
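<p>The effect of this argument can be sketched without the survey file by reading a tiny made-up table from a string. The column names and values here are assumptions for the example. Setting <code>stringsAsFactors</code> explicitly in both calls makes the result independent of the default in your version of R.</p>
<pre class="sourceCode r"><code class="sourceCode r">## a made-up table with one text column and one numeric column
csv_text <- "species,weight\nDM,40\nPF,8\nDM,42"

toy_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(toy_chr$species)    # "character"

toy_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(toy_fct$species)    # "factor"

## convert a single character column to a factor when you need one
toy_chr$species <- factor(toy_chr$species)
levels(toy_chr$species)   # "DM" "PF"</code></pre>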
</div>
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
bootstrapStylePandocTables();
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>