R is a high-level language, popular for running statistical analyses and short data-processing scripts.
While most users use R with one-off scripts, there are several reasons for creating an R package, including:
- It is easier to disseminate your code (for example, if you are publishing your code along with a paper, e.g. https://github.com/chris-mcginnis-ucsf/MULTI-seq).
- It may be easier for a user to collate standard analyses into functions in a common interface (e.g. https://github.com/YosefLab/VISION).
- It is easier to version-control a package than individual scripts.
Before we get started, you'll want to make sure you have the following pieces of software installed:
- R: https://www.r-project.org/
  - The latest R version is 3.6.1 (as of 2019-10-16), though if you have an earlier version that should be fine.
- RStudio: https://rstudio.com/
  - This is just a nice development environment.
- The R package devtools, which can be installed from within R: install.packages('devtools'). This is an extremely useful package for everything R-development.
- The R package roxygen2, which you can install from github like so: devtools::install_github('klutometis/roxygen'). Roxygen will help you generate documentation.
- The R package usethis, which you can install with install.packages("usethis").
- The R package testthat, which you can install with install.packages("testthat").
R packages range in complexity, but you need to have the following items:
- DESCRIPTION: this stores package metadata, including its name, version, authors, license, and required or suggested packages.
- NAMESPACE: a text file storing information pertaining to which functions you'd like to import from external packages, or export from your own package for users.
- R/: a folder that will store all source code for your package.
- man/: a folder storing all code documentation (refer to the section below entitled "Creating Documentation")
Optionally, you can include these items:
- LICENSE: a text file storing your license information.
- vignettes/: a folder storing R vignettes (i.e. tutorials). These are kind of like static Jupyter notebooks that users can follow.
- data/: a folder that stores any required package data.
- tests/: a folder storing testing scripts.
There are a couple of ways to begin an R package. Of course you can do this manually, but we'll create a new package using RStudio.
1. Begin by creating a new project from File > New Project:
2. Then select the correct type of project you'd like to start.
3. Name the project and define where you'd like to put it.
After doing this, you will have a folder called BaseballStats in your working directory, containing a folder for your code (R), a folder for documentation (man), and two files for storing important package metadata (DESCRIPTION and NAMESPACE).
Alternatives: you can also create R packages with devtools::create, utils::package.skeleton, or usethis::create_package.
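If you prefer the console to the RStudio dialog, the scaffolding step can be sketched with base R's utils::package.skeleton. This is only an illustration: it writes to a temporary directory, and the hello function is a hypothetical placeholder, needed because package.skeleton requires at least one object to package.

```r
# Scaffold a package in a throwaway location so nothing lands in your project.
pkg_parent <- file.path(tempdir(), "skeleton-demo")
dir.create(pkg_parent, showWarnings = FALSE)

# Hypothetical placeholder object; package.skeleton() needs something to package.
demo_env <- new.env()
demo_env$hello <- function() "hello"

utils::package.skeleton(
  name = "BaseballStats",
  list = "hello",
  environment = demo_env,
  path = pkg_parent,
  force = TRUE
)

# Inspect what was generated: DESCRIPTION, NAMESPACE, R/ and man/ should be present.
created <- list.files(file.path(pkg_parent, "BaseballStats"))
print(created)
```

The generated layout matches the required items listed earlier, plus a Read-and-delete-me file with next steps.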
The R directory will store all of your R code. If you already have code you're trying to package up, you can move it over to the R directory. Otherwise, you can begin writing code. For example, we'll add a file called statistics.R and add a single function for now:
compute_average <- function(num_hits, num_at_bats) {
  if (num_at_bats < 1) {
    stop("You need at least one at bat for a batting average!")
  }
  return(num_hits / num_at_bats)
}
You can make sure that this function is working by loading the package with devtools and testing out your function:
library(devtools)
load_all('.')
compute_average(10, 50)
[1] 0.2
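It is worth exercising the error branch as well. A minimal sketch, redefining compute_average inline so that it runs without loading the package, and using tryCatch to capture the message instead of halting:

```r
compute_average <- function(num_hits, num_at_bats) {
  if (num_at_bats < 1) {
    stop("You need at least one at bat for a batting average!")
  }
  return(num_hits / num_at_bats)
}

# Normal case
print(compute_average(10, 50))

# Error case: zero at-bats triggers stop(); keep the message as a string
result <- tryCatch(
  compute_average(10, 0),
  error = function(e) conditionMessage(e)
)
print(result)
```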
Creating documentation in R with Roxygen is extremely easy. Roxygen takes the comments beginning with #' that precede each function and automatically generates documentation from them. For example, using our function from before we can add the following documentation:
#' A function for computing a batter's average.
#'
#' This function takes in the number of at-bats and number of hits and will return an average.
#' @param num_hits Number of hits
#' @param num_at_bats Number of at bats
#' @return The batting average.
#' @examples
#' compute_average(10, 50)
compute_average <- function(num_hits, num_at_bats) {
  if (num_at_bats < 1) {
    stop("You need at least one at bat for a batting average!")
  }
  return(num_hits / num_at_bats)
}
Running devtools::document() will populate the man/ directory with your documentation. After running this function, you should notice two new .Rd files in your man/ directory: BaseballStats-package.Rd and compute_average.Rd.
You can then view your new documentation with the call ?compute_average.
For more advanced packages, you may want to develop from an object oriented (OO) perspective. Such an approach could be nice for wrapping up data (e.g. a gene expression matrix) or the parameters around an analysis to replicate downstream (e.g. the filtered gene list, normalization method, etc.).
R supports three different object oriented programming paradigms: S3, S4, and R5. Here we'll focus on S4 classes which are extremely flexible and similar to other object-oriented systems.
To begin, we'll create a file for declaring all classes and what data can be stored in each instance. This will go in the AllClasses.R file.
We'll begin by creating two classes: Player and Club.
In AllClasses.R we'll add the following code:
setClassUnion('numericORNULL', members = c('numeric', 'NULL'))
Player <- setClass("Player",
  slots = c(
    name = "character",
    num_at_bats = "numeric",
    num_hits = "numeric",
    is_pitcher = 'logical',
    era = 'numericORNULL'),
  prototype = list(
    name = character(),
    num_at_bats = 0,
    num_hits = 0,
    is_pitcher = FALSE,
    era = NULL
  ))
Club <- setClass("Club",
  slots = c(
    name = 'character',
    city = 'character',
    winning_percentage = 'numeric',
    players = 'list'),
  prototype = list(
    name = character(),
    city = character(),
    winning_percentage = 0.0,
    players = list()
  ))
S4 objects need to be instantiated using the new function (e.g. player = new('Player', ...)).
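To make the call to new concrete, here is a minimal sketch using a trimmed-down Player class (only three of the slots, an assumption made to keep the example short). It shows slots falling back to the prototype and S4 rejecting a value of the wrong type.

```r
library(methods)

# Trimmed-down version of the Player class, for illustration only
setClass("Player",
  slots = c(name = "character", num_at_bats = "numeric", num_hits = "numeric"),
  prototype = list(name = character(), num_at_bats = 0, num_hits = 0)
)

player <- new("Player", name = "Babe Ruth", num_hits = 10)
print(player@num_hits)     # slot set explicitly in the call to new()
print(player@num_at_bats)  # slot not supplied, so the prototype value is used

# Slots are type-checked: a character value for a numeric slot is an error
bad <- tryCatch(
  new("Player", name = "Babe Ruth", num_hits = "ten"),
  error = function(e) "rejected"
)
print(bad)
```

Slots are accessed with @, which is what the methods defined later rely on.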
To support classic object-oriented functionality, you might want to create class-specific functions, including a constructor function akin to Python's __init__ method.
For readability, you can create a .R file for each class you have, for instance methods-Player.R. We'll add the following code to methods-Player.R:
#' Initialize a new Player object.
#'
#' @param name Name of the player
#' @param num_at_bats Number of at-bats the player has had
#' @param num_hits Number of hits the player has had
#' @param is_pitcher Boolean indicating whether or not the player pitches
#' @param era The Earned Run Average (ERA) for a pitcher
#' @return Player object
Player <- function(name = "", num_at_bats = 0.0, num_hits = 0.0,
                   is_pitcher = FALSE, era = NULL) {
  .Object <- new('Player', name = name, num_at_bats = num_at_bats,
                 num_hits = num_hits, is_pitcher = is_pitcher, era = era)
  return(.Object)
}
#' Compute a Player's batting average
#'
#' @param object A Player
#' @return The player's batting average
setMethod("compute_batting_average", signature(object = "Player"),
  function(object) {
    return(compute_average(object@num_hits, object@num_at_bats))
  })
As for the Club class, we'll add the following to a file called methods-Club.R:
Club <- function(name = "", city = "", winning_percentage = NULL,
                 players = list(), num_wins = 0, num_games = 0) {
  if (is.null(winning_percentage)) {
    if (num_games < 1) {
      winning_percentage <- 0
    } else {
      winning_percentage <- num_wins / num_games
    }
  }
  .Object <- new('Club', name = name, city = city,
                 winning_percentage = winning_percentage, players = players)
  return(.Object)
}
#' Compute the team's batting average
#'
#' @param object A Club
#' @return The Club's batting average
setMethod("compute_batting_average", signature(object = "Club"),
  function(object) {
    num_hits <- sum(sapply(object@players, function(x) x@num_hits))
    num_at_bats <- sum(sapply(object@players, function(x) x@num_at_bats))
    return(compute_average(num_hits, num_at_bats))
  })
You may notice that both classes have a method called compute_batting_average. This takes advantage of S4's support for multiple dispatch, meaning that R looks at the class of the arguments (the "signature") to decide which method to call. For this to work, we need one more file in R/ that stores the generics these methods attach to: AllGenerics.R. By default R loads a package's files in alphabetical order, so AllGenerics.R is read before the methods-*.R files that depend on it. We'll add a single generic for now to AllGenerics.R:
setGeneric("compute_batting_average", function(object, ...) {
  standardGeneric("compute_batting_average")
})
Now we can get this functionality:
devtools::load_all()
p1 = Player('Babe Ruth', num_hits = 1000, num_at_bats = 2000)
p2 = Player('Joe Dimaggio', num_hits = 3000, num_at_bats = 5000)
yankees = Club(name = 'yankees', city='New York', players = list(p1, p2))
compute_batting_average(p1)
[1] 0.5
compute_batting_average(yankees)
[1] 0.5714286
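To check which classes a generic actually covers, you can query it with existsMethod. The sketch below is self-contained and uses stripped-down versions of the two classes (only the slots needed for the calculation, an assumption for brevity) rather than the package code itself.

```r
library(methods)

# Stripped-down classes, for illustration only
setClass("Player", slots = c(num_hits = "numeric", num_at_bats = "numeric"))
setClass("Club", slots = c(players = "list"))

setGeneric("compute_batting_average", function(object, ...) {
  standardGeneric("compute_batting_average")
})

# One implementation per class; R picks the right one from the argument's class
setMethod("compute_batting_average", signature(object = "Player"),
  function(object, ...) object@num_hits / object@num_at_bats)

setMethod("compute_batting_average", signature(object = "Club"),
  function(object, ...) {
    hits <- sum(vapply(object@players, function(p) p@num_hits, numeric(1)))
    at_bats <- sum(vapply(object@players, function(p) p@num_at_bats, numeric(1)))
    hits / at_bats
  })

p1 <- new("Player", num_hits = 1000, num_at_bats = 2000)
p2 <- new("Player", num_hits = 3000, num_at_bats = 5000)
club <- new("Club", players = list(p1, p2))

print(compute_batting_average(p1))    # dispatches to the Player method
print(compute_batting_average(club))  # dispatches to the Club method

# A method exists for Player but not for plain numerics
has_player_method <- existsMethod("compute_batting_average", "Player")
has_numeric_method <- existsMethod("compute_batting_average", "numeric")
print(c(has_player_method, has_numeric_method))
```

Calling the generic on a class with no method (for example a bare number) raises an error rather than silently returning something.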
A great practice in creating your R packages is to add testing functionality. This allows users to make sure that they've installed your package correctly, and helps you notice any functionality you may have broken in an update to the package.
For this purpose, I recommend the testthat package. To get started, we'll first use the package usethis to create our testing environment. As a note, usethis has a ton of other great functionality for adding structure to your package, but for now we'll use it to introduce unit tests.
We'll add one example test, testing that the batting average of a player is computed correctly:
require(usethis)
use_test("player")
✔ Setting active project to '/Users/student/BaseballStats'
✔ Adding 'testthat' to Suggests field in DESCRIPTION
✔ Creating 'tests/testthat/'
✔ Writing 'tests/testthat.R'
● Call `use_test()` to initialize a basic test file and open it for editing.
✔ Increasing 'testthat' version to '>= 2.1.0' in DESCRIPTION
✔ Writing 'tests/testthat/test-player.R'
● Modify 'tests/testthat/test-player.R'
You'll now notice that a window pops up to edit the new test file that you've added. You can add the following test:
test_that("player batting average works", {
  p1 = Player('joe', num_hits = 100, num_at_bats = 500)
  expect_equal(compute_batting_average(p1), 0.2)
})
You can make sure your tests work by using devtools::test().
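It can also be worth testing the failure path. The sketch below assumes testthat is installed and redefines compute_average inline so that it runs on its own. Inside the package you would drop that definition and place the test_that blocks in a file under tests/testthat/.

```r
library(testthat)

# Inline copy of the package function so the example is self-contained
compute_average <- function(num_hits, num_at_bats) {
  if (num_at_bats < 1) {
    stop("You need at least one at bat for a batting average!")
  }
  return(num_hits / num_at_bats)
}

test_that("batting average matches a hand-computed value", {
  expect_equal(compute_average(100, 500), 0.2)
})

test_that("batting average rejects zero at-bats", {
  # The second argument is a regular expression matched against the error message
  expect_error(compute_average(10, 0), "at least one at bat")
})
```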
Before publishing your code, you'll want to update your DESCRIPTION and NAMESPACE, as well as add a README to your package.
As noted before, the DESCRIPTION provides information about the authors of the package, its license, and any dependencies.
The NAMESPACE lets R know which functions you'd like to make available from your package (by exporting them) and which functions you'd like to use from your dependencies (by importing them).
The README provides the user with any further information about the package (similar to a README on github).
A good DESCRIPTION will look something like this:
Package: BaseballStats
Type: Package
Title: A package for computing statistics for baseball players
Version: 1.0
Date: 2019-10-14
Author: Matt Jones
Maintainer: Matt Jones <matts_email@email.com>
Description: BaseballStats provides an interface for storing player & ball club information as well as computing statistics around these items.
License: MIT + file LICENSE
RoxygenNote: 6.1.1
Suggests:
    testthat (>= 2.1.0),
    ggplot2,
    knitr
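DESCRIPTION uses the Debian control file format, so base R's read.dcf can parse it, which is a quick way to confirm the file is well formed. A minimal sketch, assuming a trimmed-down copy of the file written to a temporary directory. Note that entries in a multi-line field such as Suggests are separated by commas and continuation lines are indented.

```r
# Trimmed-down DESCRIPTION, written to a temporary location for illustration
description_lines <- c(
  "Package: BaseballStats",
  "Version: 1.0",
  "Suggests:",
  "    testthat (>= 2.1.0),",
  "    ggplot2,",
  "    knitr"
)
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(description_lines, desc_path)

# read.dcf returns a character matrix with one column per field
desc <- read.dcf(desc_path)
print(colnames(desc))

# Split the Suggests field into individual package requirements
suggests <- trimws(strsplit(desc[1, "Suggests"], ",")[[1]])
print(suggests)
```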
For now, you can leave the NAMESPACE as is, exporting all functions.
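As the package grows you may prefer to export functions explicitly. One common approach, sketched here under the assumption that roxygen2 is managing your NAMESPACE (it leaves a hand-written NAMESPACE alone), is to add an @export tag to the documentation block. The .validate_at_bats helper is hypothetical and only there to show an untagged, internal function.

```r
#' A function for computing a batter's average.
#'
#' @param num_hits Number of hits
#' @param num_at_bats Number of at bats
#' @return The batting average.
#' @export
compute_average <- function(num_hits, num_at_bats) {
  if (!.validate_at_bats(num_at_bats)) {
    stop("You need at least one at bat for a batting average!")
  }
  return(num_hits / num_at_bats)
}

# Hypothetical internal helper: no @export tag, so it would not be exported
.validate_at_bats <- function(num_at_bats) {
  num_at_bats >= 1
}

print(compute_average(10, 50))
```

With this in place, running devtools::document() would add an export(compute_average) line to NAMESPACE.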
Note also that you can provide a license (e.g. MIT) by using the usethis package, for example usethis::use_mit_license("Matt Jones"). This will update the License field of your DESCRIPTION file as well as create a new LICENSE file for your package.
To make sure that everything works correctly, you can run R CMD check . via the command line from within your package directory.
The easiest way to publish your R package is on github. This can be done simply by creating a repository for your code (e.g. https://github.com/mattjones315/BaseballStats). Users can then install your package with devtools using devtools::install_github('mattjones315/BaseballStats'), for example.
This can be really nice for making your code available while it's under development.
CRAN is the default package server for R, and requires a bit more information before publishing. Firstly, you'll want to update your README.md and create a new file called NEWS.md in the package home directory that details any new updates for each version bump. You can look at the NEWS.md of one of our recent packages.
You'll next want to use the command line function R CMD check . to verify that your code, documentation, and tests all pass.
Now, to submit your package to CRAN you need to build the package using devtools::build() (which will create a package bundle) and then manually upload this to http://cran.r-project.org/submit.html. These submissions are vetted by volunteers, and Hadley Wickham has some great advice around the entire submission process, namely how to make these gatekeepers look favorably upon your package: http://r-pkgs.had.co.nz/release.html.
Lastly, for specifically biology-related packages you can submit to Bioconductor to be hosted on their servers. This process tends to be a bit more rigorous but looks similar to a CRAN submission. The full process can be found here.
Vignettes are crucial parts of any R package, and primarily serve as an introduction to your package or as a tutorial for new features. To start your first vignette, I recommend using usethis::use_vignette:
library(usethis)
use_vignette('BaseballStats-Intro', 'How to get started with BaseballStats')
If you're in RStudio, this will automatically open up an Rmarkdown (.Rmd) file for you to edit. You'll notice that the key entries are already populated (e.g. the header and setup entries).
Rmarkdown writes very similarly to regular markdown, with the exception that it is compiled with knitr and is meant to have lots of embedded R code.
When you're done creating your vignette, you can see how it looks by clicking the Knit button at the top of your RStudio screen.
For a complete style guide, refer to this webpage.
R has a reputation for slow compute times and heavy memory use, which together can make vanilla R impractical for compute-intensive analyses.
To circumvent this, you can leverage Rcpp, which lets you call C++ code from within R. You can get started with usethis::use_rcpp(). This will create a directory for your C++ code, src/, as well as add the required dependencies to your DESCRIPTION file.
Before moving on, make sure that your NAMESPACE is correct - it should include these two lines:
useDynLib(BaseballStats)
importFrom(Rcpp, sourceCpp)
Read more about exactly how to use Rcpp in this great blog post.
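For a quick feel of Rcpp before wiring it into the package, you can compile a function inline with Rcpp::cppFunction. This is a sketch that assumes Rcpp and a working C++ compiler are installed. The name compute_average_cpp is made up for the example, and inside a package the C++ code would instead live in a file under src/.

```r
library(Rcpp)

# Compile a small C++ function and expose it to R under the same name.
# compute_average_cpp is a hypothetical C++ counterpart of compute_average.
cppFunction('
double compute_average_cpp(double num_hits, double num_at_bats) {
  if (num_at_bats < 1) {
    Rcpp::stop("You need at least one at bat for a batting average!");
  }
  return num_hits / num_at_bats;
}')

cpp_result <- compute_average_cpp(10, 50)
print(cpp_result)
```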
After creating a package you're ready to share with the world, you may want to create a website for hosting all documentation along the lines of readthedocs. One great way to do this is with pkgdown. After installing pkgdown, you can create your website as easily as with:
pkgdown::build_site()
pkgdown works similarly to Roxygen, in the sense that it takes work you've already done and populates a website for you. The build_site function will automatically generate a new folder, docs/, which will store your .html files, and will give you a preview of your new package website!
Shiny apps are powerful interactive web applications that are written in R. While you can develop a custom web-based application with your own HTML, CSS, and JavaScript code, Shiny provides a convenient approach that generates the requisite web-based code from your R code.
While creating a Shiny app falls outside the scope of this tutorial, refer to a great tutorial here by Zev Ross to get started on your first Shiny app.