4 changes: 4 additions & 0 deletions .gitignore
@@ -7,3 +7,7 @@
inst/doc

*.code-workspace
.codegpt

# Python
.env
2 changes: 2 additions & 0 deletions DESCRIPTION
@@ -30,9 +30,11 @@ Suggests:
    kableExtra,
    knitr,
    readr,
    reticulate,
    rmarkdown,
    survival,
    testthat (>= 3.0.0),
    tibble
Config/testthat/edition: 3
VignetteBuilder: knitr

2 changes: 2 additions & 0 deletions scope-docs/.gitignore
@@ -0,0 +1,2 @@
/.quarto/
dist
23 changes: 23 additions & 0 deletions scope-docs/_quarto.yml
@@ -0,0 +1,23 @@
project:
  type: website
  output-dir: ./dist

website:
  title: v1.0 Scope
  sidebar:
    style: docked
    search: true
    contents:
      - section: Scope
        contents:
          - metadata.qmd
          - labels.qmd
          - missing-data.qmd
          - derived-variables.qmd
          - logging.qmd
          - versioning.qmd
          - out-of-scope.qmd
format:
  html:
    theme: cosmo
    toc: true
244 changes: 244 additions & 0 deletions scope-docs/derived-variables.qmd
@@ -0,0 +1,244 @@
---
title: Derived Variables
format:
  html:
    embed-resources: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = FALSE,
  warning = TRUE,
  message = TRUE,
  error = FALSE
)

# Load required packages
library(haven)
library(reticulate)
library(knitr)
library(magrittr)
```

The variable details sheet within the recodeflow library enables a user to
encode the rules to create a new variable from one or more starting variables.
Broadly, any rule that maps the values of one variable into another can be
encoded in the variable details sheet. Examples include rules that map one or
more categories of a variable into a new category or a rule that maps an
interval of a continuous variable into a category. Variables that cannot be
created with simple mapping rules can be recoded by using a function. Examples
include variables that require mathematical operations and conditional logic.

Any variable that requires a function for its creation is considered a derived
variable. A good example is BMI, which requires mathematical operations for its
calculation and, optionally, conditional logic to validate its inputs. This
document goes over the scope and specifications for such derived variables and
the functions used to derive them.
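
To make the distinction concrete, the base R sketch below contrasts the two
kinds of rule. It does not use recodeflow or the variable details sheet; the
data values, the age bands, and the category names are assumptions made up for
illustration.

```r
# Illustrative data; the values are invented for this sketch
age    <- c(10, 18, 70)
weight <- c(35, 70, 80)          # kilograms
height <- c(1.40, 1.75, 1.80)    # metres

# Simple mapping rule: each interval of a continuous variable maps to one
# category. With right = FALSE the intervals are [0, 18), [18, 65), [65, Inf).
age_group <- cut(
  age,
  breaks = c(0, 18, 65, Inf),
  labels = c("child", "adult", "senior"),
  right = FALSE
)

# Derived variable: needs arithmetic on two inputs, so it cannot be written
# as a value-to-value mapping and requires a function or expression instead
bmi <- weight / height^2
```

The first rule could be written down as rows in a table, one per interval. The
second cannot, which is what makes BMI a derived variable in the sense used
here.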

## Function Usage

Users should be able to use any derived function without needing to install
recodeflow. The library should not force users to write functions in a way
that prevents them from, for example, being copy-pasted and used in an R
environment where recodeflow is not installed. A number of organizations host
personal data and therefore restrict the use of external libraries. Allowing
derived functions to be used without recodeflow lets analysts within these
organizations use derived functions defined in flow universe packages like
[cchsflow](https://big-life-lab.github.io/cchsflow/), enabling collaboration.
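
As a minimal sketch of what this portability could look like, the function
below depends only on base R, so it can be pasted into a session where
recodeflow is not installed. The name `derive_bmi` and the `study_data` data
frame are hypothetical and are not taken from recodeflow or cchsflow.

```r
# Hypothetical derived function that uses base R only
derive_bmi <- function(weight, height) {
  ifelse(
    is.na(weight) | is.na(height),
    NA_real_,
    weight / (height * height)
  )
}

# Applied directly to a plain data frame, with no recodeflow call involved
study_data <- data.frame(
  weight = c(80, 65, NA),        # kilograms
  height = c(1.80, 1.60, 1.70)   # metres
)
study_data$bmi <- derive_bmi(study_data$weight, study_data$height)
```

Because nothing in the function body refers to the library, the same
definition could be used both inside a recodeflow workflow and on its own.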

## Function Language

The library should support only functions written in R, even though the R
ecosystem is moving towards allowing its users to execute code from other
languages. An example is the
[reticulate](https://rstudio.github.io/reticulate/index.html) package, which
enables the execution of arbitrary Python code within a .R file. For example:

```{r, python-in-R, echo=TRUE}
# Python code that creates a function to calculate BMI
BMI_python_code <- "def BMI(weight, height):
    return weight/(height*height)
"
# Run reticulate to bring the Python function over to the R environment enabling
# us to call it using R code
reticulate_output <- reticulate::py_run_string(BMI_python_code)
# Calculate BMI using the created Python function
BMI_python <- reticulate_output$BMI(25, 5)
# Expect the calculated BMI to be 1
stopifnot(BMI_python == 1)
print(paste("Calculated BMI:", BMI_python))
```

To reduce the library scope, currently only R functions will be allowed.

## Function Inputs and Outputs

The function inputs and outputs should be standardized to accept either scalar
or vector values.

For example, a function that calculates BMI will take as input height and
weight. The library should make a decision on whether the inputs are single
values, for example `80` and `180`, or vectors of several values, for example
`c(80, 72)` and `c(180, 165)`.

This decision will have implications on the performance, user experience, and
portability of the function as described below.

R, being a statistical language, is optimized to operate on vectors. However, a
function accepting scalar inputs can be easily vectorized using the
[Vectorize](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Vectorize.html)
function available in base R.

The code below displays a table that compares the performance of calculating BMI
using:

1. Scalar inputs
2. Vector inputs
3. A vectorized version of 1 created using the `Vectorize` function

```{r}
# Scalar version: expects a single weight and a single height
scalar_BMI <- function(weight, height) {
  if (is.na(weight) || is.na(height)) {
    return(NA_real_)
  }
  weight / (height * height)
}

# Vector version: operates on whole vectors at once
vector_BMI <- function(weight, height) {
  ifelse(
    is.na(weight) | is.na(height),
    NA_real_,
    weight / (height * height)
  )
}

# Vectorized wrapper around the scalar version
vectorized_BMI <- Vectorize(scalar_BMI)

# Create larger test dataset for better comparison
test_data <- data.frame(
  weight = sample(1:100, 1000, replace = TRUE),
  height = sample(1:100, 1000, replace = TRUE)
)

# Run time of func in milliseconds
get_func_perf <- function(func) {
  start_time <- Sys.time()
  func()
  end_time <- Sys.time()
  # Fix the units to seconds; difftime otherwise chooses them automatically
  run_time <- difftime(end_time, start_time, units = "secs")
  as.numeric(run_time) * 1000
}

# Original methods
scalar_run_time <- get_func_perf(function() {
  # Preallocate so the timing reflects scalar_BMI, not vector growth
  bmi <- numeric(nrow(test_data))
  for (i in seq_len(nrow(test_data))) {
    bmi[i] <- scalar_BMI(test_data[i, "weight"], test_data[i, "height"])
  }
})

vector_run_time <- get_func_perf(function() {
  vector_BMI(test_data$weight, test_data$height)
})

vectorized_run_time <- get_func_perf(function() {
  vectorized_BMI(test_data$weight, test_data$height)
})

# Dplyr methods - testing only vector_BMI and vectorized_BMI
dplyr_vector_run_time <- get_func_perf(function() {
  test_data %>%
    dplyr::mutate(bmi = vector_BMI(weight, height))
})

dplyr_vectorized_run_time <- get_func_perf(function() {
  test_data %>%
    dplyr::mutate(bmi = vectorized_BMI(weight, height))
})

# Symmetric percent difference between two run times
calculate_percent_diff <- function(num1, num2) {
  (abs(num1 - num2) / ((num1 + num2) / 2)) * 100
}

# Enhanced results table with all valid methods
run_time_table <- data.frame(
  "Type" = c("Scalar",
             "Vector",
             "Vectorized",
             "Dplyr (vector_BMI)",
             "Dplyr (vectorized_BMI)"),
  "Run Time in ms" = c(scalar_run_time,
                       vector_run_time,
                       vectorized_run_time,
                       dplyr_vector_run_time,
                       dplyr_vectorized_run_time),
  "Percent Difference" = c(
    "Reference",
    calculate_percent_diff(scalar_run_time, vector_run_time),
    calculate_percent_diff(scalar_run_time, vectorized_run_time),
    calculate_percent_diff(scalar_run_time, dplyr_vector_run_time),
    calculate_percent_diff(scalar_run_time, dplyr_vectorized_run_time)
  ),
  # Keep the column names as written instead of converting spaces to dots
  check.names = FALSE
)

knitr::kable(run_time_table)
```

In summary, ordered from slowest to fastest, the approaches are scalar,
vectorized, and vector.

Ergonomically, R users may be more comfortable with vectors, since datasets are
usually stored in data frames and individual columns can be extracted as
vectors using the \$ operator. Users of non-statistical languages will be more
comfortable with scalars, since they usually deal with scalar values.

Scalar functions are also more easily ported to non-statistical languages, which
becomes a concern if the recoding rules need to run, for example, in a web
browser, where JavaScript is the programming language of choice.

## Supported Operations

The library should allow the user to use any R code within the function, for
example, code that uses external libraries or sources other R files.

## Tagged NA

The need for tagged NA values has been described elsewhere. An issue with
derived functions is that users can easily return NA values that are not tagged.
The library should provide appropriate warnings to users when an NA value is not
tagged or is tagged with an unrecognized value. For example:

```{r}
# Returns a plain NA when an input is missing
BMI_non_tagged_NA <- function(weight, height) {
  if (is.na(weight) || is.na(height)) {
    return(NA)
  }
  return(weight / (height * height))
}

# Returns an NA tagged with "a" when an input is missing
BMI_tagged_NA <- function(weight, height) {
  if (is.na(weight) || is.na(height)) {
    return(haven::tagged_na("a"))
  }
  return(weight / (height * height))
}
```

The `BMI_non_tagged_NA` function, when run by the library, should warn the user
that they have returned an NA value that is not tagged. The `BMI_tagged_NA`
function should pass with no issues since the user is returning a tagged NA
value.
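
One way such a check could work is sketched below. The helper
`check_na_tags()` and its `recognized_tags` argument are hypothetical and are
not part of recodeflow or haven; the sketch relies only on
`haven::tagged_na()` and `haven::is_tagged_na()`, and the set of recognized
tags is an assumption.

```r
# Hypothetical helper: inspects the values returned by a derived function and
# warns about NA values that are untagged or carry an unrecognized tag
check_na_tags <- function(values, recognized_tags = c("a", "b")) {
  missing <- is.na(values)
  tagged <- haven::is_tagged_na(values)

  # TRUE where the NA carries one of the recognized tags
  recognized <- rep(FALSE, length(values))
  for (tag in recognized_tags) {
    recognized <- recognized | haven::is_tagged_na(values, tag)
  }

  n_untagged <- sum(missing & !tagged)
  n_unrecognized <- sum(tagged & !recognized)

  if (n_untagged > 0) {
    warning(n_untagged, " NA value(s) are not tagged")
  }
  if (n_unrecognized > 0) {
    warning(n_unrecognized, " NA value(s) have an unrecognized tag")
  }
  invisible(list(untagged = n_untagged, unrecognized = n_unrecognized))
}

# One valid value, one recognized tag, one unrecognized tag, one plain NA
derived_values <- c(22.5, haven::tagged_na("a"), haven::tagged_na("z"), NA)
tag_report <- suppressWarnings(check_na_tags(derived_values))
```

Returning the counts invisibly keeps the helper usable both interactively,
where the warnings matter, and in tests, where the counts can be asserted.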

## Best practices

Keeping in mind that users will be writing derived functions on their own, the
library should provide guidance on how to write good functions. This can include
but is not limited to:

- Properly documenting the function behaviour, inputs, and outputs
- Providing examples of running the function using different input types like
scalars, vectors, data frames etc.
- Providing test cases for the functions
- Handling invalid input values and making validations like mins and maxes
easily customizable by other users by including them as function parameters
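
The points above could look something like the following sketch. The function
name `calculate_bmi`, its parameter names, and the default bounds are
assumptions chosen for illustration, not values prescribed by the library. To
stay within base R it returns a plain `NA`; a function written for recodeflow
would return a tagged NA instead, as described in the previous section.

```r
#' Calculate body mass index (BMI)
#'
#' @param weight Numeric vector of weights in kilograms.
#' @param height Numeric vector of heights in metres, same length as `weight`.
#' @param min_weight,max_weight Plausible weight range; values outside it are
#'   treated as invalid.
#' @param min_height,max_height Plausible height range; values outside it are
#'   treated as invalid.
#' @return Numeric vector of BMI values, `NA` where an input is missing or out
#'   of range.
#' @examples
#' calculate_bmi(80, 1.80)
#' calculate_bmi(c(80, NA), c(1.80, 1.65))
calculate_bmi <- function(weight, height,
                          min_weight = 20, max_weight = 300,
                          min_height = 0.8, max_height = 2.5) {
  stopifnot(is.numeric(weight), is.numeric(height),
            length(weight) == length(height))
  valid <- !is.na(weight) & !is.na(height) &
    weight >= min_weight & weight <= max_weight &
    height >= min_height & height <= max_height
  ifelse(valid, weight / height^2, NA_real_)
}

# Minimal test cases using base R only
stopifnot(isTRUE(all.equal(calculate_bmi(80, 1.80), 80 / 1.80^2)))
stopifnot(is.na(calculate_bmi(NA_real_, 1.80)))
stopifnot(is.na(calculate_bmi(80, 5)))                    # above max_height
stopifnot(!is.na(calculate_bmi(80, 5, max_height = 10)))  # bound customized
```

Exposing the bounds as parameters lets another team reuse the function with
different validation limits without editing its body.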
63 changes: 63 additions & 0 deletions scope-docs/index.qmd
@@ -0,0 +1,63 @@
---
format:
  html:
    embed-resources: true
---

The goal of the recodeflow library is to enable the open, transparent, and
reproducible creation of a study dataset.

One of the first steps in any quantitative study is the selection and creation
of variables that will be used to answer the research question. However, this
process is rarely conducted transparently, resulting in studies that are:

1. **Inefficient**: Research teams' previous work is seldom transferable to new
studies, even if they use the same variables.

2. **Non-reproducible**: The lack of transparency makes it difficult to
reproduce a published study without extensive help from the original
authors. If enough time has passed since publication, even the original
authors may not remember how the study variables were created.

Increasingly, studies use many variables and complex data transformations,
making it difficult to reproduce findings and apply methodologies to new
datasets.

The recodeflow library addresses these issues by:

**1.** Encoding all the recoding rules in a set of CSV sheets that are
transparent and machine-actionable; and

**2.** Providing software to create a study dataset using the CSV sheets.

While the library meets its core objective, user feedback has identified several
issues for improvement, including:

- Users not knowing why a function failed (not enough logging).
- The documentation not catering to different types of users (not using the
Divio style of documentation).
- The function having too many parameters (a complex API).

As well, the next version clarifies and expands the use of metadata within the
data analysis workflow, including:

- better support for using labels in tables and graphs;
- updating the use of 'tagged_NAs' for missing data.

This document contains the scope for the next version of the library, which
aims to address the above and other issues. Explore the scope for different
parts of the library by clicking on one of the links below.

[Metadata](metadata.qmd)

[Labels](labels.qmd)

[Missing data](missing-data.qmd)

[Derived Variables](derived-variables.qmd)

[Logging](logging.qmd)

[Versioning](versioning.qmd)

[Out of Scope](out-of-scope.qmd)