Big-Life-Lab · yulric · Oct 21, 2024 · Apr 17, 2025
diff --git a/.gitignore b/.gitignore
@@ -7,3 +7,7 @@
 inst/doc
 
 *.code-workspace
+.codegpt
+
+# Python
+.env
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -30,9 +30,11 @@ Suggests:
     kableExtra,
     knitr,
     readr,
+    reticulate,
     rmarkdown,
     survival,
     testthat (>= 3.0.0),
     tibble
 Config/testthat/edition: 3
 VignetteBuilder: knitr
+
diff --git a/scope-docs/.gitignore b/scope-docs/.gitignore
@@ -0,0 +1,2 @@
+/.quarto/
+dist
diff --git a/scope-docs/_quarto.yml b/scope-docs/_quarto.yml
@@ -0,0 +1,23 @@
+project:
+  type: website
+  output-dir: ./dist
+
+website:
+  title: v1.0 Scope 
+  sidebar:
+    style: docked
+    search: true
+    contents:
+      - section: Scope
+        contents:
+          - metadata.qmd
+          - labels.qmd
+          - missing-data.qmd
+          - derived-variables.qmd
+          - logging.qmd
+          - versioning.qmd
+      - out-of-scope.qmd
+format:
+  html:
+    theme: cosmo
+    toc: true
diff --git a/scope-docs/derived-variables.qmd b/scope-docs/derived-variables.qmd
@@ -0,0 +1,244 @@
+---
+title: Derived Variables 
+format:
+  html:
+    embed-resources: true
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(
+  echo = FALSE,
+  warning = TRUE,
+  message = TRUE,
+  error = FALSE
+)
+
+# Load required packages
+library(haven)
+library(reticulate)
+library(knitr)
+library(magrittr)
+```
+
+The variable details sheet within the recodeflow library enables a user to
+encode the rules to create a new variable from one or more starting variables.
+Broadly, any rule that maps the values of one variable into another can be
+encoded in the variable details sheet. Examples include rules that map one or
+more categories of a variable into a new category or a rule that maps an
+interval of a continuous variable into a category. Variables that cannot be
+created with simple mapping rules can be recoded by using a function. Examples
+include variables that require mathematical operations and conditional logic.
+
+Any variable that requires a function for its creation is considered a derived
+variable. A good example of this is BMI which requires mathematical operations
+for its calculation and optionally conditional logic to validate its inputs.
+This document will go over the scope and specifications for such derived
+variables and the functions used to derive them.
+
+## Function Usage
+
+Users should be able to use any derived function without needing to install
+recodeflow. The library should not force users to write functions in a way
+that prevents them from being, for example, copy-pasted and used in an R
+environment where recodeflow is not installed. A number of organizations host
+personal data and thus restrict the use of external libraries. Allowing
+derived functions to be used without recodeflow lets analysts within these
+organizations use derived functions defined in flow universe packages like
+[cchsflow](https://big-life-lab.github.io/cchsflow/), which enables
+collaboration.
+
+## Function Language
+
+The library should support only functions written in R. The R ecosystem is
+moving towards allowing its users to execute code from other statistical
+languages. An example is the
+[reticulate](https://rstudio.github.io/reticulate/index.html) package which
+enables the execution of arbitrary Python code within a .R file. For example:
+
+```{r, python-in-R, echo=TRUE}
+# Python code that creates a function to calculate BMI
+BMI_python_code <- "def BMI(weight, height): 
+  return weight/(height*height)
+" 
+# Run reticulate to bring the Python function over to the R environment enabling
+# us to call it using R code
+reticulate_output <- reticulate::py_run_string(BMI_python_code)
+# Calculate BMI using the created Python function
+BMI_python <- reticulate_output$BMI(25, 5)
+# Expect the calculated BMI to be 1
+stopifnot(BMI_python == 1)
+print(paste("Calculated BMI:", BMI_python))
+```
+
+To reduce the library scope, currently only R functions will be allowed.
+
+## Function Inputs and Outputs
+
+The function inputs and output should be standardized to accept either scalar or
+vector values.
+
+For example, a function that calculates BMI will take as input height and
+weight. The library should decide whether the inputs are scalar values, for
+example `80` and `180`, or vector values, for example `c(80)` and
+`c(180)`.
+
+This decision will have implications on the performance, user experience, and
+portability of the function as described below.
+
+R, being a statistical language, is optimized to run on vectors. However, a
+function accepting scalar inputs can be easily vectorized using the
+[Vectorize](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Vectorize.html)
+function available in base R.
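+
+As a minimal sketch of that approach (the function name `bmi_scalar` and the
+input values are illustrative and not part of recodeflow):
+
+```{r, vectorize-sketch, echo=TRUE}
+# Hypothetical scalar function: expects one weight (kg) and one height (m)
+bmi_scalar <- function(weight, height) {
+  if (is.na(weight) || is.na(height)) {
+    return(NA_real_)
+  }
+  weight / (height * height)
+}
+
+# Vectorize() wraps the scalar function so it is applied element-wise
+bmi_vectorized <- Vectorize(bmi_scalar)
+
+# Each weight is paired with the height at the same position
+bmi_vectorized(c(80, 45, NA), c(2, 1.5, 1.8))
+```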
+
+The code below displays a table that compares the performance of calculating BMI
+using:
+
+1.  Scalar inputs
+2.  Vector inputs
+3.  A vectorized version of 1 created using the `Vectorize` function
+
+```{r}
+scalar_BMI <- function(weight, height) {
+  # Scalar inputs: return NA early when either value is missing
+  if (is.na(weight) || is.na(height)) {
+    return(NA_real_)
+  }
+  weight / (height * height)
+}
+
+vector_BMI <- function(weight, height) {
+  ifelse(
+    is.na(weight) | is.na(height),
+    NA,
+    weight / (height * height)
+  )
+}
+
+vectorized_BMI <- Vectorize(scalar_BMI)
+
+# Create larger test dataset for better comparison
+test_data <- data.frame(
+  weight = sample(1:100, 1000, replace = TRUE),
+  height = sample(1:100, 1000, replace = TRUE)
+)
+
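+# In milliseconds
+get_func_perf <- function(func) {
+  start_time <- Sys.time()
+  func()
+  end_time <- Sys.time()
+  # Fix the units to seconds; a bare difftime picks its units automatically
+  run_time <- difftime(end_time, start_time, units = "secs")
+  return(as.numeric(run_time) * 1000)
+}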
+
+# Original methods
+scalar_run_time <- get_func_perf(function() {
+  bmi <- c()
+  for(i in 1:nrow(test_data)) {
+    bmi <- c(bmi, scalar_BMI(test_data[i, "weight"], test_data[i, "height"]))
+  }
+})
+
+vector_run_time <- get_func_perf(function() {
+  vector_BMI(test_data$weight, test_data$height)
+})
+
+vectorized_run_time <- get_func_perf(function() {
+  vectorized_BMI(test_data$weight, test_data$height)
+})
+
+# Dplyr methods - testing only vector_BMI and vectorized_BMI
+dplyr_vector_run_time <- get_func_perf(function() {
+  test_data %>% 
+    dplyr::mutate(bmi = vector_BMI(weight, height))
+})
+
+dplyr_vectorized_run_time <- get_func_perf(function() {
+  test_data %>% 
+    dplyr::mutate(bmi = vectorized_BMI(weight, height))
+})
+
+calculate_percent_diff <- function(num1, num2) {
+  return((abs(num1 - num2)/((num1 + num2)/2))*100)
+}
+
+# Enhanced results table with all valid methods
+run_time_table <- data.frame(
+  "Type" = c("Scalar", 
+             "Vector", 
+             "Vectorized", 
+             "Dplyr (vector_BMI)",
+             "Dplyr (vectorized_BMI)"),
+  "Run Time in ms" = c(scalar_run_time, 
+                      vector_run_time, 
+                      vectorized_run_time, 
+                      dplyr_vector_run_time,
+                      dplyr_vectorized_run_time),
+  "Percent Difference" = c(
+    "Reference", 
+    calculate_percent_diff(scalar_run_time, vector_run_time),
+    calculate_percent_diff(scalar_run_time, vectorized_run_time),
+    calculate_percent_diff(scalar_run_time, dplyr_vector_run_time),
+    calculate_percent_diff(scalar_run_time, dplyr_vectorized_run_time)
+  ),
+  check.names = FALSE # keep the spaces in the column headers
+)
+
+knitr::kable(run_time_table)
+```
+
+In summary, ordered from slowest to fastest, the approaches are scalar,
+vectorized, and vector.
+
+Ergonomically, R users may be more comfortable with vectors since
+datasets are usually in data.frames and individual columns can be extracted as
+vectors using the \$ operator. Users of non-statistical languages will be more
+comfortable with scalars since they usually deal with scalar values.
+
+Scalar values are more easily portable to non-statistical languages, which
+becomes a concern if the recoding rules need to run, for example, in a web
+browser where JavaScript is the programming language of choice.
+
+## Supported Operations
+
+The library should allow the user to use any R code within the function, for
+example code that uses external libraries or sources other R files.
+
+## Tagged NA
+
+The need for tagged NA values has been described elsewhere. An issue with
+derived functions is that users can easily return NA values that are not tagged.
+The library should provide appropriate warnings to users when an NA value is not
+tagged or tagged with an unrecognized value. For example:
+
+```{r}
+BMI_non_tagged_NA <- function(weight, height) {
+  if (is.na(weight) || is.na(height)) {
+    return(NA)
+  }
+  # Parentheses are required: weight/height*height would evaluate to weight
+  return(weight / (height * height))
+}
+
+BMI_tagged_NA <- function(weight, height) {
+  if (is.na(weight) || is.na(height)) {
+    return(haven::tagged_na("a"))
+  }
+  return(weight / (height * height))
+}
+```
+
+The `BMI_non_tagged_NA` function, when run by the library, should warn the user
+that they have returned an NA value that is not tagged. The `BMI_tagged_NA`
+function should pass with no issues since it returns a tagged NA value.
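+
+A minimal sketch of such a check is shown below. The helper name
+`check_tagged_na` and the set of recognized tags are assumptions made for
+illustration, not an existing recodeflow API. The sketch relies on
+`haven::is_tagged_na()` and `haven::na_tag()`.
+
+```{r, tagged-na-check-sketch, echo=TRUE}
+# Hypothetical helper: warns when a derived function returns NA values that
+# are untagged or carry a tag outside `known_tags` (an assumed tag set)
+check_tagged_na <- function(values, known_tags = c("a", "b", "c")) {
+  values <- as.numeric(values) # haven's tag helpers work on double vectors
+  tagged <- haven::is_tagged_na(values)
+  untagged <- is.na(values) & !tagged
+  unknown_tag <- tagged & !(haven::na_tag(values) %in% known_tags)
+
+  if (any(untagged)) {
+    warning(sum(untagged), " NA value(s) are not tagged")
+  }
+  if (any(unknown_tag)) {
+    warning(sum(unknown_tag), " NA value(s) have an unrecognized tag")
+  }
+  # TRUE only when every NA carries a recognized tag
+  invisible(!any(untagged | unknown_tag))
+}
+
+check_tagged_na(c(22.5, haven::tagged_na("a"))) # passes silently
+check_tagged_na(c(22.5, NA))                    # warns: untagged NA
+check_tagged_na(haven::tagged_na("z"))          # warns: unrecognized tag
+```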
+
+## Best Practices
+
+Keeping in mind that users will be writing derived functions on their own, the
+library should provide guidance on how to write good functions. This can include
+but is not limited to:
+
+-   Properly documenting the function behaviour, inputs, and outputs
+-   Providing examples of running the function using different input types like
+    scalars, vectors, data frames etc.
+-   Providing test cases for the functions
+-   Handling invalid input values and making validations like mins and maxes
+    easily customizable by other users by including them as function parameters
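+
+As an illustration of these points, a derived function might look like the
+sketch below. The function name, the default bounds, and the tag `"b"` are
+hypothetical choices for this example, not recodeflow conventions.
+
+```{r, best-practice-sketch, echo=TRUE}
+#' Calculate body mass index (BMI)
+#'
+#' @param weight Numeric vector of weights in kilograms.
+#' @param height Numeric vector of heights in metres.
+#' @param min_height,max_height Valid height range in metres; values outside
+#'   it are treated as invalid. Exposed as parameters so they can be changed.
+#' @return Numeric vector of BMI values, with a tagged NA for missing or
+#'   out-of-range inputs.
+#' @examples
+#' bmi_validated(80, 2)
+#' bmi_validated(c(80, 45), c(2, 1.5))
+bmi_validated <- function(weight, height, min_height = 0.5, max_height = 2.5) {
+  invalid <- is.na(weight) | is.na(height) |
+    height < min_height | height > max_height
+  bmi <- weight / (height * height)
+  bmi[invalid] <- haven::tagged_na("b") # hypothetical tag for invalid input
+  bmi
+}
+
+# Simple test cases covering scalar, vector, and invalid inputs
+stopifnot(bmi_validated(80, 2) == 20)
+stopifnot(all(bmi_validated(c(80, 45), c(2, 1.5)) == c(20, 20)))
+stopifnot(haven::is_tagged_na(bmi_validated(80, 10), "b"))
+```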
diff --git a/scope-docs/index.qmd b/scope-docs/index.qmd
@@ -0,0 +1,63 @@
+---
+format:
+    html:
+        embed-resources: true
+---
+
+The goal of the recodeflow library is to enable the open, transparent, and
+reproducible creation of a study dataset.
+
+One of the first steps in any quantitative study is the selection and creation
+of variables that will be used to answer the research question. However, this
+process is rarely conducted transparently, resulting in studies that are:
+
+1.  **Inefficient**: Research teams' previous work is seldom transferable to new
+    studies, even if they use the same variables.
+
+2.  **Non-reproducible**: The lack of transparency makes it difficult to
+    reproduce a published study without extensive help from the original
+    authors. If enough time has passed since publication, even the original
+    authors may not remember how the study variables were created.
+
+Increasingly, studies use many variables and complex data transformations,
+making it difficult to reproduce findings and apply methodologies to new
+datasets.
+
+The recodeflow library addresses these issues by:
+
+**1.** Encoding all the recoding rules in a set of CSV sheets that are
+transparent and machine-actionable; and
+
+**2.** Providing software to create a study dataset using the CSV sheets.
+
+While the library meets its core objective, user feedback has identified several
+issues for improvement, including:
+
+- Users not knowing why a function failure has taken place (not enough logging).
+- The documentation not catering to different types of users (not using the
+  Divio style of documentation).
+- The function having too many parameters (complex API).
+
+In addition, the next version clarifies and expands the use of metadata within
+the data analysis workflow, including:
+
+- better support for using labels in tables and graphs;
+- updating the use of 'tagged_NAs' for missing data.
+
+This document contains the scope for the next version of the library that aims
+to address the above and other issues. Explore the scope for different parts of
+the library by clicking on one of the links below.
+
+[Metadata](metadata.qmd)
+
+[Labels](labels.qmd)
+
+[Missing data](missing-data.qmd)
+
+[Derived Variables](derived-variables.qmd)
+
+[Logging](logging.qmd)
+
+[Versioning](versioning.qmd)
+
+[Out of Scope](out-of-scope.qmd)