Benjamin's Initial Evaluation Brain Dump #133

nutterb · 2016-09-18T03:34:35Z

I apologize for the length but I wanted to make this available for review and discussion. I thought it would be good to get some feedback on ideas before I do any code work. i can break these up into separate issues when there's a specific aspect needing focus.

Also, be honest and direct in your responses. Part of this process will be transitioning me away from my redcapAPI mindset to the REDCapR mindset. I promise not to have hurt feelings if you don't want to incorporate anything in this list.

Observed Structure

Stand-alone functions

checkbox_choices
regex_named_captures
redcap_column_sanitize
replace_nas_with_explicit

The API Calls

This is the code I used to plot out how the existing API calls interact with each other. The blue boxes are my proposals for new functions. A child node indicates a function that is called within its parent.

    library(HydeNet)

    redcap <- 
      HydeNetwork(
        ~ redcap_metadata_read  + 
          redcap_read + 
          redcap_write + 
          redcap_download_file_oneshot + 
          redcap_upload_file_oneshot + 
          sanitize_token | redcap_metadata_read * 
                           redcap_read * 
                           redcap_write * 
                           redcap_download_file_oneshot * 
                           redcap_upload_file_oneshot +
          api_call | redcap_metadata_read * 
                           redcap_read * 
                           redcap_write * 
                           redcap_download_file_oneshot * 
                           redcap_upload_file_oneshot + 
          redcap_read_oneshot | redcap_read + 
          redcap_read_batch | redcap_read + 
          create_batch_glossary | redcap_read_batch *
                                  redcap_write_batch + 
          format_redcap_data | redcap_read + 
          redcap_write_oneshot | redcap_write +
          redcap_write_batch | redcap_write
      ) %>%
      setDecisionNodes("sanitize_token",
                       "format_redcap_data",
                       "api_call",
                       "redcap_read_batch",
                       "redcap_write_batch")

    plot(redcap)

*Less complex Systems *

populate_project_simple calls redcap_project
retrieve_token_mssql calls retrieve_credential_mssql
retrieve_credential_local
validate_no_logical and validate_no_uppercase call
validate_for_write

New functions

`sanitize_token`

As of 6.9.5,

The API is now more strict with regard to the validation of API tokens
sent in API requests. In previous versions, if the token was longer
than 32 characters, it would truncate the token to 32 characters
(which is the expected length). It no longer truncates the token if
longer than expected but merely returns an error message."

It wouldn't be hard to write a function to sanitize the token. It would
look something like:

sanitize_token <- function(token){
  token <- trimws(token)
  token <- substr(token, 1, 32)
  token
}

This has the same effect as the code introduce through Issue #103
(#103), and has the added
benefit of providing consistency of tokens into earlier version of
REDCap.

`api_call`

It might seem silly, but if the httr interface ever undergoes a
change, or if there's ever a decision to change the package making the
web call, putting all of the POST calls into one function means only
have to change that one function, instead of finding all of the calls in
the package.

`redcap_read_batch`

Would it simplify the interface for the user to have redcap_read be
the driver for exporting data, and to make it a wrapper for
redcap_read_oneshot and a new function redcap_read_batch?
redcapAPI::exportRecords uses batch.size = -1 to indicate reading in
one shot. Something similar could be done to direct traffic between the
subfunctions without affecting backward compatibility (though
redcap_read_oneshot would need to continue to be exported.)

`redcap_write_batch`

Similar concept to redcap_read_batch. Just make it one function to call for the user.

New Features

Data formatting: The meta data provide enough information to convert
fields to factors, Date, or POSIXct objects, as appropriate. I
propose a logical format_data argument to redcap_read that
formats the data (similar to how redcapAPI formats) when TRUE. A
new function format_data would be the engine to do this work.
Allow the meta data to be passed as an object to redcap_read and
redcap_write. redcapAPI has a meta_data = NULL argument that
allows the user to pass in the meta data data.frame if it exists. If
given, there is no need to download the meta data again. If NULL,
the meta data is downloaded. Not strictly necessary, but can save a
bit of time. I'd consider this pretty low priority, because it is a
feature that can be added without affect back compatibility.
New methods for exporting other data objects, such as the events
table, the arms table, the mappings table. I don't know that these
get used much, but they can be handy for formatting event names
and such.
Provide internal validations prior to writing data. Honestly, I have
mixed feelings about this. I put a lot of work into doing this in
redcapAPI, but the REDCap validations return some pretty good
messages as well.

Code Improvements

Sections of code like

events_collapsed <- ifelse(is.null(events), "", paste0(events, collapse=","))

can be optimized with

events_collapsed <- if (is.null(events)) "" else paste0(events, collapse = ",")

It isn't a big deal, (see benchmarking below, but as a package matures,
I think it's a good thing to incorporate some efficiency gains)

library(microbenchmark)
library(magrittr)
records <- NULL
microbenchmark(
  ifelse = ifelse(is.null(records), "", paste0(records, collapse=",")),
  strict_if = if (is.null(records)) "" else paste0(records, collapse = "")
)

## Warning in microbenchmark(ifelse = ifelse(is.null(records), "",
## paste0(records, : Could not measure a positive execution time for 6
## evaluations.

## Unit: nanoseconds
##       expr  min   lq    mean median   uq   max neval cld
##     ifelse 1155 1540 1991.12   1541 1926 11934   100   b
##  strict_if    0    1  397.44      1    1 26948   100  a

Other Notes

Since we're using RODBCext to manage credentials, it wouldn't be
hard to write up instructions for acessing credentials from any
SQL database. Users would just need to know their driver. We could
probably make an adaptation that pulls from a CSV using the
sqldf package. It would make a consistent style of management that
would encourage best practices, even for users who don't have the
benefit of SQL databases.
We should do some benchmarking to see if an non-sparse eav of a
large data file is faster than the csv export. How large does a
database have to be with the eav format before the server times
out? If we find the eav to be faster, perhaps we could up the
batch number from 100 to something larger (fewer calls to the API
would mean less wait time for the user, but I suspect you would want
to continue batching to make the server available to other users.
Upping the default batch number seems like a good compromise.)

I should also add that at one point I had compiled a list of the API changes announced on the group. I was making notes on what changes I might incorporate into redcapAPI and how those changes could affect the R package. The link to the spreadsheet is
https://docs.google.com/spreadsheets/d/1NMdpb-k5nvrVxF0gvnpIfQyP4vZIpQgUxoLUKJuv2L0/edit?usp=sharing

The text was updated successfully, but these errors were encountered:

wibeasley · 2017-03-21T20:51:23Z

@nutterb, sorry for not responding to your thoughtful post until now. I've read it several times since September. In short, I like your ideas and think they're all worth discussing. Some of these are large/mature/encapsulated enough they're spun off into entire issues.

Some specific reactions.

New Functions

redcap_read_batch() & redcap_write_batch(). See Read EAV - w/ batching #145. If I use your approach of batching within the function, your advice makes even more sense.
api_call my thought is that would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like redcap_read() encounters the error (than if api_call()).

But I certainly agree that there's a lot of duplication across functions like redcap-metadata-read.R() and redcap-read.R(). I'm mildly convinced that I'm on the right side of the tradeoffs, but certainly could be persuaded otherwise.
I'm phasing out retrieve_token_zzz() functions with retrieve_credential_zzz calls, so that removes a little complexity, and I think becomes aligned with your nested advice. The credential (as opposed to token) retrieval is more general and should work in a wider variety of scenarios.
I like your sanitize_token(). See new issue sanitize_token() #144.

New Features

Data formatting & additional metadata: Where would this metadata be stored?
- In REDCap annotation field (ie, the big dumb box)?
- Passed by the programmer, similar to readr's col()'
- Passed by the programmer, reusing readr's col()' (and then using readr's text-to-tbl functionality?)
Spin this off into a new issue if you'd like.
Other API calls, like events. Sounds good.
internal validations before writing from the client to REDCap? Can you point to examples?

Code Improvements

As long as the test suite has appropriate coverage of these areas, and doesn't break, I'm all for it. I had no idea ifelse() was so inferior to the approach you benchmarked. Any idea if dplyr::if_else() is as slow? I'll check it out if you haven't.

I think the lowest hanging fruit could still be associated with structuring network calls so they're less chatty & wasteful. But I don't have any ideas beyond EAV; and even that hits a limit and needs batching.

Other Notes

I agree with your two points. So far the EAV playground has been beating the flat export almost every time, even with non-sparse data. But I'd love to know more and describe the performance envelope to users.

Thanks again for all the thought you put into this, @nutterb.

nutterb · 2017-04-01T02:29:49Z

api_call my thought is that would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like redcap_read() encounters the error (than if api_call()).

I think this is something that the checkmate package handles well. I've been using it extensively in another package. It uses an assertCollection object that can be passed between functions. The assertions can be reported from the main function and it will appear to the user that all of the errors are coming from the function they executed.

A crude, silly example is below. A less trivial example is the interaction between the sprinkle_bg function and the index_to_sprinkle

had no idea ifelse() was so inferior to the approach you benchmarked. Any idea if dplyr::if_else() is as slow? I'll check it out if you haven't.

The if and else approach is faster, but usually only advantageous when dealing with a single logical value. ifelse keeps copies of both the yes and no elements, does a bunch of error checking, and a number of other things that make it very useful for vectorized conditionals, but quite a bit slower. dplyr::if_else has the same limitations. Compare the results of the second code block at the bottom.

But as you mentioned, these kinds of optimizations are not going to make a big difference to the user. optimizing the calls to the API will help a lot, as will optimizing the construction of data sets from batches. This is just one of those things that I change when I see them.

passing error messages with `checkmate`

main_function <- function(x, y, z)
{
  coll <- checkmate::makeAssertCollection()
  
  checkmate::assert_numeric(x = x,
                            len = 1,
                            add = coll)
  
  x_add <- sub_function1(x, y, coll)
  
  res <- sub_function2(x_add, z, coll)
  
  checkmate::reportAssertions(coll)
  
  res
}


sub_function1 <- function(x, y, coll)
{
  
  checkmate::assert_numeric(x = y,
                            len = 1,
                            add = coll,
                            .var.name = "y")
  
  x + y
  
}


sub_function2 <- function(x, z, coll)
{
  checkmate::assert_character(x = z,
                              len = 1,
                              add = coll,
                              .var.name = "z")
  
  paste0(round(x, 2), " ", z)
}

main_function(1, 2, "units")

main_function(1, 2, c("km", "m"))

If else benchmarking

 records <- NULL
microbenchmark(
  ifelse = ifelse(is.null(records), "", paste0(records, collapse=",")),
  if_else = if_else(is.null(records), "", paste0(records, collapse=",")),
  strict_if = if (is.null(records)) "" else paste0(records, collapse = ",")
)

x <- factor(sample(letters[1:5], 10, replace = TRUE))
microbenchmark(
  base = ifelse(x %in% c("a", "b", "c"), x, factor(NA)),
  dplyr = if_else(x %in% c("a", "b", "c"), x, factor(NA))
)

This was referenced Mar 21, 2017

sanitize_token() #144

Closed

Read EAV - w/ batching #145

Open

wibeasley mentioned this issue Jun 30, 2017

include new assert functions OuhscBbmc/OuhscMunge#21

Closed

7 tasks

wibeasley mentioned this issue Jan 7, 2018

read reports #182

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benjamin's Initial Evaluation Brain Dump #133

Benjamin's Initial Evaluation Brain Dump #133

nutterb commented Sep 18, 2016 •

edited

Loading

wibeasley commented Mar 21, 2017

nutterb commented Apr 1, 2017

Benjamin's Initial Evaluation Brain Dump #133

Benjamin's Initial Evaluation Brain Dump #133

Comments

nutterb commented Sep 18, 2016 • edited Loading

Observed Structure

Stand-alone functions

The API Calls

New functions

sanitize_token

api_call

redcap_read_batch

redcap_write_batch

New Features

Code Improvements

Other Notes

wibeasley commented Mar 21, 2017

New Functions

New Features

Code Improvements

Other Notes

nutterb commented Apr 1, 2017

passing error messages with checkmate

If else benchmarking

nutterb commented Sep 18, 2016 •

edited

Loading

`sanitize_token`

`api_call`

`redcap_read_batch`

`redcap_write_batch`

passing error messages with `checkmate`