Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AcTOR query and img function #247

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
165 changes: 165 additions & 0 deletions R/actor.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
#' Retrieve data from AcTOR
#'
#' @description The function allows to search the the
#' Aggregated Computational Toxicology Online Resource (AcTOR), which is the
#' warehouse for different US EPA web applications: \url{https://actor.epa.gov}.
#' Only the top most parameters (e.g. dsstox id, inchi) are queried because
#' more advanced parameters (e.g. Hazard, Acute Toxicity, Use) only represent
#' a loose collection of un-structure data. It is recommended to look these up
#' manually. Entries can only be queried by a valid CASNR.
#'
#' @import httr xml2
#'
#' @param query character; search term.
#' @param from character; type of input. Only "cas".
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAS are the only parameters to be used in the query AFAIK.

#' @param verbose logical; should a verbose output be printed on the console?
#'
#' @references \url{https://actor.epa.gov}
#' @author Andreas Scharmueller, \email{andschar@@protonmail.com}
#' @return data.frame tibble
#' @export
#' @examples
#' \donttest{
#' # might fail if web site is not available
#' actor("1071-83-6")
#'
#' # multiple inputs
#' comp <- c("1071-83-6", "50-00-0")
#' actor(comp)
#'
#' }
#'
actor <- function(query,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit like a get_*() function, but it retrieves no actual AcTOR-ID (there is none). Hence I sticked to the get_*() style, apart from the function name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this more like a function we would call after get_*()? There are many common endings, e.g. *_query(), *_compinfo(), *_prop(), *_convert(). Apart from molecular weight the rest are IDs, so I assume this function would mostly be used for ID conversions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. It's probably better to name it get_actor(). Ok? I wouldn't name it get_actorid() since there is no AcTOR id.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think get_*() style should be used, because there is no ACToR specific ID as you said. I was thinking more like actor_query(), or actor_compinfo() etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, misunderstood you. I'm fine with actor_query() too.

from = "cas",
verbose = TRUE) {
# checking
from <- match.arg(from)
foo <- function(query,
from,
verbose) {
# url
baseurl <- "https://actor.epa.gov/actor/chemical.xhtml"
prolog <- "?casrn="
qurl <- paste0(baseurl, prolog, query)
# query
if (verbose)
message("Querying: ", query)
site <- try(read_html(qurl))
if (inherits(site, "try-error")) {
message("Error. Returning NA.")
out <- data.frame(query = query,
stringsAsFactors = FALSE)
}
# prepare
chemical_name <- trimws(xml_text(xml_nodes(site, ".chemicalNameFont")))
cas_dsstox <- xml_nodes(site, "#dsstoxSubstanceIdContainerId")
cas <- trimws(xml_text(xml_node(cas_dsstox, "#casrnId")))
dsstox <- trimws(xml_text(xml_child(cas_dsstox[[1]], 3))) # error prone
inchi <- trimws(xml_text(xml_node(site, "#inchiContainerId")))
inchi <- trimws(sub("InChi: InChI=", "", inchi, fixed = TRUE))
inchikey <- trimws(xml_text(xml_node(site, "#inchiKeyContainerId")))
inchikey <- trimws(sub("InChi Key:", "", inchikey))
formula <- trimws(xml_text(xml_node(site, "#molFormulaContainerId")))
formula <- trimws(sub("Molecular Formula:", "", formula))
molecularweight <- trimws(xml_text(xml_node(site, "#molWeightContainerId")))
molecularweight <- trimws(sub("Molecular Weight:", "", molecularweight))
Comment on lines +54 to +65
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function only retrieves basic parameters named here. AcTOR however, contains a lot more data, though in very unstructured formats that are imho not easily parsable (links to documents, various data formats etc.).
Yet, I think it's quite useful to retrieve nice common names and the DSSTOX-ID which is used in the Comptox apllication as an identifier.

# out
if (cas == "") {
# result check
message("No result found. Retuning NA.")
out <- data.frame(query = query,
stringsAsFactors = FALSE)
} else {
out <- data.frame(query = query,
chemical_name = chemical_name,
cas = cas[1],
dsstox = dsstox[1],
inchi = inchi[1],
inchikey = inchikey[1],
formula = formula[1],
molecularweight = molecularweight[1],
stringsAsFactors = FALSE)
}
}
l <- lapply(query, foo, from = from, verbose = verbose)
dplyr::bind_rows(l)
}

#' Retrieve chemical structure images from AcTOR
#'
#' @description The function to retrieve images from the
#' Aggregated Computational Toxicology Online Resource (AcTOR), which is the
#' warehouse for different US EPA web applications: \url{https://actor.epa.gov}.
#'
#' @import curl
#'
#' @param query character; search term.
#' @param from character; type of input. Only "cas".
#' @param dir character; Directory to store the image.
#' @param format character; Image format. Can be on of ("svg", "png", "jpeg").
#' @param width integer; Image width in pixels.
#' @param height integer; Image height in pixels.
#' @param verbose logical; should a verbose output be printed on the console?
#'
#' @references \url{https://actor.epa.gov}
#' @author Andreas Scharmueller, \email{andschar@@protonmail.com}
#' @return disk file
#' @export
#' @examples
#' \donttest{
#' # might fail if web site is not available
#' actor_img("1071-83-6")
#'
#' # multiple inputs
#' comp <- c("1071-83-6", "50-00-0")
#' actor_img(comp)
#' }
#'
actor_img = function(query,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added this functionality because it was possible. So tell me what you think, whether we need such image functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Images have been mentioned recently in Issue #132 and PR #235. I think there is general agreement that images would add a lot to webchem. I will open a separate issue for images so we can discuss the design of these functions so we wouldn't have to discuss consistency later:)

Copy link
Contributor

@stitam stitam May 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opened Issue #249 for this discussion.

from = "cas",
dir = NULL,
format = c("svg", "png", "jpeg"),
width = 400,
height = 400,
verbose = TRUE) {
# debuging
# query = '1071-83-6'; from = "cas"; format = "svg";
# width = 400; height = 400; dir = '/tmp'; verbose = TRUE
# checking
if (is.null(dir))
stop('Please provide a dir.')
format <- match.arg(format)
foo <- function(query,
from,
dir,
format,
width,
height,
verbose) {
# url
baseurl <- "https://actorws.epa.gov/actorws/chemical/image"
prolog <- "?casrn="
qurl <- paste0(baseurl, prolog, query)
fmt <- paste0("&fmt=", format)
width <- paste0("&width=", width)
height <- paste0("&height=", height)
qurl <- paste0(qurl, width, height, fmt)
# query
if (verbose)
message("Querying: ", query)
fl <- paste0(query, ".", format)
Sys.sleep(rgamma(1, shape = 5, scale = 1/10))
tr <- try(
suppressWarnings(
curl::curl_download(qurl, destfile = file.path(dir, fl), quiet = TRUE)
),
silent = TRUE)
if (inherits(tr, 'try-error'))
message("CAS not found. No file written.")
}
lapply(query, foo,
from = from, dir = dir, format = format,
width = width, height = height,
verbose = verbose)
}

2 changes: 1 addition & 1 deletion R/alanwood.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
#' @import xml2
#' @importFrom stats rgamma
#'
#' @param query character; search string
#' @param query character; search string
#' @param type character; type of input ('cas' or 'commonname')
#' @param verbose logical; print message during processing to console?
#' @param force_build logical; force building a new index? See
Expand Down