data.world #133
Thanks a lot for your submission @rflprr ! What are the scientific applications of these datasets? Do you have any examples to help us assess fit?
Hi, @maelle. data.world is an open and community-driven platform. We have built a general-purpose tool that serves researchers increasingly well (we are still early stage, but quickly adding functionality to aid reproducibility, for example). Searching the platform you can find several users in academia. Not all their datasets are public, so it's hard to give you an idea of the breadth of scientific applications. Here is one example of a user who is a PhD student and who has created a few public datasets: https://data.world/jordanageorge?relationship=owner. I hope this helps.
Thanks @rflprr !
Doing my editor checks now, sorry for the delay ...
Hi @rflprr, what was the reasoning for splitting out `dwapi`?
Hi, @sckott. Our plan is to provide wrappers for all our API endpoints. That's an ever-growing list of low-level operations that would pollute the namespace of our main library, which is intended to always expose a very limited set of high-level and easy-to-use operations. In short, we separated them so there is a clearer separation of concerns and more cohesive namespaces, to better serve what we believe are two distinct groups of users: 1) those who need access to low-level APIs and value flexibility, and 2) the vast majority who only care about a few high-level operations and value ease of use.
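To illustrate the intended split, here is a rough sketch; the dataset URL and table name are placeholders, and the calls shown are the ones exercised later in this thread:

```r
# Hypothetical dataset URL, used only for illustration
intro_ds <- "https://data.world/jonloyens/an-intro-to-dataworld-dataset"

# High-level, interactive use via the data.world package:
# build a query object and run it against a dataset
stats <- data.world::query(
  data.world::qry_sql("SELECT * FROM DataDotWorldBBallStats"),
  intro_ds
)

# Low-level use via dwapi: call an individual API endpoint wrapper directly
raw <- dwapi::sparql(
  dataset = intro_ds,
  query = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
)
```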
@rflprr Okay. It seems the same goal could be achieved while keeping the lower-level methods in this package, just not exported to the user. But if there are use cases for a certain kind of user (a power user) to use just the low-level methods in the other package, that makes sense. I'd like to have reviewers at least take a brief look at `dwapi` as well.
@sckott, reviewing `dwapi` sounds good.
Editor checks:
Editor comments

Thanks for your submission @rflprr ! Currently seeking reviewers. It's a good fit and not overlapping.

For `data.world`:
It is good practice to
✖ write unit tests for all functions, and all package code
in general. 60% of code lines are covered by test cases.
R/data.world-defunct.R:65:NA
R/data.world-defunct.R:67:NA
R/data.world-defunct.R:68:NA
R/data.world-defunct.R:70:NA
R/data.world-defunct.R:71:NA
... and 55 more lines

For `dwapi`:

It is good practice to
✖ write unit tests for all functions, and all package code
in general. 79% of code lines are covered by test cases.
R/add_file.R:65:NA
R/add_file.R:95:NA
R/add_file.R:97:NA
R/add_file.R:98:NA
R/add_file.R:99:NA
... and 162 more lines
✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...),
1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
result 1:0 if the expression on the right hand side is zero. Use
seq_len() or seq_along() instead.
R/dataset_summary_response.R:41:15
R/table_schema_response.R:28:15
✖ fix this R CMD check NOTE: Namespace in Imports field not
imported from: ‘xml2’ All declared Imports should be used.
✖ fix this R CMD check NOTE: Found the following hidden
files and directories: .lintr These were most likely included in
error. See section ‘Package structure’ in the ‘Writing R
Extensions’ manual.

Reviewers: @MarkEdmondson1234 @nistara
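As a quick aside on the `seq_len()` point in the goodpractice output above, here is a minimal illustration (hypothetical vector `x`):

```r
x <- integer(0)    # a zero-length input

1:length(x)        # gives 1 0, so a for loop over this would still run twice
seq_len(length(x)) # gives integer(0), so no iterations
seq_along(x)       # same result, and clearer when indexing along x
```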
Thanks for agreeing to review @MarkEdmondson1234 & @nistara !!! The due date is 2017-08-23; let me know if you have any questions about any part of the process.
Package Review - data.world-r (dwapi-r)

Summary

Overall the package is a nice addition, allowing access to a crowd-sourced variety of data that would be useful in many projects. The richness of the datasets will depend largely on how popular the website becomes. The line between `data.world` and `dwapi` comes up again in the comments below.

Review template
Documentation

The package includes all the following forms of documentation:
I could install with no problems following the README instructions. The "Hello World" example worked quickly and well.
The vignettes:

quickstart

I followed the first vignette as suggested. The installation instructions offer three options (config file, environment variables, runtime), but only include an example for one (via the config file). The start guide states not to keep your API token in code, but then only gives examples where it is in code. I manually set up an environment variable in my .Renviron and did it like this:

saved_cfg <- data.world::save_config(Sys.getenv("DATADOTWORLD"))
data.world::set_config(saved_cfg)

...but it's not clear if this is a good way to handle it. Now that I inspect the code, I can see what it expects. The API token is also very long; is this necessary? Compare the GitHub personal access token and this package's:

nchar(Sys.getenv("DATADOTWORLD"))
# [1] 340
nchar(Sys.getenv("GITHUB_PAT"))
# [1] 40

query

This vignette went into a lot more detail on how to query data, with some good links to further documentation. All the SQL examples worked, but the SPARQL examples did not:

assists_vs_height <- data.world::qry_sparql(paste(
"BASE <http://data.world/jonloyens/an-intro-to-dataworld-dataset/> ",
"PREFIX t: <DataDotWorldBBallTeam.csv/DataDotWorldBBallTeam#> ",
"PREFIX s: <DataDotWorldBBallStats.csv/DataDotWorldBBallStats#> ",
"SELECT ?name ?height ?assists WHERE { ",
" ?pt t:Name ?name . ",
" ?ps s:Name ?name . ", # the join column
" ?pt t:Height ?height . ",
" ?ps s:AssistsPerGame ?assists . ",
"} ",
"ORDER BY DESC (?assists)", sep = "\n"
))
data.world::query(assists_vs_height, intro_ds)
# A tibble: 0 x 3
# ... with 3 variables: name <chr>, height <chr>, assists <chr>

And the parameterized example errored:

assists_greater_than <- data.world::qry_sparql(paste(
"BASE <http://data.world/jonloyens/an-intro-to-dataworld-dataset/> ",
"PREFIX t: <DataDotWorldBBallTeam.csv/DataDotWorldBBallTeam#> ",
"PREFIX s: <DataDotWorldBBallStats.csv/DataDotWorldBBallStats#> ",
"SELECT ?name ?height ?assists WHERE { ",
" ?pt t:Name ?name . ",
" ?ps s:Name ?name . ", # the join column
" ?pt t:Height ?height . ",
" ?ps s:AssistsPerGame ?assists . ",
" FILTER(?assists > $v1) ",
"} ",
"ORDER BY DESC (?assists)", sep = "\n"
))
assists_greater_than$params <- c("$v1" = 10)
data.world::query(assists_greater_than, intro_ds)
# No encoding supplied: defaulting to UTF-8.
# Error in dwapi::sparql(dataset = dataset, query = qry$query_string, query_params = qry$params) :
# Error: Unbound variable used in filter: $v1 (line 9, column 21)
There was a helpful list of defunct functions; I should do that for my packages too :) There was also a helpful intro help file.
The most important functions in the package (as judged by the vignette mentions) do not have examples in their help files yet, namely `query()`, `qry_sql()` and `qry_sparql()`.
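For instance, a minimal roxygen `@examples` block for `query()` might look like the sketch below (the dataset URL and table name come from the vignette and are only illustrative; wrapped in `\dontrun{}` because it needs an API token):

```r
#' @examples
#' \dontrun{
#' intro_ds <- "https://data.world/jonloyens/an-intro-to-dataworld-dataset"
#' bball <- query(qry_sql("SELECT * FROM DataDotWorldBBallStats"), intro_ds)
#' head(bball)
#' }
```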
There are no contributing guidelines yet. The DESCRIPTION has a URL and BugReports pointing to GitHub, as well as the authors/maintainers via `Authors@R`. The Description looks like it needs to be line-wrapped at 80 characters, as it did not display correctly compared with other packages in the Help pane.

Functionality
The tests covered a good range and used mock testing to quickly pass on the local machine.
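As a side note, the simple constructors could also be covered without any mocking; a minimal `testthat` sketch for `qry_sql()` (the constructor shown further down in this review) might be:

```r
library(testthat)

test_that("qry_sql() builds a query object", {
  # "SELECT 1" is just an arbitrary query string for the test
  q <- data.world::qry_sql("SELECT 1")

  expect_s3_class(q, "qry_sql")
  expect_equal(q$query_string, "SELECT 1")
  expect_null(q$params)
})
```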
Code review - data.world
structure()

The below code could use the existing `structure()` function, e.g.

## present function
qry_sql <- function(query_string, params = NULL) {
me <- list(query_string = query_string, params = params)
class(me) <- "qry_sql"
return(me)
}
## suggested
qry_sql <- function(query_string, params = NULL) {
structure(
list(query_string = query_string, params = params),
class = "qry_sql"
)
}

Repeat for the config objects.

ini files

Only after examination of the code did I find the package is writing an ini file.

discovery queries

It's a little confusing what should count as a discovery query.

Code review - dwapi

linter issue?

The formatting for function arguments looked like incorrect indentation to me, and RStudio agreed when I did CMD+I:

# before
file_create_request <- function(file_name,
url,
description = NULL,
labels = NULL) {
if (file_name == "" | url == "") {
stop("name and url required")
}
me <-
list(name = file_name, source = file_source_create_request(url))
if (!is.null(description)) {
me$description <- description
}
if (!is.null(labels)) {
me$labels <- labels
}
class(me) <- "file_create_request"
return(me)
}
# after CMD+I
file_create_request <- function(file_name,
url,
description = NULL,
labels = NULL) {
if (file_name == "" | url == "") {
stop("name and url required")
}
me <-
list(name = file_name, source = file_source_create_request(url))
if (!is.null(description)) {
me$description <- description
}
if (!is.null(labels)) {
me$labels <- labels
}
class(me) <- "file_create_request"
return(me)
}

using structure()

Again, `structure()` could be used here (along with a small helper to drop NULLs):

# before
file_create_or_update_request <- function(file_name,
url = NULL,
description = NULL,
labels = NULL) {
if (file_name == "") {
stop("name is required")
}
me <- list(name = file_name)
if (!is.null(url)) {
me$source <- file_source_create_or_update_request(url)
}
if (!is.null(description)) {
me$description <- description
}
if (!is.null(labels)) {
me$labels <- labels
}
class(me) <- "file_create_or_update_request"
return(me)
}
# after
# function to remove NULLs from lists.
compact <- function(x) {
null <- vapply(x, is.null, logical(1))
x[!null]
}
file_create_or_update_request <- function(file_name,
url = NULL,
description = NULL,
labels = NULL) {
if (file_name == "") {
stop("name is required")
}
me <- structure(
list(
name = file_name,
# may need to alter to return NULL if passed NULL
source = file_source_create_or_update_request(url),
description = description,
labels = labels
),
class = "file_create_or_update_request"
)
compact(me)
}

growing vectors

The method of adding to requests doesn't look like idiomatic R, since it appends with `[[length(existing_files) + 1]]`. Also, why is the `dwapi::` prefix used to call the package's own function?

# before
add_file.dataset_create_request <-
function(request,
name,
url,
description = NULL,
labels = NULL) {
existing_files <- request$files
# O(N) ?
existing_files[[length(existing_files) + 1]] <-
dwapi::file_create_request(
name, url, description = description, labels = labels)
request$files <- existing_files
request
}
# after
add_file.dataset_create_request <-
function(request,
name,
url,
description = NULL,
labels = NULL) {
request$files <- c(request$files,
file_create_request(name,
url,
description = description,
labels = labels))
request
}

Using httr::RETRY

You may want to consider using `httr::RETRY()` for the API calls, so that transient failures are retried.

Creating a generic API fetch function

A lot of the API requests are constructing their own URLs, headers and response handling, e.g. something like:
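The helper itself is not spelled out here, but a minimal sketch could look like the following, reusing the option and helper names from the package code above (`dwapi.api_url`, `auth_token()`, `user_agent()`, `parse_success_or_error()`) and folding in the `httr::RETRY()` suggestion. The name `dw_do_api()` is only illustrative:

```r
dw_do_api <- function(verb, path_url, query_param = NULL) {
  # Build the full API URL from the base option plus the endpoint path
  api_url <- paste0(getOption("dwapi.api_url"), path_url)
  if (length(query_param) > 0) {
    api_url <- paste0(api_url, "?", paste0(query_param, collapse = "&"))
  }

  # One place for auth, user agent, and retries on transient failures
  response <- httr::RETRY(
    verb,
    api_url,
    httr::add_headers(Authorization = sprintf("Bearer %s", auth_token())),
    httr::user_agent(user_agent()),
    times = 3
  )

  parse_success_or_error(response)
}
```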
Then API functions can use it like:

# before
delete_files <-
function(dataset, file_names) {
query_param <-
lapply(file_names, function(name)
sprintf("name=%s", name))
if (length(query_param) == 0) {
print("empty names input = no-op")
} else {
api_url <-
paste(
sprintf(
"%s/datasets/%s/files?", getOption("dwapi.api_url"),
extract_dataset_key(dataset)),
paste0(query_param, collapse = "&"),
sep = ""
)
auth <- sprintf("Bearer %s", auth_token())
response <-
httr::DELETE(
api_url,
httr::add_headers(Authorization = auth),
httr::user_agent(user_agent())
)
ret <- parse_success_or_error(response)
ret
}
}
# after
delete_files <-
function(dataset, file_names) {
query_param <-
lapply(file_names, function(name)
sprintf("name=%s", name))
if (length(query_param) == 0) {
print("empty names input = no-op")
} else {
dw_do_api("DELETE",
path_url = sprintf("/datasets/%s/files", extract_dataset_key(dataset)),
query_param = query_param)
}
}

Repeat as necessary for all the other API requests. You could also add the response status code test there.

Final approval (post-review)
Estimated hours spent reviewing: 5
Thanks @MarkEdmondson1234 very much for your review!
@sckott and @MarkEdmondson1234, first, thank you. Second, I'm unsure about the process at this point. Lots of good feedback here, but no sense of severity. With limited resources, I'd like to limit any changes to a minimum and continue to implement improvements over time. What is your expectation at this point? The main point I would like to address is why data.world and dwapi. We are serving two audiences with those packages: those looking for high-level, low-entry-barrier features and those looking for low-level access and maximum flexibility, respectively. I want to make them easy to learn and their features easy to discover for the right users, and that's why I opted for two packages (two namespaces) and to export all relevant functions from each. data.world is sparse right now because it's still in its infancy and we want to be judicious about what features we add to it. Our intent is for it to contain functionality that is most useful in the context of interactive programming (RStudio, R Notebooks, etc.). If you object to this approach or see any severe implementation/conformance flaws, let me know. Otherwise, I'll create issues on our repo for everything you reported and would like to address them over time. Thanks again!
@rflprr at this point I think it's best to wait for the 2nd review to come in to have all the reviewer comments. Once the 2nd review is in, I can help think about which changes may be most important.
Hello, @sckott. Thanks for your response. Any idea of when I should expect the 2nd review to be completed?
Summary

Overall I think these are very useful packages which will encourage the use and analyses of data from data.world!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
Documentation

The package includes all the following forms of documentation:
I suggest giving a little more information about the
I could install the development version of the package without any problem.
The vignettes were helpful in understanding the core functionalities of the package.

quickstart

Does not mention how configuration can be provided via environment variables and at runtime. Tangentially related, when I signed up with data.world, there was a typo for "collaborators" in the getting started screen.

query

The SELECT queries
Parameterized SPARQL queries
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 7
@rflprr let me or the reviewers know if you have any questions as you go through the reviews.
@sckott While constructive and certainly helpful, I'm not seeing anything in the feedback that appears to be severe/urgent. If you do, calling those things out would be the most helpful thing at this point. Thank you very much, @MarkEdmondson1234 and @nistara. Off-topic: If you know good R engineers who we could hire on a freelance basis to help us improve and continue to develop these libraries, I would appreciate introductions.
@rflprr can you clarify what you mean by "I'm not seeing anything in the feedback that appears to be severe/urgent"? The way our process works is that we want submitters to respond to reviewers, making changes, or, if they don't think they should change something, giving a reason why not.
To clarify: the purpose of this review is to improve the package along many lines (usability, performance, documentation, maintainability). The first people you need to convince of what is or is not necessary are the reviewers. Here's a good example of responding to reviewers (though it doesn't all have to be in one comment): #93 (comment). If you and a reviewer disagree about the necessity of something, then an editor can weigh in (e.g., #127 (comment)). While we understand that there is considerable effort in updating a package post-review, we think it is worth it.
@rflprr Any thoughts on my comments above?
@rflprr we do expect responses to reviewers. You don't have to respond to every single item they raised, but not responding at all is not an option. Please get your responses to reviewers in within 3 weeks. If not, we'll close the issue, and you're welcome to resubmit at a later point if you like.
@sckott sorry for the communication hiatus (it's been a busy period and we are resource constrained). The reviewers did a very thorough job. Reading through both reviews I see many excellent suggestions, but none that would significantly improve (or de-risk) the user experience in the immediate term (in other words, none that appear to be critical). Most importantly, all suggestions can be implemented in a backwards-compatible way over time. I cannot commit to making these changes at the moment, so we can proceed with the packages in their current state or pause/abort the onboarding process, if you think that is the right thing to do. Additionally, we intend to continue to develop these packages if I can find help. I was serious when I asked for references for freelance developers. If you know good ones and could get me in touch with someone, I would certainly appreciate any intros.
Okay, we'll put it on hold for now. That means if you or someone else has time later to make changes/respond to reviewers, we can remove the holding tag and proceed.
According to our policies (https://ropensci.github.io/dev_guide/policies.html#review-process), it's been more than 1 year since this was put on hold (holding label), so it will now be closed. If you wish to re-submit, you'll need to start a new submission.
Summary
What does this package do? (explain in 50 words or less):
This package allows users to consume data hosted on data.world. By way of its sibling package (dwapi) it also allows users to invoke data.world API endpoints.
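For example, a typical interactive session might look like the following sketch (dataset URL and table name are illustrative, taken from the quickstart vignette):

```r
library(data.world)

intro_ds <- "https://data.world/jonloyens/an-intro-to-dataworld-dataset"

# Run a SQL query against a hosted dataset and get a tibble back
stats <- query(qry_sql("SELECT * FROM DataDotWorldBBallStats"), intro_ds)
```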
Paste the full DESCRIPTION file inside a code block below:
URL for the package (the development repository, not a stylized html page):
https://github.com/datadotworld/data.world-r
Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility). If you are unsure, we suggest you make a pre-submission inquiry:
[e.g., "data extraction, because the package parses a scientific data file format"]
This package fits under Data Retrieval (and/or, possibly, Data Publication), because it allows access to data on data.world and, also, creation of datasets via API wrappers in the dwapi package.
Who is the target audience and what are scientific applications of this package?
Data scientists, data analysts, researchers, etc. — any professional or enthusiast working on data projects.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
I am not aware of any other R packages designed to work with data.world.
Requirements
Confirm each of the following by checking the box. This package:
Publication options

paper.md with a high-level description in the package root or in inst/.

Detail
Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: