CRAN plans #5
I agree that getting this on CRAN would be helpful. The only work on tif that I have done since September was to make cleanNLP compliant with the format; I believe @kbenoit has done the same with quanteda as well. In addition to CRAN, how do we feel about getting this to be an official rOpenSci project? In many ways I see that as perhaps even more important for getting larger adoption.
Yes, I agree that going through rOpenSci onboarding would be a good step to take before publishing to CRAN. I'm checking this now, but as far as I know all the functions in tokenizers output lists in the format tif expects, and so the package should work with tif.
@lmullen It looks like tokenizers is good in terms of the output:
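The output that comment refers to was apparently a stripped code block. As a hedged base-R sketch of what a tif-compliant tokens object looks like (the shape the discussion attributes to tokenizers), it is a named list with one character vector of tokens per document:

```r
# Sketch of the tif tokens-list format: a named list, one character
# vector of tokens per document. The names and values here are
# illustrative, not taken from the original thread.
tokens <- list(
  doc1 = c("the", "cat", "sat"),
  doc2 = c("the", "dog", "ran")
)

# The kind of checks a tif validator would run (base-R approximation):
stopifnot(
  is.list(tokens),
  !is.null(names(tokens)),
  all(vapply(tokens, is.character, logical(1)))
)
```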
However, tokenizers currently does not accept a data frame corpus object as input. The current specification says that "packages should accept both and return or coerce to at least one of these."
@statsmaths Right. What I'm wondering is whether the "accept both" language should be "accept at least one." With coercion functions available in tif, a user could convert a corpus to whichever format a given package accepts. For instance, a user might do something like this.
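The code that originally followed was lost in extraction. A base-R sketch of the kind of coercion being discussed (column names `doc_id`/`text` follow the tif corpus data frame layout; the downstream tokenizer call is hypothetical):

```r
# A tif-style corpus data frame: doc_id and text columns
corpus_df <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("The cat sat.", "The dog ran."),
  stringsAsFactors = FALSE
)

# If a package only accepts the character-vector corpus format, the
# user (or a tif coercion helper) can convert before calling it:
corpus_chr <- setNames(corpus_df$text, corpus_df$doc_id)

# tokenize_words(corpus_chr)  # hypothetical downstream call
```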
If the consensus is that "accept both" is the right way to go, then I am willing to adopt that in tokenizers. I'd just need
I fully support the idea of returning to and completing the tif package, along with extensive guidelines for adoption. I would propose that tif have the checking functions, and conversion to and from its own general types, but that we issue guidelines for each package to import and export its formats. I'm not sure that these guidelines should dictate nomenclature, since each package has its own preferences for naming things; in quanteda, for instance, we have our own custom classes and function names. We could, however, maintain a checklist and table on the GitHub site for tif compliance for each package, with associated function names for I/O. This could be based on a template of tests that each package must pass, after substituting a list of generic function references with the package's own functions. If those pass, then a package is fully compliant. I am sprinting through Mar 23 to complete our spring term but could work on this the first or second week of April.
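The template-of-tests idea above can be sketched in base R. This is a hypothetical compliance harness, not code from tif or quanteda: a package would substitute its own tokenizer for `tokenize_fun` (here a `strsplit` stand-in), and passing the checks would mark it compliant for tokens-list output:

```r
# Hypothetical compliance-test template. Each package swaps in its own
# function for `tokenize_fun`; strsplit() stands in for illustration.
tokenize_fun <- function(corpus) strsplit(tolower(corpus), "\\s+")

corpus <- c(doc1 = "The cat sat", doc2 = "The dog ran")
out <- tokenize_fun(corpus)

# Generic checks any tif-compliant tokenizer should pass:
stopifnot(
  is.list(out),
  identical(names(out), names(corpus)),
  all(vapply(out, is.character, logical(1)))
)
```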
@kbenoit I agree with that description of the basic functionality; that's what the package currently provides. I like the idea of a table of compliance and would be willing to help with that whenever.
@lmullen Yes, that is exactly what the package does as written. I've fixed the column-ordering issue with the corpus and tokens objects. Can you think of any other outstanding issues that we need to address before publishing? I do think it would be great to have a vignette with a table showing compliant packages before pushing to CRAN (and, as mentioned above, doing the rOpenSci onboarding would be a good next step after that).
@statsmaths I've raised a separate question in #8. But if the answer to that is no, then I don't see any reason this couldn't be published to CRAN. I do think that such a vignette would be great, but it could wait for a point release, since I suspect it will be very time-consuming to create. Your call, of course.
BTW the master branch of tokenizers now meets the requirement to take corpus data frames, which I was waffling on earlier. |
Okay, I agree that we can probably wait on developing a table of compliant packages. I just uploaded some changes to follow the rOpenSci onboarding guidelines and pass the goodpractice checks. Once we resolve issue #8 (I just commented there), I'm comfortable uploading to CRAN.
🚀 I'm fine with passing over the suggestion in #8. It would make things needlessly complicated.
I think it would be crucial to include a vignette explaining not only the standard but also how a package can meet the requirements for compliance. I also think we can include tests with global references that each package replaces with its own functions, so that the same test code runs for every package; passing the tests = compliance. I think it would be more natural to complete this before publishing on CRAN. I'm happy to work on this from Mar 26.
A good rule of thumb is that any argument for deliberateness should trump an argument for speed. 😄 I'm willing to pick this back up at the end of March / beginning of April and help in any way with the vignettes and tests. If we are going to wait, I wonder if we should also pursue a more formal means of comment on the formats before release. That could mean sending an e-mail requesting comment to all the people Ken assembled last year for the meeting, and perhaps another to the maintainers of packages on the relevant CRAN task view. We're really talking about a human problem, not a technical problem, and a widespread review might help get buy-in from the community.
Sounds wise to me. See ropensci/textworkshop18#8: we thought this could be worth revisiting next month, with an aim toward closure for an initial release. I'm happy to draft my test-suite-for-certification idea, but I just cannot squeeze more time out of my next 10 days.
It's been a while, but I wanted to inquire whether there are plans to put the tif package on CRAN. I still think it would be a worthwhile addition to the ecosystem, and the work that @statsmaths did is a good base for a first release. I'm not sure whether there are unresolved discussions to have about the interchange formats, and whether we should have them now or at the NYU meeting. But I would find this package very helpful for moving between the various formats required by different text-mining packages.
I'm willing to do whatever is necessary if it would be helpful to get this on CRAN.
What are your thoughts, @kbenoit @statsmaths @patperry?