[PRE REVIEW]: Synthia: multi-dimensional synthetic data generation in Python #2779

Closed
whedon opened this issue Oct 24, 2020 · 46 comments

@whedon

whedon commented Oct 24, 2020

Submitting author: @dmey (D. Meyer)
Repository: https://github.com/dmey/synthia
Version: 1.0.0
Editor: @oliviaguest
Reviewers: @khinsen, @mnarayan
Managing EiC: Kyle Niemeyer

⚠️ JOSS reduced service mode ⚠️

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Author instructions

Thanks for submitting your paper to JOSS @dmey. Currently, there isn't a JOSS editor assigned to your paper.

The author's suggestion for the handling editor is @arfon.

@dmey if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, the people on this list have already agreed to review for JOSS and may be suitable for this submission (please start at the bottom of the list).

Editor instructions

The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you type:

@whedon commands
@whedon

whedon commented Oct 24, 2020

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

@whedon

whedon commented Oct 24, 2020

Software report (experimental):

github.com/AlDanial/cloc v 1.84  T=0.12 s (354.6 files/s, 60453.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
SVG                              1              0              0           4607
Python                          20            316            361            839
Markdown                         7             89              0            115
Jupyter Notebook                 4              0            390             81
YAML                             3             11              5             71
CSS                              1              7              7             61
TeX                              1              5              0             47
reStructuredText                 3             38             66             41
INI                              1              0              0              2
HTML                             1              0              0              2
-------------------------------------------------------------------------------
SUM:                            42            466            829           5866
-------------------------------------------------------------------------------


Statistical information for the repository '40a53db89b90e75a2c9bfb3d' was
gathered on 2020/10/24.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Thomas Nagler                    2           137             34            7.47
dmey                             9          1766            353           92.53

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Thomas Nagler                95           69.3          0.0               28.42
dmey                       1421           80.5          0.0                9.85

@whedon whedon added the Python label Oct 24, 2020
@whedon

whedon commented Oct 24, 2020

PDF failed to compile for issue #2779 with the following error:

Can't find any papers to compile :-(

@kyleniemeyer

@whedon generate pdf from branch joss-paper

@whedon

whedon commented Oct 24, 2020

Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...

@whedon

whedon commented Oct 24, 2020

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@kyleniemeyer

@whedon query scope

@whedon whedon added the query-scope Submissions of uncertain scope for JOSS label Oct 24, 2020
@whedon

whedon commented Oct 24, 2020

Submission flagged for editorial review.

@kyleniemeyer

Hi @dmey, thanks for your submission to JOSS. Due to the relatively small size of your software package, the editorial board is going to take a closer look at whether it falls within our scope.

@dmey

dmey commented Oct 24, 2020

Hi @kyleniemeyer, many thanks for letting me know. In case this is of relevance to the board, this package has been used in two papers (currently in preparation) which I am planning to submit in the next few weeks. Furthermore, the tool is novel in its approach, well written and likely to be cited by future machine learning (ML) groups.

@dmey

dmey commented Oct 24, 2020

@kyleniemeyer just as a clarification to my previous message -- I am going to upload the scientific papers that make use of/cite Synthia to arXiv in a couple of weeks while their peer review takes place, so I can update this thread with links to those papers. Originally, I thought that this was going to be discussed during review, but I am more than happy to wait here if that makes it easier to show the novelty and contribution of this tool to the community.

@VivianePons

I'm having a look at the paper regarding the scope query requested by @kyleniemeyer. Is it normal that the paper is extremely short? The GitHub PDF only contains "Summary" and "Acknowledgments" sections. It seems rather incomplete, and I wonder if this is an unintentional mistake.

@dmey

dmey commented Oct 26, 2020

@VivianePons thanks for looking into this. My understanding is that the summary paper needs to be very short and abstract-like, as it is meant only as a summary of the motivation and purpose of the tool, and because the purpose of the review is to review the software rather than the paper, as is done in more traditional journals. I have checked again at https://joss.readthedocs.io/en/latest/submitting.html and it says that the summary paper should be between 250 and 1000 words. I am more than happy to extend it, especially given that my first draft was much longer and I cut it down considerably at submission to make it more to the point.

@VivianePons

Indeed, papers are rather short but they are still a bit more furnished. Look at our example paper: https://joss.readthedocs.io/en/latest/submitting.html#example-paper-and-bibliography

In particular, papers should contain a "Statement of need", which is missing in your case. You can also have some other sections such as "Features" or "Examples".

You can browse through our recent publications to give you an idea.

@dmey

dmey commented Oct 26, 2020

@VivianePons many thanks for clarifying this, please allow me to make the necessary changes as advised.

@danielskatz

@whedon check references from branch joss-paper

@whedon

whedon commented Oct 26, 2020

Attempting to check references... from custom branch joss-paper

@whedon

whedon commented Oct 26, 2020

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1201/b17116 is OK
- 10.1109/DSAA.2016.49 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@VivianePons

In particular, I would like to understand what your software adds specifically in terms of implementation. Considering the small amount of code, we might fear that it is mainly a Python wrapper around other tools such as vinecopulib. Could you give us some information regarding this aspect?

@arfon

arfon commented Oct 29, 2020

Could you give us some information regarding this aspect?

@dmey - could you elaborate? This will help us make our editorial scope decision.

@dmey

dmey commented Oct 29, 2020

@arfon -- may I give you my response by early next week?

@VivianePons

No problem!

@dmey

dmey commented Nov 4, 2020

@VivianePons apologies for the delay, but I have had no time to look at this yet -- could I get back to you sometime next week? Thanks.

@dmey

dmey commented Nov 13, 2020

@whedon generate pdf from branch joss-paper

@whedon

whedon commented Nov 13, 2020

Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...

@whedon

whedon commented Nov 13, 2020

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@dmey

dmey commented Nov 13, 2020

@arfon @VivianePons @kyleniemeyer and @danielskatz many thanks for allowing me to get back to you this week. We have recently extended the documentation, added more examples, and reworded the paper to address what I think were your main concerns. We have also added a couple of new features, that is, the handling of discrete and categorical data in the last two releases which brings the number of lines of pure Python code to 1097 (please see cloc output below).

With regards to your individual questions --

In particular, I would like to understand what your software adds specifically in terms of implementation.

@VivianePons thanks for raising this -- looking at the paper and repository with fresh eyes, I can see how this was unclear. I have now made changes to the repository, paper and website and hope that they make the purpose clearer. With regards to your specific question, Synthia can currently be used to model univariate and multivariate data, parameterize marginals with empirical and parametric methods, and apply manipulations such as stretching and uniformization (I have added a summary at https://dmey.github.io/synthia/features.html). For multivariate data we support three types of methods: fPCA, parametric (Gaussian) copula, and vine copula models. We provide a pure Python implementation for the first two and rely on vinecopulib for the third. Recently we have also added the capability to handle discrete and categorical data when using vine copulas.
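
To make the parametric (Gaussian) copula idea concrete, here is a minimal sketch of the general technique using only NumPy and the Python standard library. It is not Synthia's implementation or API: the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical, and the marginals are assumed to be modelled empirically through ranks and quantiles.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()  # standard normal, used for the score transform


def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix.

    `data` has shape (n_samples, n_features). Hypothetical helper,
    not part of Synthia.
    """
    n_samples = data.shape[0]
    # Rank-transform each column to pseudo-observations strictly inside (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n_samples + 1)
    # Map the pseudo-observations to normal scores and correlate them.
    z = np.vectorize(_STD_NORMAL.inv_cdf)(u)
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(data, corr, n_samples, seed=None):
    """Draw synthetic rows that follow the fitted dependence structure."""
    rng = np.random.default_rng(seed)
    # Sample correlated normal scores, then map them to uniforms.
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = np.vectorize(_STD_NORMAL.cdf)(z)
    # Invert the empirical marginals with per-column quantiles.
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    return np.column_stack(columns)


# Usage: two correlated features, then 1000 synthetic rows.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=500)])
corr = fit_gaussian_copula(data)
synthetic = sample_gaussian_copula(data, corr, n_samples=1000, seed=1)
```

Because the marginals are inverted through empirical quantiles, the synthetic values in this sketch stay within the range of the original data; parametric marginals would be needed to extrapolate beyond it.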

Considering the small amount of code, we might fear that it is mainly a python wrapper to some other tools like vinecopulib. Could you give us some information regarding this aspect?

We have tried to write Synthia succinctly, and the current count of pure Python lines of code according to the cloc tool is 1097 (see below). The use of vinecopulib is important, but it is not a required dependency: in our installation instructions vinecopulib is marked as an optional dependency (see https://dmey.github.io/synthia/installation.html). The amount of code that corresponds to the integration with vinecopulib is very small, about 20-30 lines. Furthermore, although vinecopulib does play an important role in Synthia, its purpose is limited to the generation of vines, not data generation in general.
Synthia presents a new method for generating multidimensional data in Python using fPCA together with Gaussian and vine copula models, natively handles multidimensional arrays and datasets (essential in computational sciences), and combines the parameterization and manipulation of univariate distributions in a single tool. I therefore believe the paper is within scope.

The scope of the journal (https://joss.readthedocs.io/en/latest/submitting.html) indicates that [our bold]:

JOSS publishes articles about research software. This definition includes software that: solves complex modeling problems in a scientific context (physics, mathematics, biology, medicine, social science, neuroscience, engineering); supports the functioning of research instruments or the execution of research experiments; extracts knowledge from large data sets; offers a mathematical library, or similar.

JOSS publishes articles about software that represent substantial scholarly effort on the part of the authors. Your software should be a significant contribution to the available open source software that either enables some new research challenges to be addressed or makes addressing research challenges significantly better (e.g., faster, easier, simpler)

I cited a paper which is going to be submitted in the next 10 days; I will let you know as soon as it has been deposited so that I can update the reference. Apologies for the long text, but I thought it would be best to address everything in one long comment.

As a side note, I think there is a small issue with typesetting the figures in the paper (Table 1). Would it be possible to reduce the text size or change the width a little so that the code blocks display as one-liners? Otherwise I could move them to a different layout.

Output from the cloc command (local run, commit id: 0da044afc3c6d7bad0b60f54dcf21ba2fb6374be).

      54 text files.
      54 unique files.
      21 files ignored.

github.com/AlDanial/cloc v 1.74  T=0.52 s (69.8 files/s, 4497.8 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          21            364            369           1097
Markdown                         9            117              0            206
YAML                             3             11              5             71
CSS                              1              7              7             61
INI                              1              0              0              2
HTML                             1              0              0              2
-------------------------------------------------------------------------------
SUM:                            36            499            381           1439
-------------------------------------------------------------------------------

@danielskatz

@whedon check repository

@whedon

whedon commented Nov 16, 2020

Software report (experimental):

github.com/AlDanial/cloc v 1.84  T=0.10 s (498.5 files/s, 83480.9 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
SVG                              1              0              0           4607
Python                          21            365            370           1102
Markdown                        10            112              0            198
Jupyter Notebook                 7              0            911            177
YAML                             3             11              5             71
CSS                              1              7              7             61
TeX                              1              5              0             47
reStructuredText                 3             37             68             40
INI                              1              0              0              2
HTML                             1              0              0              2
-------------------------------------------------------------------------------
SUM:                            49            537           1361           6307
-------------------------------------------------------------------------------


Statistical information for the repository '1f383df63cf604807d3377a9' was
gathered on 2020/11/16.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Maik Riechert                    2           242             37           10.05
Thomas Nagler                    2           137             34            6.16
dmey                            19          1927            398           83.78

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Maik Riechert               241           99.6          0.1                2.90
Thomas Nagler                87           63.5          0.8               24.14
dmey                       1509           78.3          0.0                8.88

@danielskatz

@openjournals/dev - any comments on this question from the author:

As a side note, I think there is a small issue with typesetting the figures in the paper (Table 1). Would it be possible to reduce the text size or change the width by a little so that the code blocks display as one liners. Otherwise I could move them to a different layout.

@danielskatz

👋 @oliviaguest - would you be willing to edit this for JOSS?

@danielskatz

@whedon invite @oliviaguest as editor

@whedon

whedon commented Nov 16, 2020

@oliviaguest has been invited to edit this submission.

@danielskatz danielskatz removed the query-scope Submissions of uncertain scope for JOSS label Nov 16, 2020
@oliviaguest

I am really inundated with work at the moment, so on the proviso I can start (looking for reviewers, etc.) next week, sure. ☺️

@danielskatz

Sure, that's fine!

@danielskatz

@whedon assign @oliviaguest as editor

@whedon

whedon commented Nov 16, 2020

OK, the editor is @oliviaguest

@oliviaguest

@dmey can you give me some ideas for reviewers, please? 😊

@oliviaguest

@whedon add @khinsen as reviewer

@whedon whedon assigned khinsen and oliviaguest and unassigned oliviaguest Nov 23, 2020
@whedon

whedon commented Nov 23, 2020

OK, @khinsen is now a reviewer

@dmey

dmey commented Nov 24, 2020

@dmey can you give me some ideas for reviewers, please? 😊

@oliviaguest, sure -- from the list of potential reviewers I have identified the following people who may be suitable to review this submission:

@sbrugman, @Scivision, @terrytangyuan, @malmaud, @stsievert, @seba-1511, @yzhao062,
@NicolasHug, @glemaitre, @JesperDramsch, @zaxtax, @mnarayan, @justusschock

In case none is available please let me know and I can suggest more.

@mnarayan

I would be happy to review.

@oliviaguest

@whedon add @mnarayan as reviewer

@whedon

whedon commented Nov 25, 2020

OK, @mnarayan is now a reviewer

@oliviaguest

@whedon start review

@whedon

whedon commented Nov 25, 2020

OK, I've started the review over in #2863.

@whedon whedon closed this as completed Nov 25, 2020