Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-organizing VASP I/O code #3997

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft

Re-organizing VASP I/O code #3997

wants to merge 2 commits into from

Conversation

mkhorton
Copy link
Member

Summary

Re-opening changes initially introduced by @janosh in #3865 and later reverted by @shyuep (see thread).

I will be taking responsibility for this PR. This PR will require approval from both @janosh and @shyuep before merging.

Specifically,

  • For each module changed, I will manually inspect __dir__() to make sure all imports remain unaffected and that there are no backwards-incompatible changes.
  • Also split pymatgen.io.vasp.inputs and pymatgen.io.vasp.outputs into separate files.

This change is well-motivated to improve maintainability.

I hope we can get this PR merged without too much difficulty. Thank you again to @janosh for starting this PR, to @Andrew-S-Rosen for the bug report due to the errors caused, and to @shyuep for implementing a rapid fix.

@mkhorton mkhorton requested review from shyuep and janosh as code owners August 12, 2024 22:29
@mkhorton mkhorton changed the title Re-organizing Re-organizing VASP I/O code Aug 12, 2024
@mkhorton mkhorton marked this pull request as draft August 12, 2024 22:30
@@ -0,0 +1,107 @@
# ruff: noqa: PGH003

This comment was marked as resolved.

@shyuep
Copy link
Member

shyuep commented Aug 13, 2024

Here are my comments:

  1. Axiom 1: No. of lines is not a good metric of whether a module should be broken up. In even the most basic IDEs, there are tools to show the class structure, not to mention keyboard shortcuts to jump to any class/method name/line. I can't recall when the last time I had to use mouse scrolling for any file was. Note that the code files in vasp io are hardly the largest ones in pymatgen.
  2. Given Axiom 1, the only reason to break up a module into separate pieces is if there is a logical consistency to the subparts. While I can get onboard with breaking up sets.py (which has many classes), inputs.py and outputs.py should stay as they are. There is no good reason to split them up. Inputs only has 5 classes and they all pertain to vasp inputs.
  3. Even when breaking up sets.py, there is a certain inconsistency to the current choice. Why should input sets be split based on who wrote them, i.e., MP, MIT, MVL? Why not based on type of calculation (relaxation, static, electronic structure, NEB, etc.)? Do people select input sets based on organization? Isn't it more logical that they select an input set based on what they want to do?
  4. In so far as module subdivision is merely a developer-related detail (which it seems you are suggesting it is, since the primary justification was "ease of maintenance") and not a user API detail, the modules should all be private, i.e., with a "_" prefix. This gives the flexibility to completely reorganize when needed without some user seeing his code from pymatgen.vasp.io.sets.mp import MPRelaxSet suffer from backwards incompatible breakages. I do not want to see anyone using that line - it must be from pymatgen.vasp.io.sets import MPRelaxSet, even if the sets reside in separate files.

I refer you to the scikit-learn module and class organization. Please use that as a guide on how to organize. Finally, I would point out that many of sklearn's module files are about the same length as sets.py (2800-3200 lines). They do not break up a module for maintenance - they break it up when there is a logic to the organization.

@mkhorton
Copy link
Member Author

The sklearn comparison is illustrative: even where there are large LOC, often there is only a single class per file, and they are quite readable. This is not the case for these specific files in pymatgen. I am not making axiomatic statements; I agree that a high number of LOC is not in itself an error.

Long-time pymatgen developers (we are in good company) might know exactly where in the file to go to make an edit, but before you can get there, one has to read and understand the structure of the code. It is nice to know what classes are available, for example, without needing to read the entire file first.

Point 3 I concur; perhaps it is better to split up the input sets according to functionality instead.

@janosh
Copy link
Member

janosh commented Aug 15, 2024

Re 3, i think the most common use case is to have consistent settings across whatever mixture of workflows you run (relax, band gap, EOS, phonon, etc.). so after having run some number of relaxations, a user is unlikely to care about half a dozen other band gap workflows with inconsistent VASP settings to their relaxations. more likely, the only thing they want to know is if a band gap workflow from the same lineage (say MP) exists. hence splitting input sets by lineage makes sense imo.

i guess it doesn't have to be one or the other. if you look at atomate2/vasp/jobs, it has a mixture of splits by lineage (mp.py, matpes.py, lobster.py, amset.py) and by workflow type (md.py, elastic.py, phonon.py, ...)

@shyuep
Copy link
Member

shyuep commented Aug 21, 2024

If splitting by source is preferred, I am ok with that. Note that I don't really think atomate2's organizational structure should be the reference here. In any case, if the module split is a private implementation, then it really doesn't matter since we can always reorganize as needed without breaking downstream codes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants