Skip to content

Restructure python directory #4098

Description

@adrifoster

In discussion with @ekluzek I propose we reorganize our python/ctsm directory into a more conventional python package layout. Essentially this means:

  1. Group the 20+ files inside python/ctsm/ into grouped modules/subdirectories, like already exists for site_and_regional, crop_calendars, etc.
  2. Move the test suite (currently in python/ctsm/test/) out of the main package and into python/tests/, and then group into unit/ and sys/ tests and then inside that mirror the new directory structure of python/ctsm/.
  3. Make pyproject.toml an actual build configuration, and then also expose our command-line tools as [project.scripts] entries, replacing all our copy-pasted skeleton wrappers in tools. (though actually see some potential problems with the command-line entry tools below)

Motivation

For the grouping: Right now a bunch of modules sit in one giant folder, alongside some that are already organized. This makes it hard to tell immediately what relates to what, and we may get import collisions as we continue to develop and add more modules.

For the tests: The test files (distinguished by test_unit_.. and test_sys_...) are also in one giant file inside the package. This is non-standard and also makes it awkward to distinguish between the fast unit tests and slower sys tests that require input data. And then anything that imports ctsm also pulls in all these tests.

For the tools/ wrappers: These skeleton wrappers are nice in that they separate the calling exectuable from the actual program and make it more easily accessible, but they all share the same copy-pasted boilerplate code:

_CTSM_PYTHON = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                            os.pardir,
                            os.pardir,
                            'python')
sys.path.insert(1, _CTSM_PYTHON)

from ctsm.crop_calendars.process_ggcmi_shdates import main

if __name__ == "__main__":
    main()

This hard-codes the relative path from the tool back to the python/ directory. So every new tool needs to copy-paste this code, and if the structure ever change we have to change all the instances of it.

If we actually have a pyproject.toml file that builds our package (right now it just sets up black), we can set up project scripts, as well as have depencencies, a package version, as well as some other nice things like pylint and pytest settings.

Proposed new structure

Here is the structure I propose, though we may want to work out together the exact sub-directories and what goes in them under the ctsm/ dir.

python/
├── ctsm/
│   ├── crop_calendars/        # already exists
│   ├── joblauncher/           # already exists
│   ├── modify_input_files/    # already exists
│   ├── param_utils/           # already exists
│   ├── site_and_regional/     # already exists
│   ├── toolchain/             # already exists
│   ├── shared/                # utils, arg_utils, logging, etc, (?)
│   ├── mesh/               
│   ├── lilac/    
│   ├── machine/                          
│   │   etc....
├── tests/
│   ├── unit/  # mirrors ctsm structure
│   ├── sys/   # mirrors ctsm structure
├── pyproject.toml
├── conda_env_ctsm_py.yml
├── other needed files...
└── README.md

This module-to-package mapping under ctsm/ is just a starting point, and we should discuss it together.

The unit vs. sys split in tests/ is used instead of test_unit_{module}py vs. test_sys_{module}py, and should
make it eaiser to distinguish and call these tests separately.

The pyproject.toml would have:

[project.scripts]
subset_data = ctsm.site_and_regional.subset_data:main`
mesh_maker = ctsm.mesh.mesh_maker:main
... etc.

which is nice because it's one list of all our CLI tools, and removes the need for our skeleton wrappers and the sys.path hack. Personally, as a user, I find it really annoying to have to navigate to the location of the executable inside tools/ and then even more annoying to have to navigate back out to find the actual file where the code exists if anything goes wrong of if I need to read the code.

By adding the package buid and these scripts with the pyproject.toml, we would also want to add a pip install -e to our conda env yaml file so that everything gets built with ctsm_pylib.

Caveats/Problems (and possible other solution)

In talking with @billsacks , it seems that these command-line entry points (along with the package install) introduce a potential huge problem for some use cases. This isn't something I personally do, but if you have multiple "versions" (i.e. directories) of the CTSM repo for different projects/branches/etc., this would introduce some potential (silent!) errors. The command-line entry points set up in [project.scripts] are tied to the environment they was installed in, not whatever directory you are in. So, for example, if you have two CTSMs with different python/ctsm/site_and_regional/subset_data.pys, if you just call subset_data that version maps (via $PATH to envs/ctsm_pylib/bin/subset_data) to whatever directory you installed it in, so you would need to know that and I suppose keep track of all that yourself....

I think this might be fine for developers (?), but for naive users who also do this, this makes me nervous. Apparently folks working on CIME are also talking about stuff like this and they haven't been able to come up with a solution.

One thing we could do instead is to keep the wrappers in tools/ but generate them automatically
with a tool that uses a template. This could run in a --check mode in our github workflows that checks to make sure all the tools are included and correctly set up. This doesn't fix the "it's annoying to have to use the full path to the executable script" problem but perhaps that's just what we need to do for this type of package where people are using multiple directories/versions of it all over the place. This would also mean we wouldn't install the package in ctsm_pylib and we would have to go back to just using our sys.path hacks, but I think we can still fill out pyproject.toml more, which could help with CI workflows.

But I think we should talk more about this... @ekluzek, @billsacks

Rollout problems

So restructuring everything will break existing branches people have working on our tools - so we need to be careful about this. I'm not sure there is an easy solution to this. We can use git mv but I don't think git will be able to figure it out automatically if someone else has updates in place to the file pre-move.

Metadata

Metadata

Assignees

No one assigned

    Labels

    b4bbit-for-bitcode healthimproving internal code structure to make easier to maintain (sustainability)nextthis should get some attention in the next week or two. Normally each Thursday SE meeting.
    No fields configured for Code Health.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions