This template repository helps make new Python projects easier to set up and more
uniform. It contains a lot of infrastructure surrounding a minimal Python package named
cheshire
(the cat who isn't entirely there...).
- Choose a name for the new package that you are creating.
- The name of the repository should be the same as the name of the new Python package
you are going to create. E.g. a repository at
catalyst-cooperative/cheshire
should be used to define a package namedcheshire
. - Fork this template repository to create a new Python project repo. See these instructions.
- Clone the new repository to your development machine.
- Run
pre-commit install
in the newly clone repository to install the pre-commit hooks defined in.pre-commit-config.yaml
- Create the
cheshire
conda environment by runningconda env create
or (preferably)mamba env create
in the top level of the repository. - Activate the new conda environment with
conda activate cheshire
. - Run
tox
from the top level of the repository to verify that everything is working correctly.
Once you know that your forked version of the cheshire
package is working as
expected, you should update the package and distribution names in your new repo to
reflect the name of your new package. The package name is determined by the name of
the directory under src/
which contains the source code, and is the name you'll use
to import the package for use in a program, script, or notebook. E.g.:
import cheshire
The distribution name is the name that is used to install the software using a
program like pip
, conda
, or mamba
. It is often identical to the package
name, but can also contain a prefix namespace that indicates the individual or
organization responsible for maintaining the pacakge. See PEP 423
PEP 423 for more on Python package naming
conventions. We are using the catalystcoop
namespace for the packages that we
publish, so our pudl
package becomes catalystcoop.pudl
in the
Python Package Index (PyPI) or on conda-forge
. Similarly the cheshire
package
becomes the catalystcoop.cheshire
distribution. The distribution name is determined
by the name
argument in the call to setup()
in setup.py
.
pip install catalystcoop.cheshire
The package and distribution names are referenced in many of the files in the template
repository, and they all need to be replaced with the name of your new package. You can
use grep -r
to search recursively through all of the files for the word cheshire
at the command line, or use the search-and-replace functionality of your IDE / text
editor. The name of the package directory under src/
will also need to be changed.
- Supply any required tokens, e.g. for CodeCov
- Rename the
src/cheshire
directory to reflect the new package name. - Search for
cheshire
and replace it as appropriate everywhere. Sometimes this will be with a distribution name likecatalystcoop.cheshire
(the package as it appears forpip
orPyPI
) and sometimes this will be the importable package name (the name of the directory undersrc
e.g.cheshire
) - Create the new project / package at Read The Docs.
- The
src
directory contains the code that will be packaged and deployed on the user system. That code is in a directory with the same name as the package. - Using a separate
src
directory helps avoid accidentally importing the package when you're working in the top level directory of the repository. - A simple python module (
dummy.py
), and a separate module providing a command line interface to that module (cli.py
) are included as examples. - Any files in the
src/package_data/
directory will also be packaged and deployed. - What files are included in or excluded from the package on the user's system is
controlled by the
MANIFEST.in
file and some options in the call tosetup()
insetup.py
. - The CLI is deployed using a
console_script
entrypoint defined insetup.py
. - We use
setuptools_scm
to obtain the package's version directly fromgit
tags, rather than storing it in the repository and manually updating it. README.rst
is read in and used for the pacakge'slong_description
. This is what is displayed on the PyPI page for the package. For example, see the PUDL Catalog page.- By default we create at least three sets of "extras" -- additional optional package
dependencies that can be installed in special circumstances:
dev
,docs`
, andtests
. The packages listed there are used in development, building the docs, and running the tests (respectively) but aren't required for a normal user who is just installing the package frompip
orconda
. - Python has recently evolved a more diverse community of build and packaging tools.
Which flavor is being used by a given package is indicated by the contents of
pyproject.toml
. That file also contains configuration for a few other tools, includingblack
andisort
, described in the section on linters and formatters below.
- A skeleton pytest testing setup is included in the
tests/
directory. - Tests are split into
unit
andintegration
categories. - Session-wide test fixtures, additional command line options, and other pytest
configuration can be added to
tests/conftest.py
- Exactly what pytest commands are run during continuous integration controlled by Tox.
- Pytest can also be run manually without using Tox, but will use whatever your personal python environment happens to be, rather than the one specified by the package. Running pytest on its own is a good way to debug new or failing tests quickly, but we should always use Tox and its virtual environment for actual testing.
- We define several different test environments for use with Tox in
tox.ini
- Tox is used to run pytest in an isolated Python virtual environment.
- We also use Tox to coordinate running the code linters, building the documentation, and releasing the software to PyPI.
- The default Tox environment is named
ci
and it will run the linters, build the documentation, run all the tests, and generate test coverage statistics. tox.ini
also contains sections near the bottom which configure the behavior ofdoc8
,flake8
,pytest
, andrstcheck
.
- A variety of sanity checks are defined as git pre-commit hooks -- they run any time you try to make a commit, to catch common issues before they are saved. Many of these hooks are taken from the excellent pre-commit project.
- The hooks are configured in
.pre-commit-config.yaml
- For them to run automatically when you try to make a commit, you must install the pre-commit hooks in your cloned repository first. This only has to be done once.
- These checks are run as part of our CI, and the CI will fail if the pre-commit hooks fail.
- We also use the pre-commit.ci service to run the same checks on any code that is pushed to GitHub, and to apply standard code formatting to the PR in case it hasn't been run locally prior to being committed.
To avoid the tedium of meticulously formatting all the code ourselves, and to ensure as
standard style of formatting and sytactical idioms across the codebase, we use several
automatic code formatters, which run as pre-commit hooks. Many of them can also be
integrated direclty into your text editor or IDE with the appropriate plugins. The
following formatters are included in the template .pre-commit-config.yaml
:
- Use only absolute import paths
- Standardize the sorting of imports
- Remove unneccesary f-strings
- Upgrade type hints for built-in types
- Upgrade Python syntax
- Deterministic formatting with Black
- We also have a custom hook that clears Jupyter notebook outputs prior to committing.
To catch errors before commits are made, and to ensure uniform formatting across the codebase, we also use a bunch of different linters. They don't change the code or documentation files, but they will raise an error or warning when something doesn't look right so you can fix it.
- bandit identifies code patterns known to cause security issues.
- doc8 and rstcheck look for formatting issues in our docstrings
and the standalone ReStructuredText (RST) files under the
docs/
directory. - flake8 is an extensible Python linting framework, with a bunch of plugins.
- mypy Does static type checking, and ensures that our code uses type annotations.
- pre-commit has a collection of built-in checks that use
pygrep to search Python files for
common problems like blanket
# noqa
annotations, as well as language agnostic problems like accidentally checking large binary files into the repository or having unresolved merge conflicts. - hadolint checks Dockerfiles for errors and violations of best practices. It runs as a pre-commit hook.
- We use Tox and a the pytest coverage plugin to measure and record what percentage of our codebase is being tested, and to identify which modules, functions, and individual lines of code are not being exercised by the tests.
- When you run
tox
ortox -e ci
(which is equivalent) a summary of the test coverage will be printed at the end of the tests (assuming they succeed). The full details of the test coverage is written tocoverage.xml
. - There are some configuration options for this process set in the
.coveragerc
file in the top level directory of the repository. - When the tests are run via the
tox-pytest
workflow in GitHub Actions, the test coverage data from thecoverage.xml
output is uploaded to a service called CodeCov that saves historical data about our test coverage, and provides a nice visual representation of the data -- identifying which subpackages, modules, and individual lines of are being tested. For example, here are the results for the cheshire repo. - The connection to CodeCov is configured in the
.codecov.yml
YAML file. - In theory, we should be able to automatically turn CodeCov on for all of our GitHub
repos, and it just Just Work, but in practice we've had to turn it on in the GitHub
configuration one-by-one. Open source repositories are also supposed to be able to
upload to the CodeCov site without requiring authentication, but this also hasn't
worked, so thus far we've needed to request a new token for each repository. This
token is stored in
.codecov.yml
. - Once it's enabled, CodeCov also adds a couple of test coverage checks to any pull request, to alert us if a PR reduces overall test coverage (which we would like to avoid).
- We build our documentation using Sphinx.
- Standalone docs files are stored under the
docs/
directory, and the Sphinx configuration is there inconf.py
as well. - We use Sphinx AutoAPI to
convert the docstrings embedded in the python modules under
src/
into additional documentation automatically. - The top level documentation index simply includes this
README.rst
, theLICENSE.txt
andCODE_OF_CONDUCT.md
files are similarly referenced. The only standalone documentation file underdocs/
right now is therelease_notes.rst
. - Unless you're debugging something specific, the docs should always be built using
tox -e docs
as that will lint the source files usingdoc8
andrstcheck
, and wipe previously generated documentation to build everything from scratch. The docs are also rebuilt as part of the normal Tox run (equivalent totox -e ci
). - If you add something to the documentation generation process that needs to be cleaned
up after, it should be integrated with the Sphinx hooks. There are some examples of
how to do this at the bottom of
docs/conf.py
in the "custom build operations" section. For example, this is how we automatically regenerate the data dictionaries based on the PUDL metadata whenever the docs are built, ensuring that the docs stay up to date.
- We use the popular Read the Docs service to host our documentation.
- When you open a PR, push to
dev
ormain
, or tag a release, the associated documentation is automatically built on Read the Docs. - There's some minimal configuration stored in the
.readthedocs.yml
file, but setting up this integration for a new repository requires some setup on the Read the Docs site. - Create an account on Read the Docs using your GitHub identity, go to "My Projects" under the dropdown menu in the upper righthand corner, and click on "Import a Project." It should list the repositories that you have access to on GitHub. You may need to click on the Catalyst Cooperative logo in the right hand sidebar.
- It will ask you for a project name -- this will become part of the domain name for the
documentation page on RTD and should be the same as the distribution name, but with
dots and underscores replaced with dashes. E.g.
catalystcoop-cheshire
orcatalystcoop-pudl-catalog
. - Under Advanced Settings, make sure you enable builds on PRs. This will add a check ensuring that the documentation has built successfully on RTD for any PR in the repo.
- Under the Builds section for the new project (repo) you'll need to tell it which
branches you want it to build, beyond the default
main
branch. - Once the repository is connected to Read the Docs, an initial build of the
documentation from the
main
branch should start.
We use GitHub's Dependabot
to automatically update the allowable versions of packages we depend on. This applies
to both the Python dependencies specified in setup.py
and to the versions of the
GitHub Actions that we employ. The dependabot
behavior is configured in .github/dependabot.yml
Under .github/workflows
are YAML files that configure the GitHub Actions associated with the repository. We use GitHub
Actions to:
- Run continuous integration using tox on several different versions of Python.
- Build a Docker container with repo2docker
which encapsulates the conda environment defined by the top level
environment.yml
Note that for this action to succeed, you will need to create a personal access token on Docker Hub and create new repository secrets to store your username and token calledDOCKERHUB_USERNAME
andDOCKERHUB_TOKEN
and make sure that the Docker Hub repository you're trying to push to exists. - Build a Docker container directly and push it to Docker Hub using the docker-build-push action.
Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.
- For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions
- If you'd like to get occasional updates about our projects sign up for our email list.
- Want to schedule a time to chat with us one-on-one? Join us for Office Hours
- Follow us on Twitter: @CatalystCoop
- More info on our website: https://catalyst.coop
- For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: pudl@catalyst.coop