Pydocstyle crashes on literal strings in pyproject.toml #600
Description
Problem
When an unrelated tool's configuration (e.g. semantic_release) contains a multi-line literal string, such as a regular expression, delimited by ''', pydocstyle throws a toml.decoder.TomlDecodeError for an unterminated string. This likely does not happen with every literal string, but it does when there is a single quote inside the regexp.
My offending config:
# pyproject.toml
[tool.semantic_release]
version_pattern = [
# regular expression to find version value in `_version.py` file
'''src/pkg1/_version.py:__version__[ ]*[:=][ ]*["'](\d+\.\d+\.\d+)["']'''
]
[tool.pydocstyle]
convention = 'pep257'
Log
(venv) $ pydocstyle scripts/prepare.py
Traceback (most recent call last):
File "/workspaces/py-rpm/venv/bin/pydocstyle", line 8, in <module>
sys.exit(main())
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/cli.py", line 75, in main
sys.exit(run_pydocstyle())
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/cli.py", line 41, in run_pydocstyle
for (
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 288, in get_files_to_check
config = self._get_config(os.path.abspath(name))
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 369, in _get_config
config = self._get_config_by_discovery(node)
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 318, in _get_config_by_discovery
config = self._get_config(parent_dir)
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 369, in _get_config
config = self._get_config_by_discovery(node)
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 312, in _get_config_by_discovery
config_file = self._get_config_file_in_folder(path)
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 555, in _get_config_file_in_folder
if config.read(full_path) and cls._get_section_name(config):
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/pydocstyle/config.py", line 70, in read
self._config.update(toml.load(fp))
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/toml/decoder.py", line 156, in load
return loads(f.read(), _dict, decoder)
File "/workspaces/py-rpm/venv/lib/python3.8/site-packages/toml/decoder.py", line 362, in loads
raise TomlDecodeError("Unterminated string found."
toml.decoder.TomlDecodeError: Unterminated string found. Reached end of file. (line 121 column 1 char 2619)
Investigation
This seems to be a limitation of the parser implementation and of the version of the TOML standard it targets. I looked at the dependency tree of semantic_release and found that it uses the tomlkit library instead of toml because tomlkit supports v1.0.0 of the TOML standard rather than v0.5.0. Under the hood, there seem to be a few flaws in the toml parser (which targets TOML v0.5.0), since changing the regular expression in different ways gives different, non-obvious results. One such oddity: inside the triple single quotes ''', two double quotes " anywhere in the string cause an Unterminated string error, but a single one is fine. Another variation that shouldn't work but does is escaping the double quotes (i.e. \").
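The variations described above can be compared with a small probe. This is only a sketch: the `probe` helper and the `SAMPLES` strings are made up for illustration, and it runs the samples through `tomllib` (or the `tomli` backport), which targets TOML v1.0.0. Under that version of the standard a literal string may contain double quotes, and a backslash is kept verbatim rather than treated as an escape.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

# Hypothetical one-line documents, each exercising one variation
# of a ''' literal string.
SAMPLES = {
    "no double quote": r"""v = '''a'b'''""",
    "one double quote": r"""v = '''a"b'''""",
    "two double quotes": r"""v = '''["'] x ["']'''""",
    "backslash before quote": r"""v = '''a\"b'''""",
}


def probe(loads, samples):
    """Parse each sample with `loads`; record the value or the exception."""
    results = {}
    for label, doc in samples.items():
        try:
            results[label] = loads(doc)["v"]
        except Exception as exc:  # error types differ between parsers
            results[label] = exc
    return results


results = probe(tomllib.loads, SAMPLES)
for label, outcome in results.items():
    print(f"{label}: {outcome!r}")
```

Passing `toml.loads` instead of `tomllib.loads` would run the same samples against the older library, which makes it easier to see which variations it rejects.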
I also found that the toml library itself is stale and has not received any updates since Oct 2020, whereas tomlkit and its competitor tomli both received updates in the first half of 2022. Furthermore, the Python 3.11 docs also highlight these two frontrunners as the ideal libraries to read/write TOML. In some future year you may be able to use the Python 3.11 built-in library tomllib, but relying on it alone would clearly break compatibility with older Python versions for a few years.
Additional discussion on TOML support for raw/literal strings: toml-lang/toml#80
Recommendation
Switch the toml dependency to tomlkit or tomli.
I have tested both tomli==2.0.1 and tomlkit==0.10.2, and both parse my pyproject.toml configuration file (as provided above), regex included, without error. tomlkit does seem to be leading in popularity, but the tomli documentation is a bit better. Also of note, tomli.load() requires the file to have been opened in binary mode (rb) rather than in text mode with a specified encoding.
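A minimal sketch of what the switch could look like, assuming a fallback import so that Python 3.11+ uses the built-in `tomllib` and older versions use `tomli`. The `read_tool_section` helper is hypothetical and is not pydocstyle's actual code; it only illustrates opening the file in binary mode and picking out one `[tool.*]` table. The file content is the config from this report, written to a temporary directory.

```python
import tempfile
from pathlib import Path

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same load()/loads()

# The offending config from this report, as raw bytes.
PYPROJECT = rb"""
[tool.semantic_release]
version_pattern = [
    # regular expression to find version value in `_version.py` file
    '''src/pkg1/_version.py:__version__[ ]*[:=][ ]*["'](\d+\.\d+\.\d+)["']'''
]

[tool.pydocstyle]
convention = 'pep257'
"""


def read_tool_section(path, tool_name):
    """Hypothetical helper: return the [tool.<tool_name>] table, or {}."""
    # load() expects a binary file object, hence "rb".
    with open(path, "rb") as fp:
        data = tomllib.load(fp)
    return data.get("tool", {}).get(tool_name, {})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pyproject.toml"
    path.write_bytes(PYPROJECT)
    pydocstyle_cfg = read_tool_section(path, "pydocstyle")
    release_cfg = read_tool_section(path, "semantic_release")

print(pydocstyle_cfg)
print(release_cfg["version_pattern"][0])
```

Opening the file in text mode (`"r"`) and passing it to `load()` raises a `TypeError`, so any existing call site that opens the file with an encoding would need to change along with the import.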
Related: #599