Description
Related issues — #676, #2473, #3211.
1. Summary
When I run Codespell for my project, I get 14 false positives for URLs. I’m sure that the default Codespell behavior for URLs needs correction.
2. MCVE
2.1. Reproducibility
I reproduced the problem on Ubuntu and Windows.
[NOTE] Behavior differs between Ubuntu and Windows: by default, Codespell prints colored console output on Ubuntu and non-colored output on Windows.
2.2. Files
KiraURLs.txt, the file with false positives from my real project:
https://browsersl.ist/
https://archive.fo/JaBJt
https://magisteria.ru/autor/andrey_vinogradov
https://woerterbuchnetz.de/?sigle=DWB&lemid=U14452
https://www.pucp.edu.pe/profesor/patricia-gonzales-gil
https://zoon.ru/msk/p-doctor/marina_vladimirovna_kuznetsova/
https://persona.rin.ru/view/f/0/25814/kolesov-evgenij-nikolaevich
https://broker.ru/studing/seminar/303
https://www.wirtschaftsdienst.eu/autor/walter-a-s-koch.html
https://web.archive.org/web/20160306044838/https://xpomo.com/ruskolan/tolpa/demagog.htm
https://base.garant.ru/10180093/
https://web.archive.org/web/20210509023444/https://portal.slac.stanford.edu/sites/inc_public/Pages/folder-file-names.aspx
https://portal.slac.stanford.edu/sites/inc_public/Pages/folder-file-names.aspx
https://www.digitalo.cz/dokument/qBhi6rSjJ87tFJ6y/plny-text.pdf#page=102
codespell.cfg:
[codespell]
ignore-regex = https?://[^\s`"<>^\\{|}]+
[NOTE] The uri-regex key with the same value has no effect. It’s a possible bug.
2.3. Commands
codespell --config codespell.cfg KiraURLs.txt
codespell KiraURLs.txt
2.4. Behavior
2.4.1. Expected — my regex
No output. Codespell ignores all URLs.
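As a quick sanity check of this expectation, here is a minimal Python sketch that tests whether the regex consumes a few of the URLs from KiraURLs.txt in full. It only exercises the pattern with Python's `re` module; it does not call Codespell, and the names `URL_RE` and `unmatched` are mine.

```python
import re

# Same pattern as the ignore-regex value in codespell.cfg.
URL_RE = re.compile(r'https?://[^\s`"<>^\\{|}]+')

# A few lines taken from KiraURLs.txt.
urls = [
    "https://browsersl.ist/",
    "https://woerterbuchnetz.de/?sigle=DWB&lemid=U14452",
    "https://web.archive.org/web/20160306044838/https://xpomo.com/ruskolan/tolpa/demagog.htm",
    "https://www.digitalo.cz/dokument/qBhi6rSjJ87tFJ6y/plny-text.pdf#page=102",
]

# fullmatch succeeds only if the pattern consumes the entire string.
unmatched = [u for u in urls if URL_RE.fullmatch(u) is None]
print(unmatched)  # [] means every URL is covered by the pattern
```

If the list is empty, no part of these URLs is left outside the ignored span.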
2.4.2. Non-desired — default Codespell behavior
codespell KiraURLs.txt
KiraURLs.txt:1: ist ==> is, it, its, it's, sit, list
KiraURLs.txt:2: fo ==> of, for, to, do, go
KiraURLs.txt:3: autor ==> author
KiraURLs.txt:4: sigle ==> single, sigil
KiraURLs.txt:5: profesor ==> professor
KiraURLs.txt:6: zoon ==> zoom
KiraURLs.txt:7: rin ==> ring, rink, rind, rain, rein, ruin, grin
KiraURLs.txt:8: studing ==> studying
KiraURLs.txt:9: autor ==> author
KiraURLs.txt:10: demagog ==> demagogue
KiraURLs.txt:11: garant ==> guarantor, guardant
KiraURLs.txt:12: slac ==> slack
KiraURLs.txt:13: slac ==> slack
KiraURLs.txt:14: dokument ==> document
Command exited with code 65
False positives for each line.
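To illustrate why URL fragments end up flagged, here is a rough Python sketch. It is not Codespell's actual implementation: `WORD_RE` is a simplified word pattern that I assume for illustration, and blanking out URL matches with `sub` is only one possible way to model what an ignore regex achieves.

```python
import re

# Assumption: a simplified word pattern, not necessarily the one Codespell uses.
WORD_RE = re.compile(r"[\w\-']+")
# Same pattern as the ignore-regex value in codespell.cfg.
URL_RE = re.compile(r'https?://[^\s`"<>^\\{|}]+')

line = "https://browsersl.ist/"

# Without URL handling, every URL fragment becomes a candidate word.
print(WORD_RE.findall(line))  # ['https', 'browsersl', 'ist']

# Blanking out URL matches first leaves no candidate words to check.
print(WORD_RE.findall(URL_RE.sub(" ", line)))  # []
```

In the first case `ist` is a standalone token, which matches the suggestion reported for line 1 of KiraURLs.txt.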
3. Regular expression for URLs
For my project I use this regular expression for URLs:
https?://[^\s`"<>^\\{|}]+
Perhaps it will help other users.
3.1. Advantages of my regex
- This expression doesn’t correctly match all of Mathias Bynens’s test examples, but it is sufficient for practical Codespell usage in my case.
- It’s simple and shouldn’t slow down Codespell too much.
3.2. Explanation of my regex
It allows any Unicode characters in URLs except:
- \s: in Python regular expression syntax, the \s metacharacter “includes [ \t\n\r\f\v] and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages”.
- Other ASCII characters that are forbidden in URLs.
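The behavior of the character class can be checked with a short Python sketch. The sample text and the `example.org` URL are made up for illustration; only the pattern itself comes from my configuration.

```python
import re

# Same pattern as the ignore-regex value in codespell.cfg.
URL_RE = re.compile(r'https?://[^\s`"<>^\\{|}]+')

# For str patterns, \s also covers the non-breaking space (U+00A0),
# so it ends a match just like a regular space; a backtick ends it too.
text = "See https://base.garant.ru/10180093/\u00a0and `https://zoon.ru/msk/` here"
found = URL_RE.findall(text)
print(found)  # ['https://base.garant.ru/10180093/', 'https://zoon.ru/msk/']

# Non-ASCII characters are allowed inside the URL (hypothetical URL).
print(URL_RE.fullmatch("https://example.org/путь") is not None)  # True
```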
4. Environment
- Operating system:
  - Local: Microsoft Windows [Version 10.0.22621.3085]
  - Travis CI: Ubuntu 24.04.2 LTS Noble Numbat
  - AppVeyor CI: Microsoft Windows [Version 10.0.17763.6189]
- Python:
  - Local: 3.13.2
  - Travis CI: 3.13.1
  - AppVeyor CI: 3.12.8
- Codespell 2.4.1
Thanks.