bug/feature_request(urls): false positives for URLs #3677

Open
@Kristinita

Related issues — #676, #2473, #3211.

1. Summary

When I run Codespell on my project, I get 14 false positives for URLs. I believe the default Codespell behavior for URLs needs correction.

2. MCVE

2.1. Reproducibility

I reproduced the problem on Ubuntu and Windows.

  1. Ubuntu build on Travis CI
  2. Windows build on AppVeyor CI

[NOTE] The behavior differs between Ubuntu and Windows: by default, Codespell prints colored output to the console on Ubuntu and non-colored output on Windows.

2.2. Files

KiraURLs.txt — the file with false positives from my real project:

https://browsersl.ist/
https://archive.fo/JaBJt
https://magisteria.ru/autor/andrey_vinogradov
https://woerterbuchnetz.de/?sigle=DWB&lemid=U14452
https://www.pucp.edu.pe/profesor/patricia-gonzales-gil
https://zoon.ru/msk/p-doctor/marina_vladimirovna_kuznetsova/
https://persona.rin.ru/view/f/0/25814/kolesov-evgenij-nikolaevich
https://broker.ru/studing/seminar/303
https://www.wirtschaftsdienst.eu/autor/walter-a-s-koch.html
https://web.archive.org/web/20160306044838/https://xpomo.com/ruskolan/tolpa/demagog.htm
https://base.garant.ru/10180093/
https://web.archive.org/web/20210509023444/https://portal.slac.stanford.edu/sites/inc_public/Pages/folder-file-names.aspx
https://portal.slac.stanford.edu/sites/inc_public/Pages/folder-file-names.aspx
https://www.digitalo.cz/dokument/qBhi6rSjJ87tFJ6y/plny-text.pdf#page=102

codespell.cfg:

[codespell]

ignore-regex = https?://[^\s`"<>^\\{|}]+

[NOTE] The uri-regex key with the same value has no effect. It’s a possible bug.

2.3. Commands

codespell --config codespell.cfg KiraURLs.txt
codespell KiraURLs.txt

2.4. Behavior

2.4.1. Expected — my regex

No output. Codespell ignores all URLs.

2.4.2. Undesired (default Codespell behavior)

codespell KiraURLs.txt
KiraURLs.txt:1: ist ==> is, it, its, it's, sit, list
KiraURLs.txt:2: fo ==> of, for, to, do, go
KiraURLs.txt:3: autor ==> author
KiraURLs.txt:4: sigle ==> single, sigil
KiraURLs.txt:5: profesor ==> professor
KiraURLs.txt:6: zoon ==> zoom
KiraURLs.txt:7: rin ==> ring, rink, rind, rain, rein, ruin, grin
KiraURLs.txt:8: studing ==> studying
KiraURLs.txt:9: autor ==> author
KiraURLs.txt:10: demagog ==> demagogue
KiraURLs.txt:11: garant ==> guarantor, guardant
KiraURLs.txt:12: slac ==> slack
KiraURLs.txt:13: slac ==> slack
KiraURLs.txt:14: dokument ==> document
Command exited with code 65

Every line of the file produces a false positive.
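
The difference between the two runs can be illustrated with a toy checker. This is only a sketch and not Codespell's actual implementation: the `check` helper, the `\w+` tokenizer, and the three-entry dictionary (copied from the output above) are assumptions made for the demonstration. It assumes that an ignore regex simply hides the matched text from the checker.

```python
import re

# The regex from codespell.cfg in this issue.
IGNORE_RE = re.compile(r'https?://[^\s`"<>^\\{|}]+')

# Simplified tokenizer (an assumption, not Codespell's real word regex).
WORD_RE = re.compile(r"\w+")

# A few entries taken from the output above; the real dictionary is far larger.
MISSPELLINGS = {"ist": "is, it, its", "fo": "of, for, to", "autor": "author"}


def check(line, ignore_re=None):
    """Return the flagged words in a line, hiding ignore_re matches first."""
    if ignore_re is not None:
        line = ignore_re.sub(" ", line)
    return [w for w in WORD_RE.findall(line) if w.lower() in MISSPELLINGS]


lines = [
    "https://browsersl.ist/",
    "https://archive.fo/JaBJt",
    "https://magisteria.ru/autor/andrey_vinogradov",
]

without_regex = [check(line) for line in lines]
with_regex = [check(line, IGNORE_RE) for line in lines]

print(without_regex)  # [['ist'], ['fo'], ['autor']]
print(with_regex)     # [[], [], []]
```

Without the ignore regex, the URL is split into fragments and each fragment is checked as an ordinary word, which is where the false positives come from.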

3. Regular expression for URLs

For my project I use this regular expression for URLs:

https?://[^\s`"<>^\\{|}]+

Perhaps it will help other users.

3.1. Advantages of my regex

  1. This expression doesn’t correctly match all of the examples of Mathias Bynens, but it is sufficient for practical Codespell usage in my case.
  2. It’s simple, so it shouldn’t slow down Codespell much.

3.2. Explanation of my regex

It allows any Unicode characters in URLs except:

  1. \s: in Python regular expression syntax, the \s metacharacter “includes [ \t\n\r\f\v] and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages”.
  2. Other ASCII characters that are forbidden in URLs: ` " < > ^ \ { | }
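
The exclusions above can be checked with a short Python sketch that uses only the standard `re` module. The sample strings are hypothetical (the quoting, the angle brackets, the non-breaking space, and the Wikipedia URL are not from my project); they only show where a match stops.

```python
import re

# The URL regex from this issue.
URL_RE = re.compile(r'https?://[^\s`"<>^\\{|}]+')

# Hypothetical sample lines showing where a match ends.
samples = [
    'https://browsersl.ist/',
    '"https://archive.fo/JaBJt"',                        # stops at the quote
    '<https://base.garant.ru/10180093/>',                # stops at ">"
    'https://broker.ru/studing/seminar/303\u00a0next',   # stops at the NBSP
    'https://ru.wikipedia.org/wiki/Москва',              # non-ASCII is allowed
]

matches = [URL_RE.search(s).group(0) for s in samples]
for m in matches:
    print(m)
```

Each printed match is the bare URL, without the surrounding quote, bracket, or trailing text.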

4. Environment

  1. Operating system:

    1. Local — Microsoft Windows [Version 10.0.22621.3085]
    2. Travis CI — Ubuntu 24.04.2 LTS Noble Numbat
    3. AppVeyor CI — Microsoft Windows [Version 10.0.17763.6189]
  2. Python:

    1. Local — 3.13.2
    2. Travis CI — 3.13.1
    3. AppVeyor CI — 3.12.8
  3. Codespell 2.4.1

Thanks.
