Copyright detection regression after implementing gibberish detection #4676

@JonoYang

Certain copyrights and authors are no longer detected because the gibberish detector identifies part of the surrounding text as gibberish:

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/copyrights/scilab-Scilab#L67

  • an instance of Scilab (c) INRIA-ENPC. was not detected
  • c) INRIA-ENPC. is identified as gibberish

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/copyrights/misco4/linux-copyrights/Documentation/networking/arcnet-hardware.txt#L32

  • Copyright Waterloo Microsystems Inc. 1985 was not detected
  • @Copyright is identified as gibberish

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/authors/trailing_date#L3C19-L3C59

  • Alexander Kanavin <alex.kanavin@gmail.com> was not detected
  • * : commit 3debe362faa62e5b381b880e3ba23aee07c85f6e Author: is detected as gibberish

Originally posted by @JonoYang in #4610 (comment)
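
The three cases above can be collected into a small regression table. This is only a sketch: `Case`, `CASES`, and `false_positives` are hypothetical names, and `is_gibberish` stands for whatever predicate the gibberish detector exposes (its real name and signature are not given in this issue, so it is passed in as a plain callable).

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    expected: str  # detection that went missing, as reported above
    flagged: str   # fragment reported as identified as gibberish


# Data copied from the three reports in this issue.
CASES = [
    Case("Scilab (c) INRIA-ENPC.", "c) INRIA-ENPC."),
    Case("Copyright Waterloo Microsystems Inc. 1985", "@Copyright"),
    Case(
        "Alexander Kanavin <alex.kanavin@gmail.com>",
        "* : commit 3debe362faa62e5b381b880e3ba23aee07c85f6e Author:",
    ),
]


def false_positives(
    cases: List[Case], is_gibberish: Callable[[str], bool]
) -> List[Case]:
    """Return the cases whose reported fragment the predicate still flags."""
    return [case for case in cases if is_gibberish(case.flagged)]


if __name__ == "__main__":
    # Stand-in predicate that flags everything, to show the report format.
    # Replace it with the real detector's predicate (assumption: it accepts
    # a string and returns a bool).
    for case in false_positives(CASES, lambda text: True):
        print(f"still flagged: {case.flagged!r} (missing: {case.expected!r})")
```

A fix would be expected to make `false_positives` return an empty list for these fragments while the detector keeps flagging real gibberish.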
