Improve license URL detection #2679

Open
pombredanne opened this issue Aug 27, 2021 · 2 comments
Labels
enhancement, improve-license-detection, license scan, live-online-scan (anything that requires live, online network access and would not work in an isolated network)

Comments

@pombredanne
Member

Based on a report by @LeChasseur and @armijnhemel.
Many URLs reference licenses, and we already have a good number of them as detection rules.
These rules are currently flagged as "is_license_reference".

  1. It would be better to mark them as "is_license_url" so we can distinguish them from other license references.
  2. Since there is a very large number of possible license URLs, we could optimize a secondary index and matching technique just for these.
  3. Some of these may follow patterns, such as URLs based on https://github.com/remy/mit-license or license badges.
  • For these we may need a special way to handle patterns.
  4. Some of these contain extra information at the referenced URL page.
  • We do not want to live fetch them in scancode-toolkit, but tagging these with "is_license_url" means we could later have an optional step in scancode.io pipelines that follows the referenced URL, then fetches, saves and scans the page, including collecting copyrights and other useful information that may live there.
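
To make points 2 and 3 a bit more concrete, here is a minimal sketch of what a secondary URL index with pattern fallback could look like. This is not how scancode-toolkit works today: the `EXACT_URLS` and `URL_PATTERNS` tables, the `normalize_url` and `detect_license_url` helpers, and the shape of the returned dict are all hypothetical, and the assumption that pages served by remy/mit-license live under `mit-license.org` subdomains would need checking.

```python
import re
from urllib.parse import urlsplit

# Hypothetical sample index: normalized URL -> license key.
EXACT_URLS = {
    "opensource.org/licenses/mit": "mit",
    "apache.org/licenses/license-2.0": "apache-2.0",
    "gnu.org/licenses/gpl-3.0.html": "gpl-3.0",
}

# Hypothetical patterns for URL families that cannot be listed one by one.
# Assumption: remy/mit-license pages are served from <user>.mit-license.org.
URL_PATTERNS = [
    (re.compile(r"^([a-z0-9-]+\.)?mit-license\.org$"), "mit"),
]


def normalize_url(url):
    """Drop the scheme, a leading "www.", the trailing slash, query and case."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/").lower()
    return host + path


def detect_license_url(url):
    """Return a small match record for a license URL, or None if unknown."""
    key = normalize_url(url)
    # Fast path: exact lookup in the secondary index.
    if key in EXACT_URLS:
        return {"license": EXACT_URLS[key], "is_license_url": True, "matched_by": "exact"}
    # Slow path: pattern-based URL families.
    for pattern, license_key in URL_PATTERNS:
        if pattern.match(key):
            return {"license": license_key, "is_license_url": True, "matched_by": "pattern"}
    return None
```

Normalizing before lookup keeps the index small, since `http`/`https`, `www.` and trailing-slash variants of the same URL collapse to one key. Only URLs that miss the exact index pay for the regex pass.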
@pombredanne
Member Author

IMHO we could fetch these live during a scan, with a proper warning.

@pombredanne added the live-online-scan label on Jan 9, 2023
@pombredanne
Member Author

We could also fetch from the Internet Archive if we can guess the date of the scanned code. We have seen links change over time, with different licenses served at the same page, in particular (but not only) for the Oracle and Microsoft licenses, as well as the ubiquitous GPL and LGPL from the FSF and several licenses from the OSI.
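
As a rough sketch of the Internet Archive idea, the helpers below only build the URLs for a dated lookup and do no network access. The function names are hypothetical, and guessing `code_date` from the scanned code is left out entirely. The first URL targets the Wayback Machine availability endpoint, which reports the capture closest to a timestamp; the second is the usual `web.archive.org/web/<timestamp>/<url>` form, which redirects to the nearest capture.

```python
from datetime import date
from urllib.parse import urlencode


def wayback_timestamp(code_date):
    """Format a date as the YYYYMMDD timestamp used by the Wayback Machine."""
    return code_date.strftime("%Y%m%d")


def wayback_availability_query(url, code_date):
    """Build the availability query for the capture closest to code_date."""
    params = {"url": url, "timestamp": wayback_timestamp(code_date)}
    return "https://archive.org/wayback/available?" + urlencode(params)


def wayback_snapshot_url(url, code_date):
    """Build a snapshot URL; the archive redirects to the nearest capture."""
    return "https://web.archive.org/web/{}/{}".format(wayback_timestamp(code_date), url)


# Example with a made-up date for the scanned code.
guessed = date(2010, 6, 15)
print(wayback_availability_query("https://www.gnu.org/licenses/gpl.html", guessed))
print(wayback_snapshot_url("https://www.gnu.org/licenses/gpl.html", guessed))
```

Keeping URL construction separate from fetching means the toolkit could record the dated URL during a scan, and an optional online step could do the actual fetch later.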
