Proposal: Scan deduction and summarization #377

@pombredanne

Context

Scanning operates at the file level. This works well, but in many cases a scan reports too much data at too fine a level of detail. This happens when related clues are detected across several files or within the same file.

Problem

Multiple related clues in different files

For instance, if every file in a directory tree has the same license and copyright statements, then the license and origin information could be rolled up at the level of this directory and the file details could be omitted.
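A minimal sketch of this roll-up, assuming a scan has already been reduced to a mapping of file path to a `(license, copyright)` pair. The function name `roll_up_uniform` and the input shape are hypothetical and chosen for illustration only; they are not part of any existing scan output format.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def roll_up_uniform(files):
    """Return {directory: clue} for every directory whose files all share one clue.

    `files` maps a POSIX-style file path to a hashable clue, here a
    (license, copyright) tuple. This input shape is an assumption.
    """
    clues_by_dir = defaultdict(set)
    for path, clue in files.items():
        # Register the clue with every ancestor directory of the file.
        for parent in PurePosixPath(path).parents:
            clues_by_dir[str(parent)].add(clue)
    # A directory can be summarized only when exactly one distinct clue was seen.
    return {d: next(iter(clues)) for d, clues in clues_by_dir.items() if len(clues) == 1}


scan = {
    "src/a/x.c": ("mit", "Copyright A"),
    "src/a/y.c": ("mit", "Copyright A"),
    "src/b/z.c": ("gpl-2.0", "Copyright B"),
}
rolled_up = roll_up_uniform(scan)
```

Here `src/a` and `src/b` each get a single summary, while `src` and the root are left out because their files disagree.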

Or say that a scanned directory only contains a COPYING file with a license and notice and none of the files in that directory have a license or copyright. Then the license and origin information could be extended from the COPYING to all the files in that tree.
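One possible way to extend a notice file to its tree is sketched below. It assumes a mapping of file path to a detected license key, with `None` for files without any detection; the name `inherit_from_notice` and this data shape are hypothetical. Files that carry their own license are left untouched.

```python
from pathlib import PurePosixPath


def inherit_from_notice(files, notice_name="COPYING"):
    """Fill in missing licenses from the nearest ancestor notice file.

    `files` maps a POSIX-style path to a license key or None (assumed shape).
    """
    # Index the license of each notice file by the directory that holds it.
    notices = {
        str(PurePosixPath(path).parent): lic
        for path, lic in files.items()
        if PurePosixPath(path).name == notice_name and lic
    }
    resolved = {}
    for path, lic in files.items():
        if lic is None:
            # Walk up from the file: the closest directory with a notice wins.
            for parent in PurePosixPath(path).parents:
                if str(parent) in notices:
                    lic = notices[str(parent)]
                    break
        resolved[path] = lic
    return resolved


scan = {
    "lib/COPYING": "gpl-2.0",
    "lib/a.c": None,
    "lib/sub/b.c": None,
    "lib/d.c": "mit",
    "other/c.c": None,
}
resolved = inherit_from_notice(scan)
```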

Or say that a scanned directory only contains a README file with a license and notice and that all the files in that directory have a comment such as "See README for licensing". Then the license and origin information could be extended from the README to all the files in that tree that carry this comment.

Or say that a package is detected (such as a Maven JAR, an npm package, or similar), that the package-level metadata accurately describes the licensing of all the files in this package, and that the scan of those files does not bring new details. Then only the license and origin information from the package could be kept and the file details omitted.
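The package case could be approximated as a set comparison: file details are kept only when they add a license that the package metadata does not declare. This is a sketch under assumptions; `summarize_package`, the list of declared license keys, and the path-to-license mapping are all hypothetical.

```python
def summarize_package(declared, file_licenses):
    """Keep file-level details only where they add something new.

    `declared` is a list of license keys from package metadata and
    `file_licenses` maps file path to a license key or None (assumed shapes).
    """
    declared_set = set(declared)
    # Only files whose detected license is absent from the metadata are reported.
    extra_files = {
        path: lic
        for path, lic in file_licenses.items()
        if lic and lic not in declared_set
    }
    return {"licenses": sorted(declared_set), "files": extra_files}


consistent = summarize_package(
    ["apache-2.0"],
    {"pkg/index.js": "apache-2.0", "pkg/util.js": None},
)
mixed = summarize_package(
    ["apache-2.0"],
    {"pkg/index.js": "apache-2.0", "pkg/vendor/x.js": "mit"},
)
```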

Or say that a directory contains code in a mix of programming languages: the primary or main language or language stats could be rolled up at the directory level.
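As an illustration, language statistics for a directory could be computed from per-file language detections with a simple counter. The function name `language_stats` and the path-to-language mapping are assumptions; counting files rather than lines or bytes is also just one possible choice.

```python
from collections import Counter


def language_stats(file_languages):
    """Return (primary_language, {language: share_of_files}).

    `file_languages` maps file path to a language name or None (assumed shape).
    """
    counts = Counter(lang for lang in file_languages.values() if lang)
    total = sum(counts.values())
    primary = counts.most_common(1)[0][0] if counts else None
    shares = {lang: n / total for lang, n in counts.items()}
    return primary, shares


primary, shares = language_stats(
    {"a.py": "Python", "b.py": "Python", "c.c": "C", "notes.txt": None}
)
```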

Or say that a directory contains both code and build scripts and that the license for the build scripts is different from that of the code (say this is some autotools MIT or FSF notice). Then the licenses for the directory could be summarized based on a classification of the code files, and the build scripts and the build script licenses would not be reported as the directory or package license.
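A sketch of such a classification-aware summary follows. It assumes each file has already been tagged with a kind such as `"code"` or `"build"`; the function name `directory_license`, the tuple layout, and the example license keys are hypothetical.

```python
from collections import Counter


def directory_license(files, primary_kind="code"):
    """Summarize directory licenses from files of one kind only.

    `files` is a list of (path, kind, license) tuples (assumed shape).
    Licenses are returned most frequent first.
    """
    counts = Counter(
        lic for _path, kind, lic in files if kind == primary_kind and lic
    )
    return [lic for lic, _count in counts.most_common()]


scan = [
    ("src/main.c", "code", "apache-2.0"),
    ("src/util.c", "code", "apache-2.0"),
    ("configure", "build", "fsf-unlimited"),
    ("install-sh", "build", "mit"),
]
summary = directory_license(scan)
build_summary = directory_license(scan, primary_kind="build")
```

The build script licenses remain available through a separate call, so they are not lost, only kept out of the directory-level summary.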

Multiple related clues in the same file

Some scans operate on the same data in a given file. This may trigger the reporting of extra or spurious clues that should instead be considered together.

For instance, a license text may contain a copyright statement for the text of the license itself, as well as URLs and emails. Detecting licenses, copyrights, emails and URLs could then report four different clues for the same scanned file and text region, when a single clue for the license should be reported instead.

Or a package metadata file typically contains origin and license information, and these would end up reported twice: once as package attributes and once as individual detections for license, copyright and URLs.
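One way to collapse such clues is to drop any non-license clue whose line range falls entirely inside a detected license text. The sketch below assumes each clue carries a kind and a line range; the `Clue` class and `filter_contained` function are hypothetical. It also assumes the license clue is a full license text rather than a short notice, since a copyright statement next to a short notice is usually a real copyright that should be kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clue:
    kind: str        # "license", "copyright", "url" or "email"
    start_line: int
    end_line: int
    value: str


def filter_contained(clues):
    """Drop non-license clues that sit entirely inside a license text region."""
    licenses = [c for c in clues if c.kind == "license"]

    def inside_license(clue):
        return any(
            lic.start_line <= clue.start_line and clue.end_line <= lic.end_line
            for lic in licenses
        )

    return [c for c in clues if c.kind == "license" or not inside_license(c)]


clues = [
    Clue("license", 1, 200, "gpl-2.0"),
    Clue("copyright", 3, 3, "Copyright (C) Free Software Foundation, Inc."),
    Clue("url", 10, 10, "https://example.org/licenses/"),
    Clue("email", 250, 250, "dev@example.org"),
]
kept = filter_contained(clues)
```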

Solution elements

A comprehensive solution may cover some or all of these:

  • determine where to summarize and roll up clues. For instance, rolling everything up to the root directory level would rarely make sense; instead, rolling things up at a package level and finding a good directory level to use as a break point would be important
  • implement some classification of files such as code proper, test code, build scripts, etc.
  • implement some statistics, rules and/or machine learning to summarize and deduce the proper higher-level origin and license.
  • scan all the clues together in order to combine (and filter) them properly
  • combine package detection with license and copyright detection
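
The file classification element in the list above could start from simple path heuristics. The sketch below is only a rule-based starting point under assumptions: the function name `classify`, the category names, and the file name and extension sets are all hypothetical and far from complete.

```python
from pathlib import PurePosixPath

# Illustrative, incomplete sets (assumptions, not an established taxonomy).
BUILD_NAMES = {"makefile", "configure", "configure.ac", "cmakelists.txt", "install-sh"}
BUILD_SUFFIXES = {".m4", ".cmake"}
TEST_DIRS = {"test", "tests", "testing"}
DOC_NAMES = {"readme", "copying", "license", "notice"}
DOC_SUFFIXES = {".md", ".txt", ".rst"}
CODE_SUFFIXES = {".c", ".h", ".py", ".java", ".js"}


def classify(path):
    """Classify a file as "build", "test", "doc", "code" or "other" from its path."""
    p = PurePosixPath(path.lower())
    if p.name in BUILD_NAMES or p.suffix in BUILD_SUFFIXES:
        return "build"
    # Any parent directory named like a test directory marks the file as test code.
    if TEST_DIRS & set(p.parts[:-1]) or p.name.startswith("test_"):
        return "test"
    if p.stem in DOC_NAMES or p.suffix in DOC_SUFFIXES:
        return "doc"
    if p.suffix in CODE_SUFFIXES:
        return "code"
    return "other"


kinds = {path: classify(path) for path in ["Makefile", "src/tests/test_a.py", "src/main.c", "README.md", "data.bin"]}
```

The resulting categories could then feed a summarization step that weighs code files differently from build scripts or tests.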
