Description
Syft License Revamp
Syft currently represents licenses with different data types depending on where they appear in the schema:
AlpmMetadata: string (required)
ApkMetadata: string (required)
GemMetadata: []string (optional)
NpmPackageJSONMetadata: []string (required)
Package: []string (required)
PhpComposerJSONMetadata: []string (optional)
PythonPackageMetadata: string (required)
RpmMetadata: string (required)
Specifically, the package []string construct has proven limited in how it can represent data to a user interested in license compliance. Many packages now use SPDX license IDs to communicate FOSS license information. Because some of these expressions are compound, these identifiers are currently incompatible with how we represent licenses. Example:
// SPDX-License-Identifier: Apache-2.0 AND (MIT OR GPL-2.0-only)
NOTE FROM COMMUNITY MEETING:
- String will not be deprecated, but the AST (abstract syntax tree) will possibly be the preferred representation
The SPDX expression above shows a case where the consumer of the software must comply with Apache-2.0 and, in addition, with one of the following: MIT or GPL-2.0-only.
The file is subject to both the Apache-2.0 license, and at the licensee’s choice either the MIT license or version 2.0 only of the GPL.
The licensee may choose between MIT and GPL-2.0.
Whichever they choose, they must comply with both that license and Apache-2.0.
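To make the AST idea from the community note concrete, here is a minimal sketch of parsing such an expression into a tree. It is not syft's implementation: the `Node` type, `Parse`, and the prefix rendering are hypothetical names chosen for illustration, and the sketch only handles `AND`, `OR`, and parentheses (no `WITH` or `+`), with `AND` binding tighter than `OR`.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical AST node: a leaf license ID or an AND/OR operator.
type Node struct {
	Op       string  // "AND", "OR", or "" for a leaf
	License  string  // set only on leaves
	Children []*Node // operands, set only on operators
}

// String renders the tree in prefix form, e.g. AND(A, OR(B, C)).
func (n *Node) String() string {
	if n.Op == "" {
		return n.License
	}
	parts := make([]string, len(n.Children))
	for i, c := range n.Children {
		parts[i] = c.String()
	}
	return n.Op + "(" + strings.Join(parts, ", ") + ")"
}

type parser struct {
	toks []string
	pos  int
}

// peek returns the current token, or "" at the end of input.
func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

// parseBinary parses a left-associative chain of op, using next for operands.
func (p *parser) parseBinary(op string, next func() (*Node, error)) (*Node, error) {
	left, err := next()
	if err != nil {
		return nil, err
	}
	for p.peek() == op {
		p.pos++
		right, err := next()
		if err != nil {
			return nil, err
		}
		left = &Node{Op: op, Children: []*Node{left, right}}
	}
	return left, nil
}

// parseOr handles the lowest-precedence operator; parseAnd binds tighter.
func (p *parser) parseOr() (*Node, error)  { return p.parseBinary("OR", p.parseAnd) }
func (p *parser) parseAnd() (*Node, error) { return p.parseBinary("AND", p.parseAtom) }

// parseAtom parses a license ID or a parenthesized sub-expression.
func (p *parser) parseAtom() (*Node, error) {
	tok := p.peek()
	switch tok {
	case "":
		return nil, fmt.Errorf("unexpected end of expression")
	case ")", "AND", "OR":
		return nil, fmt.Errorf("unexpected token %q", tok)
	case "(":
		p.pos++
		n, err := p.parseOr()
		if err != nil {
			return nil, err
		}
		if p.peek() != ")" {
			return nil, fmt.Errorf("expected ')' at token %d", p.pos)
		}
		p.pos++
		return n, nil
	}
	p.pos++
	return &Node{License: tok}, nil
}

// Parse tokenizes expr on whitespace and parentheses and builds the AST.
func Parse(expr string) (*Node, error) {
	expr = strings.ReplaceAll(expr, "(", " ( ")
	expr = strings.ReplaceAll(expr, ")", " ) ")
	p := &parser{toks: strings.Fields(expr)}
	n, err := p.parseOr()
	if err != nil {
		return nil, err
	}
	if p.pos != len(p.toks) {
		return nil, fmt.Errorf("unexpected token %q", p.peek())
	}
	return n, nil
}

func main() {
	ast, err := Parse("Apache-2.0 AND (MIT OR GPL-2.0-only)")
	if err != nil {
		panic(err)
	}
	fmt.Println(ast) // prints: AND(Apache-2.0, OR(MIT, GPL-2.0-only))
}
```

A tree like this keeps the "both Apache-2.0 and one of MIT or GPL-2.0-only" structure that a flat []string loses.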
Furthermore, syft's current license format cannot represent the distinction between DECLARED and CONCLUDED licenses. The SPDX format lets implementers choose whether a license belongs in the concluded license field or the declared license field:
Concluded
TODO: Update this description based on feedback from community meeting
Contain the license the SPDX document creator has concluded as governing the package or alternative values, if the governing license cannot be determined.
If the Concluded License is not the same as the Declared License (7.15), a written explanation should be provided in the Comments on License field (7.16). With respect to NOASSERTION, a written explanation in the Comments on License field (7.16) is preferred. If the Concluded License field is not present in a package, it implies an equivalent meaning to NOASSERTION.
Declared
List the licenses that have been declared by the authors of the package. Any license information that does not originate from the package authors, e.g. license information from a third-party repository, should not be included in this field.
Syft's approach
Syft should change the license representation from []string to []License in order to convey the above information more clearly. The following structs will be added in place of string to give downstream tooling more options for accurately reading how each license was determined during a syft run:
type Licenses struct {
	SPDXExpression string    // expression used to derive the licenses below
	Licenses       []License // licenses and their given metadata
}
type License struct {
	Name       string
	Location   Location
	Concluded  bool    // if false, can we assume declared? NOTE: update this from meeting notes about when a license should be marked concluded
	Confidence float64 // float rather than int, since the sample below uses 0.92
	Offset     int
	Extent     int
}
type Location struct {
	Path    string
	LayerID string
}
Here is a sample of the JSON representation of the above:
{
  "spdxLicenseExpression": "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)",
  "licenses": [
    {
      "name": "LGPL",
      "location": {
        "path": "/lib/apk/db/installed",
        "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
      },
      "concluded": true,
      "confidence": 0.92,
      "offset": 0,
      "extent": 23829
    }
  ]
}
When a license is successfully concluded, the above uses the Google license classifier to assess the license packaged with the software. It provides the confidence level (how closely the location's contents match an entry in some source DB), the offset (how far into the file the match was found), and the extent (how long the match was).
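A downstream tool might use these three values roughly as sketched below: filter matches by a confidence threshold, then use offset and extent to pull the matched text out of the file. The `Match` type and both helpers are hypothetical, and the sketch assumes offset and extent are byte counts into the file contents, which the proposal does not state.

```go
package main

import "fmt"

// Match is a hypothetical stand-in for one classifier result.
type Match struct {
	Name       string
	Confidence float64 // assumed to lie in [0, 1]
	Offset     int     // assumed: byte offset of the match within the file
	Extent     int     // assumed: length of the match in bytes
}

// matchedText returns the bytes covered by m, or an error if the span
// falls outside content.
func matchedText(content []byte, m Match) ([]byte, error) {
	end := m.Offset + m.Extent
	if m.Offset < 0 || m.Extent < 0 || end > len(content) {
		return nil, fmt.Errorf("match [%d, %d) is outside content of length %d",
			m.Offset, end, len(content))
	}
	return content[m.Offset:end], nil
}

// aboveThreshold keeps only matches whose confidence is at least threshold.
func aboveThreshold(matches []Match, threshold float64) []Match {
	var kept []Match
	for _, m := range matches {
		if m.Confidence >= threshold {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	content := []byte("header\nMIT License text\nfooter")
	matches := []Match{
		{Name: "MIT", Confidence: 0.92, Offset: 7, Extent: 11},
		{Name: "BSD-3-Clause", Confidence: 0.40, Offset: 0, Extent: 6},
	}
	for _, m := range aboveThreshold(matches, 0.9) {
		text, err := matchedText(content, m)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %q\n", m.Name, text) // prints: MIT: "MIT License"
	}
}
```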
Why is this needed:
This enhancement is needed so that syft can better represent the intent of SPDX license expressions, show more data about where a concluded license was found, and give downstream tools that use SBOMs for license compliance more accuracy when assessing license contents against the policies they create.