Skip to content

Conversation

@Liam-DeVoe
Copy link
Contributor

Closes #928.

I've tested this with the following hypothesis test:

Details
import lark
from hypothesis import given, settings
from hypothesis.extra.lark import from_lark
from packaging.metadata import licenses

# naming and order follows
# https://spdx.github.io/spdx-spec/v2.2.2/SPDX-license-expressions/
grammar = """
start: license_expression

idstring: (ALPHA | DIGIT | "-" | ".")+
# a few hardcoded test licenses and exceptions
license_id: "MIT"
          | "Apache-2.0"
          | "GPL-3.0"
          | "BSD-3-Clause"
          | "ISC"
license_exception_id: "LLVM-exception"
                    | "Classpath-exception-2.0"
                    | "GCC-exception-3.1"

# DocumentRef is a valid prefix according to
# https://spdx.github.io/spdx-spec/v2.2.2/SPDX-license-expressions/, but
# we don't implement it.
# license_ref: ("DocumentRef-" idstring ":")? "LicenseRef-" idstring
license_ref: "LicenseRef-" idstring

simple_expression: license_id
                 | license_id "+"
                 | license_ref
compound_expression: simple_expression
                   | simple_expression " WITH " license_exception_id
                   | compound_expression " AND " compound_expression
                   | compound_expression " OR " compound_expression
                   | "(" compound_expression ")"
license_expression: simple_expression | compound_expression

%import common.LETTER -> ALPHA
%import common.DIGIT
"""

parser = lark.Lark(grammar)

@given(from_lark(parser))
@settings(deadline=None)
def test_canonicalize_license_expression(expression):
    licenses.canonicalize_license_expression(expression)

I'm very happy to add this test to the suite if you think it's worth the hypothesis dependency.

Comment on lines +687 to +692
# we don't canonicalize redundant parens, instead leaving them as-is
# in the license expression.
("(MIT)", "(MIT)"),
("((MIT))", "((MIT))"),
("(( MIT ))", "((MIT))"),
("((MIT AND (MIT)))", "((MIT AND (MIT)))"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could make it so redundant parens are canonicalized, but we'd have to modify the parser to be a proper parser instead of deferring to python's compile.

Copy link
Contributor

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense as a first step; longer term I assume we should canonicalize parens?

@Liam-DeVoe
Copy link
Contributor Author

Liam-DeVoe commented Oct 8, 2025

I thought so at the time, but I'm actually not sure! Here's what PEP 639 has to say:

If tools choose to validate the SPDX expression, they also SHOULD store a case-normalized version of the License-Expression field using the reference case for each SPDX license identifier and uppercase for the AND, OR and WITH keywords.

which reads to me that parentheses should not be normalized. However, this might simply be an oversight.

If we do want to normalize parens, we'll have to decide what to do about technically-unnecessary parens for explicit operator precedence, for example a AND b OR c <-> (a AND b) OR c. Some might complain if this were normalized to the less-readable no-parens version, though it would certainly make the normalizing logic simpler.

@henryiii henryiii merged commit a1f7056 into pypa:main Oct 29, 2025
39 checks passed
@Liam-DeVoe Liam-DeVoe deleted the nested-parens branch October 29, 2025 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SPDX license specification doesn't work with nested parentheses

2 participants