Skip to content

feat: add reproducible central buildspec generation #1115

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

tromai
Copy link
Member

@tromai tromai commented Jul 2, 2025

Summary

This Pull Request adds a new command called gen-build-spec.

usage: macaron gen-build-spec [-h] -purl PACKAGE_URL --database DATABASE
                              [--output-format OUTPUT_FORMAT]

options:
  -h, --help            show this help message and exit
  -purl PACKAGE_URL, --package-url PACKAGE_URL
                        The PURL string of the software component to generate build
                        spec for.
  --database DATABASE   Path to the database.
  --output-format OUTPUT_FORMAT
                        The output format. Can be rc-buildspec (Reproducible-central
                        build spec) (default "rc-buildspec")

This command generates a buildspec, which contains the build related information for a PURL that Macaron has analyzed. The output file will be stored within output/<purl_based_path>/macaron.buildspec, where <purl_based_path> being the directory structure according to the input PackageURL.

An example

macaron analyze -purl pkg:maven/org.apache.hugegraph/computer-k8s@1.0.0
macaron gen-build-spec -purl pkg:maven/org.apache.hugegraph/computer-k8s@1.0.0 --database output/macaron.db

In this example, the final path to the output buildspec file is output/maven/org_apache_hugegraph/computer-k8s/macaron.buildspec

The content of output/maven/org_apache_hugegraph/computer-k8s/macaron.buildspec
, which uses the Reproducible Central buildspec format.

# Copyright (c) 2025, Oracle and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
# Generated by Macaron version 0.15.0

# Input PURL - pkg:maven/org.apache.hugegraph/computer-k8s@1.0.0
# Initial default JDK version 8 and default build command [['mvn', '-DskipTests=true', '-Dmaven.test.skip=true', '-Dmaven.site.skip=true', '-Drat.skip=true', '-Dmaven.javadoc.skip=true', 'clean', 'package']].

groupId=org.apache.hugegraph
artifactId=computer-k8s
version=1.0.0

gitRepo=https://github.com/apache/hugegraph-computer

gitTag=d2b95262091d6572cc12dcda57d89f9cd44ac88b

tool=mvn
jdk=8

newline=lf

command="mvn -DskipTests=true -Dmaven.test.skip=true -Dmaven.site.skip=true -Drat.skip=true -Dmaven.javadoc.skip=true clean package"

buildinfo=target/computer-k8s-1.0.0.buildinfo

This Buildspec ideally can be used directly as part of the Reproducible Central rebuild infrastructure.

Description of changes

Macaron database extractor

The first step to generate a buildspec is to extract the build related information from an existing Macaron SQLite database. The module macaron_db_extractor.py added in this commit does just that.

It uses sqlalchemy SELECT statement for ORM Mapped Classes to extract the data from the database into equivalent ORM Mapped instances that we defined in src/macaron/database/table_definitions.py for example.

Maven and Gradle CLI Command Parser

We use the build commands obtained in CI/CD configuration (e.g. from github action workflow yaml file) for the final buildspec. However, those build commands cannot be used as is and they requires some additional patching to work as a rebuild command.

A proper way to patch any maven and gradle CLI build command is to first parse is. The maven and gradle CLI command parsers added in this commit leverage Python's builtin argparse library.

CLI Build Command Patcher

The modules added in this commit uses the Maven and GRadle CLI Command Parser to parse and patch the build commands obtained from the Macaron database.

Jdk version finding from java Maven Central artifacts

Macaron can obtain the JDK version for a given build command obtained from CI/CD configuration. In some cases, the CI/CD configuration doesn't have enough information for us to obtain the JDK version. Therefore, we also rely on the JDK version included in META-INF/MANIFEST.MF in java artifacts from Maven Central https://repo1.maven.org/.

The module jdk_finder.py added in this commit help download the java artifacts from Maven Central given a maven type PURL, then returns the JDK version if it is available in META-INF/MANIFEST.MF.

In some cases, the JDK version string from META-INF/MANIFEST.MF don't only contain the JDK major version. For example:

Because Reproducible Central Buildspec requires only the major version of JDK, we need to extract that major version only. The jdk_version_normalizer.py module contains the logic to do just that. It is added this in commit.

Generating the Reproducible Central Buildspec

The two commits

use all components listed above to generate the final Reproducible Central Buildspec

Testing

This feature includes unit tests for all components used in RC Buildspec generation (e.g. CLI parsers, CLI patchers, JDK version finder, etc.)

For integration tests, a new script called compare_rc_build_spec.py is added to compare the result Buildspec in the integration tests.

Checklist

  • I have reviewed the contribution guide.
  • My PR title and commits follow the Conventional Commits convention.
  • My commits include the "Signed-off-by" line.
  • I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green verified label should appear next to all of your commits on GitHub.
  • I have updated the relevant documentation, if applicable.
  • I have tested my changes and verified they work as expected.
  • Come up with proper patching values for Reproducible Central format. Right now it's empty.
  • Added integration tests for the feature. Make sure to test on Docker image.
  • Added unit test for the reproducible central build spec generation format
  • Revise all remaining TODOs in the PR
  • Rebase the branch to properly split the changes into different commit, making it easier for review.
  • Update the api docs in the Sphinx documentation (could be in a different PR)
  • Add a tutorial on how to support generating build spec of a different format (e.g. support new build tool patching, etc.) (could also be in a different PR)
  • Generate the build spec to a PURL based path instead of output/macaron.buildspec (could also be in a different PR).
  • Add a test case for prioritizing JDK version obtained from Maven Central JAR than the one from look up build command
  • Consider removing the "extra_comments" part in the generated build spec ?

@tromai tromai self-assigned this Jul 2, 2025
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jul 2, 2025
@tromai tromai added feature A new feature request and removed OCA Verified All contributors have signed the Oracle Contributor Agreement. labels Jul 2, 2025
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jul 2, 2025
return True


def compare_rc_build_spec(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the logic of this function is taken from https://github.com/oracle/macaron/blob/6a712af1ebdcb435bd5b7199dc1b4f5473663090/tests/vsa/compare_vsa.py

I think the comparing functions within compare_vsa.py could be refactored out into a tests_util.py module so that all "compare" scripts used in integration tests (here) could use if needed.
Please let me know if this should be done in a subsequent PR or this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either is fine for me.

@tromai tromai force-pushed the tromai/add-build-spec-generation branch 2 times, most recently from 2547333 to 3014540 Compare July 9, 2025 12:25
@tromai tromai changed the title feat: add build spec generation feat: add reproducible central buildspec generation Jul 9, 2025
@tromai tromai force-pushed the tromai/add-build-spec-generation branch from 3014540 to da329fe Compare July 9, 2025 12:34
@tromai tromai marked this pull request as ready for review July 9, 2025 13:21
@tromai tromai requested a review from behnazh-w as a code owner July 9, 2025 13:21
pformat(patches),
)

group = purl.namespace
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Reproducible Central Buildspec file needs the GAV coordinate of the component. This will only make sense for an input maven type PackageURL.
At the moment, we accept any type of PackageURL. Please let me know if it makes sense to enforce the input PackageURL to be of type maven when we are generating a Reproducible Central Buildspec ? (Ideally each type of Buildspec format might have different requirements on the input PURL)

@tromai
Copy link
Member Author

tromai commented Jul 15, 2025

Because we are providing the path to the database from CLI argument, we need to support mounting this database file into the container file system too.

@behnazh-w
Copy link
Member

@tromai Instead of storing the build spec in output/macaron.buildspec, what do you think about using the standard PURL-based path? That would help prevent overwrites and improve archiving outputs.

@tromai
Copy link
Member Author

tromai commented Jul 21, 2025

@behnazh-w I only use output/macaron.buildspec as I didn't really know a better way to implement the output path for this feature. This feature is similar to the VSA generation, in the sense that it generates one file from an input PURL.

I can definitely change it to a PURL based path format. The only downside of using a PURL based path here is that it would be a bit challenging for scripting in a CI/CD environment, if that is the targeted use case (you might need to compute the PURL based path from outside of Macaron).

@behnazh-w
Copy link
Member

This feature is similar to the VSA generation, in the sense that it generates one file from an input PURL.

I think the VSA feature is a little different because it is generated by the policy engine that enforces a given policy, which is not applied to a PURL. The subject instead is specified in the policy itself.

I can definitely change it to a PURL based path format. The only downside of using a PURL based path here is that it would be a bit challenging for scripting in a CI/CD environment, if that is the targeted use case (you might need to compute the PURL based path from outside of Macaron).

Yes that's true, but I think overall it's less error-prone and we avoid over-writing previous results.

@tromai tromai added the build_spec_generation Related to the build spec generation feature. label Jul 21, 2025
@tromai tromai force-pushed the tromai/add-build-spec-generation branch 2 times, most recently from c928fb1 to 4a4ac5d Compare August 8, 2025 06:48
@tromai tromai force-pushed the tromai/add-build-spec-generation branch from 4a4ac5d to 0ff4d6c Compare August 17, 2025 23:50
tromai added 5 commits August 19, 2025 09:51
Signed-off-by: Trong Nhan Mai <trong.nhan.mai@oracle.com>
Signed-off-by: Trong Nhan Mai <trong.nhan.mai@oracle.com>
Signed-off-by: Trong Nhan Mai <trong.nhan.mai@oracle.com>
Signed-off-by: Trong Nhan Mai <trong.nhan.mai@oracle.com>
Signed-off-by: Trong Nhan Mai <trong.nhan.mai@oracle.com>


@dataclass
class MavenSystemPropeties(OptionDef[dict[str, str | None]]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class MavenSystemPropeties(OptionDef[dict[str, str | None]]):
class MavenSystemProperties(OptionDef[dict[str, str | None]]):

class MavenGoalPhase(OptionDef[list[str]]):
"""This option represents the positional goal/plugin-phase option in Maven CLI command.

argparse.Namespace stores this as a list of string. This is stored internally as a list of string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of this option would still be useful.

Comment on lines +282 to +287
# TODO: we need to confirm whether one can provide
# -P or -pl multiple times and the values will be aggregate into a list of string
# The current implementation only consider one instance of -P or -pl.
# Where to begin:
# https://github.com/apache/maven/blob/maven-3.9.x/maven-embedder/src/main/java/org/apache/maven/cli/CLIManager.java
# https://github.com/apache/commons-cli/blob/master/src/main/java/org/apache/commons/cli/Parser.java
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this comment should be moved down a bit to where the related options are.

class MavenCLICommandParser:
"""A Maven CLI Command Parser."""

ACCEPTABLE_EXECUTABLE = ["mvn", "mvnw"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ACCEPTABLE_EXECUTABLE = ["mvn", "mvnw"]
ACCEPTABLE_EXECUTABLE = {"mvn", "mvnw"}

def apply_patch(
self,
cli_command: MavenCLICommand,
options_patch: Mapping[str, MavenOptionPatchValueType | None],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
options_patch: Mapping[str, MavenOptionPatchValueType | None],
patch_options: Mapping[str, MavenOptionPatchValueType | None],

I think this is actually more appropriate than options_patch. Also applies to the Gradle counterpart.

Comment on lines +169 to +170
ext: JavaArtifactExt
The extension of the main artifact file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ext: JavaArtifactExt
The extension of the main artifact file.

return get_jdk_version_from_jar(local_artifact_path)


def find_jdk_version_from_central_maven_repo(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things for this function and the things it relies on:

  • The Java extension enum seems to only be used here, and contains only a single value. Maybe it should just be hardcoded? Or stored not as an enum?
  • The caching strategy also seems to be somewhat static; however I see some value of having that option.
  • The two find_jdk... functions are very similar. I would suggest having a single entry point that diverges based on the caching logic (assuming that's being kept), as it is mostly the download part that differs.

Comment on lines +58 to +59
>>> normalize_jdk_version("25.0.1")
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a missing example result here.

>>> normalize_jdk_version("25.0.1")
"""
first, _, after = jdk_version_str.partition(".")
jdk_major_ver = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
jdk_major_ver = None


"""This module contains the logic to nomarlize a JDK version string to a major version number."""

SUPPORTED_JAVA_VERSION = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be a set.

return result


def compile_sqlite_select_statement(select_statment: Select) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def compile_sqlite_select_statement(select_statment: Select) -> str:
def compile_sqlite_select_statement(select_statement: Select) -> str:

Comment on lines +178 to +179
purl_string : str
The PackageURL string to find the BuildToolFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be updated to match the function parameter.


Returns
-------
Select[tuple[BuildAsCodeFacts]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Select[tuple[BuildAsCodeFacts]]
Select[tuple[BuildToolFacts]]

Comment on lines +220 to +221
purl_string : str
The PackageURL string to find the BuildToolFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

Comment on lines +267 to +268
purl_string : str
The PackageURL string to find the BuildServiceFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

Comment on lines +314 to +315
purl_string : str
The PackageURL string to find the BuildScriptFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

Comment on lines +398 to +399
purl_string : str
The PackageURL string to look for the BuildToolFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

Comment on lines +433 to +434
purl_string : str
The PackageURL string to look for the BuildAsCodeFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

Comment on lines +468 to +469
purl_string : str
The PackageURL string to look for the BuildServiceFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

Comment on lines +503 to +504
purl_string : str
The PackageURL string to look for the BuildScriptFacts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated.

SBT = "sbt"


def format_build_command_infos(build_command_infos: list[GenericBuildCommandInfo]) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def format_build_command_infos(build_command_infos: list[GenericBuildCommandInfo]) -> str:
def format_build_command_info(build_command_info: list[GenericBuildCommandInfo]) -> str:

Comment on lines +184 to +191
if fact.language in {"java"}:
try:
macaron_build_tool_name = _MacaronBuildToolName(fact.build_tool_name)
except ValueError:
continue

# TODO: What happen if we report multiple build tools in the database.
return macaron_build_tool_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if fact.language in {"java"}:
try:
macaron_build_tool_name = _MacaronBuildToolName(fact.build_tool_name)
except ValueError:
continue
# TODO: What happen if we report multiple build tools in the database.
return macaron_build_tool_name
if fact.language != "java":
continue
try:
macaron_build_tool_name = _MacaronBuildToolName(fact.build_tool_name)
except ValueError:
continue
# TODO: What happen if we report multiple build tools in the database.
return macaron_build_tool_name

an error, or no build command is found from the database.
"""
try:
lookup_build_command_infos = lookup_any_build_command(component_id, session)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lookup_build_command_infos = lookup_any_build_command(component_id, session)
lookup_build_command_info = lookup_any_build_command(component_id, session)

Comment on lines +418 to +424
lookup_build_command_jdk = (
get_lookup_build_command_jdk(
lookup_build_command_info,
)
if lookup_build_command_info
else None
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should only be executed if the jdk_from_jar is None.

Comment on lines +450 to +455
if not patched_build_commands:
logger.error(
"Failed to patch command sequences %s.",
[selected_build_command],
)
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the JDK discovery can rely on the build information via lookup_build_command_info, I'd suggest moving all of the build command related code before the JDK related code. Up to and including the highlighted block here.
Also might be a good idea to extract some of this out to standalone functions. E.g. JDK, Patched Build Commands, etc.

Comment on lines +100 to +107
def test_comparing_gradle_cli_command_unequal_types(
gradle_cli_parser: GradleCLICommandParser,
command: str,
that: Any,
) -> None:
"""Test comparing MavenCLICommand with another incompatible type oject."""
this_command = gradle_cli_parser.parse(command.split())
assert not this_command == that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this testing? It seems to be comparing a GradleCLICommand object against a bool, list, and set.

The result object.
expected : dict[str, str]
The expected object.
compare_fn_map : str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
compare_fn_map : str
compare_fn_map : dict[str, CompareFn]


def extract_data_from_build_spec(build_spec_path: str) -> dict[str, str] | None:
"""Extract data from build spec."""
original_build_spec_content = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
original_build_spec_content = None

break

if effective_cli_parser:
expected_cli_commands.append(cli_parser.parse(cmd))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expected_cli_commands.append(cli_parser.parse(cmd))
expected_cli_commands.append(effective_cli_parser.parse(cmd))

assert not lookup_any_build_command(component_id=1, session=macaron_db_session)


def test_invalid_input_databse(invalid_db_session: Session) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_invalid_input_databse(invalid_db_session: Session) -> None:
def test_invalid_input_database(invalid_db_session: Session) -> None:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build_spec_generation Related to the build spec generation feature. feature A new feature request OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants