Skip to content

Conversation

raulcd
Copy link
Member

@raulcd raulcd commented Sep 19, 2025

TBD

Rationale for this change

What changes are included in this PR?

  • Move check_labels and report_ci jobs to independent reusable workflows.
  • Update cpp_extra to use those.
  • Create new linux_packaging.yml workflow replicating work that was done on crossbow. Integrate that workflow with check_labels and report_ci
  • Update release binary submit and binary download to run workflow and download the artifacts from the job instead of from the crossbow repository.

Are these changes tested?

Some via CI on fork and some manual testing.

Are there any user-facing changes?

No

Copy link

⚠️ GitHub issue #47582 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Sep 19, 2025
Copy link
Member Author

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kou I've started working on this and I think I have a working base that could work. I wanted to discuss it with you before moving more packages to validate we are all ok with the approach.
Let me know your thoughts. I have a couple of questions to discuss too, let me know what you think on those too.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 19, 2025
uses: ./.github/workflows/check_labels.yml
secrets: inherit
with:
label: "CI: Extra"
Copy link
Member Author

@raulcd raulcd Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to move these labels to either:

  • CI Extra: cpp
  • CI Extra: CPP
  • CI: cpp
  • CI: CPP

So we can have the others the same like:

  • CI Extra: linux-packaging
  • CI Extra: Linux Packaging
  • CI: linux-packaging
  • CI: Linux Packaging

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like CI: Extra: C++/CI: Extra: Package: Linux or CI: cpp/CI: package: linux/CI: package: linux: apt.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 19, 2025
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

uses: ./.github/workflows/check_labels.yml
secrets: inherit
with:
label: "CI: Extra"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like CI: Extra: C++/CI: Extra: Package: Linux or CI: cpp/CI: package: linux/CI: package: linux: apt.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 19, 2025
@raulcd raulcd force-pushed the GH-47582 branch 3 times, most recently from 1121d15 to 78e0ff2 Compare September 22, 2025 10:35
Copy link
Member Author

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone over all the comments and I think this is ready for another review.

Comment on lines 1654 to 1655
# apache/arrow-adbc uses debian-bookworm. So the following
# glob must much both of them.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is adbc still using this binary tasks? in that case we probably want to maintain the previous behaviour for the task too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

But I can change ADBC to simplify this script.

# glob must much both of them.
Dir.glob("#{source_dir_prefix}*/*") do |path|
base_name = File.basename(path)
destination_prefix = "#{incoming_dir}/#{distribution}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou I've never done anything on ruby (I know, I should learn a little 😃 ). I've tested these tasks locally (apt:rc:copy and yum:rc:copy) with a small integration test calling rake directly and seems to be working as expected but I would appreciate some help on those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing Ruby!

Sure. I can help. Can I push something to this branch if it's needed?

BTW, destination_dir is better for this variable name because this is a directory name not a prefix of path. (In this code, "prefix" is used for abc for abcde directory and "dir" is used for abcde directory.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can push to this branch if needed! that's completely fine with me. I am trying to simplify the folder structure as you suggested so the binary tasks are even simpler. Thanks, it's good that I am learning a little bit more about our linux packages so I can help you maintain those in the future.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 23, 2025
Comment on lines 1654 to 1655
# apache/arrow-adbc uses debian-bookworm. So the following
# glob must much both of them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

But I can change ADBC to simplify this script.

run: |
set -ex
pushd dev/tasks/linux-packages
tar cvzf ${{ matrix.id }}.tar.gz */${TASK_NAMESPACE}/repositories
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's simplify paths more to simplify binary-task.rb:

Suggested change
tar cvzf ${{ matrix.id }}.tar.gz */${TASK_NAMESPACE}/repositories
mkdir -p artifacts
cp -a */${TASK_NAMESPACE}/repositories/* artifacts/
tar cvzf ${{ matrix.id }}.tar.gz -C artifacts .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to simplify the paths slightly more but I am finding quite difficult to understand how all tangles together. How do we generate the repositories from those rpms, how we merge these files with other existing files from the repo, how do we create the testing repositories, etcetera so I am reverting my latest changes. I am happy to leave it like this (current generated structure) and try to simplify on a future PR or if you want to guide me on how to do it or if you want to push to this branch.


# Wait for the GitHub Workflow that creates the Linux packages
# to finish before downloading the artifacts.
. "${SOURCE_DIR}/utils-watch-gh-workflow.sh" "${release_tag}" "package_linux.yml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need . here because all information is passed via arguments:

Suggested change
. "${SOURCE_DIR}/utils-watch-gh-workflow.sh" "${release_tag}" "package_linux.yml"
"${SOURCE_DIR}/utils-watch-gh-workflow.sh" "${release_tag}" "package_linux.yml"

# glob must much both of them.
Dir.glob("#{source_dir_prefix}*/*") do |path|
base_name = File.basename(path)
destination_prefix = "#{incoming_dir}/#{distribution}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing Ruby!

Sure. I can help. Can I push something to this branch if it's needed?

BTW, destination_dir is better for this variable name because this is a directory name not a prefix of path. (In this code, "prefix" is used for abc for abcde directory and "dir" is used for abcde directory.)

# Wait for the GitHub Workflow that creates the GitHub Release
# to finish before updating the release notes.
"${SOURCE_DIR}/utils-watch-gh-workflow.sh" "${TAG}" "${WORKFLOW}"
. "${SOURCE_DIR}/utils-watch-gh-workflow.sh" "${TAG}" "${WORKFLOW}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need . here.

Comment on lines 43 to 50
download_artifacts() {
RUN_ID=$(get_run_id)
echo "Downloading artitfacts for workflow with ID: ${RUN_ID}"
gh run download \
${RUN_ID} \
--repo "${REPOSITORY}" \
--dir "$1"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Defining this function in this file isn't good because this file name is watch-gh-workflow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could rename the file to utils-gh-workflow.sh but if we use release for the artifacts this won't be necessary.

# to finish before downloading the artifacts.
. "${SOURCE_DIR}/utils-watch-gh-workflow.sh" "${release_tag}" "package_linux.yml"

RUN_ID=$(get_run_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you want to use a function defined in utils-watch-gh-workflow.sh.


RUN_ID=$(get_run_id)
# Download the artifacts created by the package_linux.yml workflow
download_artifacts "${SOURCE_DIR}/../../packages/${CROSSBOW_JOB_ID}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using GitHub Release for RC instead of GitHub Actions artifacts? If we use GitHub Release, we don't need RUN_ID: gh release download apache-arrow-${VERSION}-rc${RC} ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you mean, for scheduled or PR/push use artifacts but if workflow_dispatch was used, which is trigger of RC, upload to release? I don't want to create a release every time the job is executed but that could work. In that case we might use artifacts to test and upload to release if tests are successful and job type was workflow_dispatch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to upload artifacts to GitHub Release only when apache-arrow-${VERSION}-rc${RC} tag is pushed. Can we use on.push.tags instead of workflow_dispatch for RC? (I can provide changes for the approach.)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 23, 2025
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed changes for the suggested workflow.

I'll fix CI failures and then test release workflow by pushing a RC tag to fork repository.

def detect_repo
detect_env("REPO")
def github_repository
ENV["GITHUB_REPOSITORY"] || "apache/arrow"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub Actions provides GITHUB_REPOSITORY environment variable by default. So this uses fork's GitHub Container registry on fork automatically.

repository="${repositories}/${distribution}/${distribution_version}"
rpm_dir="${repository}/${architecture}/Packages"
srpm_dir="${repository}/source/SRPMS"
srpm_dir="${repository}/Source/Packages"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

["almalinux", "8"],
["amazon-linux", "2023"],
["centos", "9-stream"],
["centos", "8-stream"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already dropped support for CentOS Stream 8: #46953

Comment on lines +1657 to 1660
Dir.glob("#{source_dir_prefix}*.tar.gz") do |tar_gz|
sh("tar", "xf", tar_gz, "-C", incoming_dir)
progress_reporter.advance
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use this for ADBC too but we need some changes in ADBC: apache/arrow-adbc#3510

# Trigger workflow when a tag whose name matches the pattern
# "apache-arrow-{MAJOR}.{MINOR}.{PATCH}-rc{RC_NUM}" is pushed.
- "apache-arrow-[0-9]+.[0-9]+.[0-9]+-rc[0-9]+"
- "apache-arrow-*-rc*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify and unify tag patterns.

Comment on lines 220 to 247
# We use latest .deb/.rpm of
# apache-arrow-apt-source/apache-arrow-release built for
# amd64 because they are architecture independent.
if [ "${ARCHITECTURE}" = "amd64" ]; then
if [ "${APT_TASK_NAMESPACE}" = "apt" ]; then
# Example: debian-bookworm-amd64 -> debian-bookworm
code_name="${ID%-*}"
# Example: debian-bookworm -> bookworm
code_name="${code_name#*-}"
# Create
# https://packages.apache.org/artifactory/arrow/${DISTRIBUTION}/apache-arrow-apt-source-latest-${code_name}.deb
# for easy to install.
cp -a \
"${DISTRIBUTION}/pool/${code_name}/a/apache-arrow-apt-source/*.deb \
"${DISTRIBUTION}/apache-arrow-apt-source-latest-${code_name}.deb"
else
# Example: amazon-linux-2023-amd64 -> amazon-linux-2023
version="${ID%-*}"
# Example: amazon-linux-2023 -> 2023
version="${version##*-}"
# Create
# https://packages.apache.org/artifactory/arrow/${DISTRIBUTION}/${version}/apache-arrow-release-latest.rpm
# for easy to install.
cp -a \
"${DISTRIBUTION}/${version}/x86_64/Packages/apache-arrow-release-*.rpm \
"${DISTRIBUTION}/${version}/apache-arrow-release-latest.rpm"
fi
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved from binary-task.rb.

We'll create latest .deb/.rpm here to simplify binary-task.rb.

if [ "${GITHUB_REF_TYPE}" = "tag" ]; then
# Example: apache-arrow-21.0.0-rc0 -> 21.0.0-rc0
version="${GITHUB_REF_NAME#apache-arrow-}"
echo "ARROW_VERSION=${version}" >> "${GITHUB_ENV}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use 21.0.0-rc0 not 21.0.0 here. The -rc0 part will be removed by

when /-((dev|rc)\d+)\z/
base_version = $PREMATCH
sub_version = $1
type = $2
if type == "rc" and options[:rc_build_type] == :release
@deb_upstream_version = base_version
@deb_archive_base_name_version = base_version
@rpm_version = base_version
@rpm_release = "1"
.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 2, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting change review Awaiting change review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants