conda crawler implementation #532

Merged: 17 commits, May 2, 2024

@lamarrr lamarrr commented Nov 15, 2023

closes #535

This merge request tracks the work on implementing the crawler for conda source packages.

Background

Conda exposes packages in a different format from other Python repositories such as PyPI. A conda environment is locked to a specific Python version.
Conda packages are locked to specific versions for a given version of the channel. Because the packages are managed for compatibility, they do not break one another, similar to how you would ship a Docker container.
The primary consumption point is the packages themselves. They are accompanied by scripts that modify the environment and set up the packages and dependencies, and these are consumed by the setup application. Packages may also contain DLLs, scripts, compiled Python bytecode (.pyc), Python code, etc.
The structure of conda repositories and their indexing process is described here: https://docs.conda.io/projects/conda-build/en/stable/concepts/generating-index.html

Conda has three main channels: anaconda-main, anaconda-r, and conda-forge, which is more geared toward business uses.

We crawl both the packages and the source code (not always specified) for the licensing metadata and other metadata about the package.

The source from which a conda package is built is often, but not always, provided via a URL that links to a compressed source file hosted externally, sometimes on GitHub and sometimes on another website. Note that this is a file and not a git repository.
The main conda package is hosted on the conda channels themselves. It is compressed and contains the licensing information, compilers, environment configuration scripts, dependencies, etc. that are needed to make the package work.

The crawler uses coordinates with the following syntax:

type: conda | condasource
provider: conda-forge | anaconda-main | anaconda-r
namespace: ${architecture}
name: any
revision:  (${version} |  )-(${buildversion} |  )

e.g.

conda/conda-forge/linux-aarch64/numpy/1.13.0
condasource/conda-forge/linux-aarch64/numpy/1.13.0
conda/conda-forge/-/numpy/1.13.0/
conda/conda-forge/linux-aarch64/numpy/-py36

where
type (required): conda or condasource
namespace (optional): architecture and OS of the package to be crawled, e.g. win-64, linux-aarch64. If no architecture is specified, any architecture is chosen.
package name: name of the package
provider (required): channel from which the package will be crawled: conda-forge, anaconda-main, or anaconda-r
revision (optional): package version and optional build version, e.g. 0.3.0 or 0.3.0-py36hffe2fc. For the conda coordinate type, the build version is usually a conda-specific representation of the build tools, environment configuration, and build iteration of the package; e.g. for a Python 3.9 environment, this could be py39H443E.
If no revision is specified, the latest one is selected using the package's timestamp.
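
The rules above can be sketched roughly as follows. This is only an illustration: `parseCondaCoordinate` and `pickLatest` are hypothetical names rather than functions from the crawler, and the shape of the candidate records (a numeric `timestamp` field) is an assumption.

```javascript
// Hypothetical helpers; the names are illustrative and not taken from the crawler.

// Split "type/provider/namespace/name/revision" into its parts.
// "-" in the namespace slot and a missing revision both mean "unspecified".
function parseCondaCoordinate(coordinate) {
  const [type, provider, namespace, name, revision = ''] = coordinate
    .replace(/\/+$/, '') // tolerate a trailing slash
    .split('/')
  if (type !== 'conda' && type !== 'condasource') {
    throw new Error(`unsupported type: ${type}`)
  }
  // The revision is "<version>-<build>", where either half may be empty.
  const dash = revision.indexOf('-')
  const version = dash === -1 ? revision : revision.slice(0, dash)
  const build = dash === -1 ? '' : revision.slice(dash + 1)
  return {
    type,
    provider,
    architecture: namespace === '-' ? null : namespace,
    name,
    version: version || null,
    build: build || null
  }
}

// When no revision is given, pick the candidate with the greatest timestamp.
// Assumes each candidate carries a numeric `timestamp`; returns null for an empty list.
function pickLatest(candidates) {
  return candidates.reduce(
    (latest, pkg) => (latest === null || pkg.timestamp > latest.timestamp ? pkg : latest),
    null
  )
}

console.log(parseCondaCoordinate('conda/conda-forge/linux-aarch64/numpy/-py36'))
```

A `null` field here stands for "not specified", which is where the "any architecture" and "latest by timestamp" fallbacks would apply.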

Conda-forge is a community effort, and packages are published by opening PRs on its GitHub repository, as described here: https://conda-forge.org/docs/maintainer/adding_pkgs.html

@lamarrr lamarrr marked this pull request as ready for review November 20, 2023 11:34
@lamarrr lamarrr force-pushed the conda-support branch 3 times, most recently from 308236f to 55a77d6 Compare November 29, 2023 15:26
@lamarrr lamarrr force-pushed the conda-support branch 2 times, most recently from 2685fe6 to 29df6dd Compare December 5, 2023 17:21
update

fixed license extraction process and implemented top packages extraction

fixed CI warnings

fixed ci warnings

fixed fetcher handle check

finished top command implementation

fixed unused variable error

fixed log info

handled unspecified revision

bug fixes

fixed lint errors

update

update

implemented fetch tests

updated fixtures hashes

fixed date format

fixed license declaration test

finished

added extract tests

update

update

fixed test

fixed test

update

update

update

update

update

update
@qtomlinson
Collaborator

qtomlinson commented Jan 4, 2024

It is exciting to see that a new harvester is being implemented! This pull request provides a solid foundation for future enhancements. A discussion is needed on the proposed coordinates, e.g. conda/conda-forge/-/numpy/1.13.0_linux-aarch64/py36. In the above proposal, toolVersion is used to represent the build (string) in conda. Points to consider:

  • toolVersion is already used internally by ClearlyDefined as the harvest tool versioning. Using spec.toolVersion will cause conflicts.
  • Adding /py36 adds complexity to service APIs. For instance, harvest data api expects
    /harvest/{type}/{provider}/{namespace}/{name}/{revision}/{tool}.
  • Concatenating version and architecture with _ does not handle versions that contain _, e.g. "version": "1.30.0_2018_09_30"
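
A small sketch of the last two points, assuming the service assigns path segments to the six slots purely by position. `readHarvestPath` and `splitOnFirstUnderscore` are hypothetical helpers written for this illustration, not service code.

```javascript
// The route shape quoted above: six positional slots after /harvest.
const HARVEST_SLOTS = ['type', 'provider', 'namespace', 'name', 'revision', 'tool']

// Assign path segments to slots by position, as a positional route would.
function readHarvestPath(path) {
  const segments = path.split('/').filter(Boolean).slice(1) // drop "harvest"
  return Object.fromEntries(HARVEST_SLOTS.map((slot, i) => [slot, segments[i]]))
}

// The extra "/py36" segment ends up in the slot reserved for the harvest tool.
const withBuild = readHarvestPath('/harvest/conda/conda-forge/-/numpy/1.13.0_linux-aarch64/py36')
console.log(withBuild.tool) // 'py36'

// Naively splitting "<version>_<architecture>" on the first "_" breaks
// as soon as the version itself contains "_".
function splitOnFirstUnderscore(revision) {
  const i = revision.indexOf('_')
  return { version: revision.slice(0, i), architecture: revision.slice(i + 1) }
}

const ambiguous = splitOnFirstUnderscore('1.30.0_2018_09_30_linux-aarch64')
console.log(ambiguous.version) // '1.30.0', not '1.30.0_2018_09_30'
```

Splitting on the last "_" instead would only work if architecture names never contain one, which is the kind of hidden assumption a dedicated separator avoids.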

One possible alternative is to mirror the package search standard specification in the CD's coordinates.
[Image: conda package search specification]

The mapping would be as follows:

channel -> provider
subdir -> namespace
name -> name
`${version}-${build}` -> revision 

Both version and build may not contain "-" (see https://conda.io/projects/conda/en/latest/user-guide/concepts/pkg-specs.html#info-index-json). So using "-" as separator works here.
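
The mapping could look roughly like this. `toCoordinates` and `splitRevision` are hypothetical names, and the build string `py36_0` is made up for the example.

```javascript
// Hypothetical helper: map a conda package record onto coordinates using the
// proposed mapping (channel -> provider, subdir -> namespace, name -> name,
// `${version}-${build}` -> revision).
function toCoordinates({ channel, subdir, name, version, build }) {
  return ['conda', channel, subdir, name, `${version}-${build}`].join('/')
}

// If neither version nor build contains "-", the first "-" in the revision
// separates them, even when either part contains "_".
function splitRevision(revision) {
  const i = revision.indexOf('-')
  if (i === -1) return { version: revision, build: null }
  return { version: revision.slice(0, i), build: revision.slice(i + 1) }
}

console.log(
  toCoordinates({
    channel: 'conda-forge',
    subdir: 'linux-aarch64',
    name: 'numpy',
    version: '1.13.0',
    build: 'py36_0' // illustrative build string
  })
)
console.log(splitRevision('1.30.0_2018_09_30-py36_0'))
```

With "-" as the separator, the underscore-laden version from the earlier bullet round-trips without any special handling.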

Additionally, when the architecture/platform is not specified, should 'noarch' be considered the default value (https://docs.anaconda.com/anaconda-repository/user-guide/tasks/pkgs/use-noarch-pkgs/)? This would likely produce more predictable results.

@elrayle @capfei @jeffwilcox @mpcen @pombredanne @bduranc I am not particularly knowledgeable in the conda ecosystem, and would very much appreciate other experts' input.

@bduranc

bduranc commented Jan 4, 2024

I'm okay with both proposals. Qing's is in line with Conda's own standards, which I assume are applicable to all three channels.

I also assume the type coordinate in both proposals would still be "conda" or "condasource". Also, if there is a GH or GitLab source location present in the channel / package metadata, is the intent also to populate the same in the definition?

In the past, I believe we've tried our best to keep new provider coordinate formats consistent with the others (for example, when we added Debian/debsrc support back a few years ago, I believe that took some influence from the Maven implementation). It would be best to follow that practice as much as we can here too.

@lamarrr
Author

lamarrr commented Jan 5, 2024

toolVersion is already used internally by ClearlyDefined as the harvest tool versioning. Using spec.toolVersion will cause conflicts.

I had the impression that toolVersion referred to the tool the package was built with, not to the tooling that scans the licensing info. I feel it would have been less ambiguous if it were at the beginning of the package coordinate.

Adding /py36 adds complexity to service APIs. For instance, harvest data api expects
/harvest/{type}/{provider}/{namespace}/{name}/{revision}/{tool}.

Agreed. I'll make it an optional parameter appended to the revision instead.

Concatenating version and architecture with _ does not handle versions that contain _, e.g. "version": "1.30.0_2018_09_30"

Agreed, but I couldn't find a better delimiter to use. I ran a regex search on some of the channels and none of them had that kind of versioning (with '_' in the version). It's always semantic versioning (numbers and hyphens only, with alpha/beta; https://semver.org/).

The new revision can be {architecture}--{version}-{build}, with a double hyphen after the architecture because the architecture itself can contain a hyphen (e.g. linux-64).

I feel noarch isn't the right default. noarch is for platform-agnostic packages, which may or may not be present. For example, 7zip isn't platform-agnostic; it is architecture- and OS-dependent, so it is not in the noarch list, and fetching it without specifying the architecture would fail.
We presently select randomly from the architectures the package is available on (just as the Debian fetcher does), which I feel is much more reliable than using noarch by default. It might be better to make the subdir (architecture and OS) required than to default to noarch.

As an aside, subdir isn't really a namespace; it's just a folder grouping of the packages by architecture and OS (e.g. Linux x64 packages -> /linux-64, Windows x64 packages -> /win-64).
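
To make the difference between the two defaults concrete, here is a minimal sketch. The availability table is made-up data, the function names are hypothetical, and the random source is injectable only so the behaviour can be checked deterministically.

```javascript
// Made-up availability data: package name -> subdirs that carry it.
const availability = {
  '7zip': ['linux-64', 'win-64'],
  'pure-python-example': ['noarch']
}

// Current behaviour as described: honour a requested subdir,
// otherwise pick any subdir the package is available on.
function chooseSubdir(table, name, requested, random = Math.random) {
  const subdirs = table[name] || []
  if (requested) return subdirs.includes(requested) ? requested : null
  if (subdirs.length === 0) return null
  return subdirs[Math.floor(random() * subdirs.length)]
}

// Alternative under discussion: fall back to "noarch" when nothing is requested.
function chooseSubdirNoarchDefault(table, name, requested) {
  return chooseSubdir(table, name, requested || 'noarch')
}

console.log(chooseSubdirNoarchDefault(availability, '7zip')) // null: no noarch build exists
console.log(chooseSubdir(availability, '7zip')) // one of the available subdirs
```

Under these assumptions, the noarch default fails for any package that is only published per platform, while the "any architecture" fallback still resolves to something.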

@lamarrr
Author

lamarrr commented Jan 5, 2024

I also assume the type coordinate in both proposals would still be "conda" or "condasource". Also, if there is a GH or GitLab source location present in the channel / package metadata, is the intent also to populate the same in the definition?

I don't follow; what definition?
For the condasource type, the package is sourced from whatever source URL or source git URL (git is NOT always the source) is provided in the package's channel index.
For the conda type (architecture-dependent), it is sourced from conda's server source URL.

@bduranc

bduranc commented Jan 5, 2024

I also assume the type coordinate in both proposals would still be "conda" or "condasource". Also, if there is a GH or GitLab source location present in the channel / package metadata, is the intent also to populate the same in the definition?

I don't follow; what definition? For the condasource type, the package is sourced from whatever source URL or source git URL (git is NOT always the source) is provided in the package's channel index. For the conda type (architecture-dependent), it is sourced from conda's server source URL.

Basically, a "definition" == component in ClearlyDefined.

Using Maven as an example:
https://clearlydefined.io/definitions/maven/mavencentral/com.googlecode.openbox/maventools/2.0.1

and its corresponding "source" definition (Maven sourcearchive): https://clearlydefined.io/definitions/sourcearchive/mavencentral/com.googlecode.openbox/maventools/2.0.1

Or another example that has a GitHub repo maintained as its source location field instead of the Maven sourcearchive: https://clearlydefined.io/definitions/maven/mavencentral/io.eliez/mavenJava/2.0.1

fixed test

fixed lint
@lamarrr
Author

lamarrr commented Jan 8, 2024

I have now changed the delimiters and coordinate specification to:

{type: conda|condasource}/{provider: anaconda-main|anaconda-r|conda-forge}/-/{package name}/[{architecture | _}:][{version | _}]-[{build version | _}]/[{tool version}]
conda/conda-forge/-/numpy/linux-aarch64:1.13.0-py36/ - complete coordinate
conda/conda-forge/-/numpy/-py36/ -- any version with build hash py36*
conda/conda-forge/-/numpy/1.13.0-py36/ -- version with build hash
conda/conda-forge/-/numpy/linux-aarch64:_-py36/ -- architecture and build hash
conda/conda-forge/-/numpy/linux-aarch64:1.13.0/ -- architecture and version
conda/conda-forge/-/numpy/ - any
conda/conda-forge/-/numpy/_:_-_ - any

@lamarrr
Author

lamarrr commented Jan 8, 2024

It seems the file indexer encodes the coordinates into file paths directly from the coordinate spec, meaning coordinates separated with : might not work. I have changed the spec to this:

{type: conda|condasource}/{provider: anaconda-main|anaconda-r|conda-forge}/-/{package name}/[{architecture | _}--][{version | _}]-[{build version | _}]/[{tool version}]
conda/conda-forge/-/numpy/linux-aarch64--1.13.0-py36/ - complete coordinate
conda/conda-forge/-/numpy/-py36/ -- any version with build hash py36*
conda/conda-forge/-/numpy/1.13.0-py36/ -- version with build hash
conda/conda-forge/-/numpy/linux-aarch64--_-py36/ -- architecture and build hash
conda/conda-forge/-/numpy/linux-aarch64--1.13.0/ -- architecture and version
conda/conda-forge/-/numpy/ - any
conda/conda-forge/-/numpy/_--_-_ - any
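
A minimal sketch of how the revision segment of this spec could be parsed. `parseRevision` is a hypothetical helper, not the crawler's code, and it assumes that neither the version nor the build contains "-" and that the architecture never contains "--".

```javascript
// Parse "[{architecture | _}--][{version | _}]-[{build | _}]".
// Empty parts and "_" both mean "any", represented here as null.
function parseRevision(revision = '') {
  const wildcard = part => (part === '' || part === '_' ? null : part)

  // Everything before the first "--" is the architecture, if present.
  let architecture = null
  let rest = revision
  const sep = revision.indexOf('--')
  if (sep !== -1) {
    architecture = wildcard(revision.slice(0, sep))
    rest = revision.slice(sep + 2)
  }

  // The remainder is "<version>-<build>", where either half may be missing.
  const dash = rest.indexOf('-')
  const version = dash === -1 ? rest : rest.slice(0, dash)
  const build = dash === -1 ? '' : rest.slice(dash + 1)
  return { architecture, version: wildcard(version), build: wildcard(build) }
}

for (const revision of ['linux-aarch64--1.13.0-py36', '-py36', 'linux-aarch64--_-py36', '_--_-_']) {
  console.log(revision, parseRevision(revision))
}
```

Because "--" is looked up before the single "-", a hyphenated architecture such as linux-aarch64 does not get confused with the version/build separator.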

Collaborator

@qtomlinson qtomlinson left a comment

Thanks for making the changes as discussed!
I have reservations about treating condasrc as coordinates for binary packages in recent commits (reasons below).
Cons of using condasrc as coordinates for the binary base package:

  • heterogeneous, reference-based, not systematic
  • not unique. For example, different condasrc coordinates point to the same original package (source_url):
    condasrc/conda-forge/linux-64/pyarrow/11.0.0
    condasrc/conda-forge/linux-64/arrow-cpp/11.0.0
  • overlaps with other coordinate systems; a binary package can be described using its native coordinates,
    e.g. instead of condasrc/anaconda-main/linux-64/adal/1.2.7, use the coordinates pypi/pypi/-/adal/1.2.7
  • Based on stats from the conda-forge channeldata.json, close to 88% of the sources come from PyPI and GitHub. Introducing another binary package type to describe packages from different systems, e.g. Ruby gems, does not seem worthwhile.

Suggestions on condaFetch, condaExtract, and top below.

Resolved review threads:
providers/fetch/condaFetch.js (7 threads, 6 outdated)
providers/process/condaExtract.js (2 threads, both outdated)
providers/process/top.js (1 thread, outdated)
@lamarrr lamarrr force-pushed the conda-support branch 2 times, most recently from 6cd19a5 to f46187d Compare January 22, 2024 15:38
fixed checksum

removed redundant check

fixed checksum

fixed tests

added more tests

fix test

improved tests

update

updated tests

fixed test

refactoring

update
Resolved review threads:
config/cdConfig.js (1 thread, outdated)
providers/process/condaSrcExtract.js (1 thread)
providers/process/top.js (1 thread, outdated)
providers/process/condaExtract.js (2 threads, both outdated)
providers/fetch/condaFetch.js (3 threads, 2 outdated)
changed condafetch date to iso format

fixed test

update tests

update
Collaborator

@qtomlinson qtomlinson left a comment

Thanks for your contribution and incorporating the comments!

Resolved review thread: providers/fetch/condaFetch.js
@lamarrr lamarrr force-pushed the conda-support branch 2 times, most recently from dadbd90 to 73c5d96 Compare January 25, 2024 11:07
@qtomlinson qtomlinson self-requested a review January 26, 2024 02:48
fixed tests

fixed tests

update

[conda] fixed version selection

[conda] changed versioning

[conda] removed unused variable
@capfei
Member

capfei commented Jan 29, 2024

I don't have much knowledge of the crawler or conda, but I was able to get it running locally. This looks good to me, and I'm fine with the naming convention.

@qtomlinson qtomlinson merged commit 5cfbdd5 into clearlydefined:master May 2, 2024
1 check passed
@RazaAli99

Hi
