conda crawler implementation #532
update fixed license extraction process and implemented top packages extraction fixed CI warnings fixed ci warnings fixed fetcher handle check finished top command implementation fixed unused variable error fixed log info handled unspecified revision bug fixes fixed lint errors update update implemented fetch tests updated fixtures hashes fixed date format fixed license declaration test finished added extract tests update update fixed test fixed test update update update update update update
It is exciting to see that a new harvester is being implemented! This pull request provides a solid foundation for future enhancements. A discussion is needed on the proposed coordinates, e.g. conda/conda-forge/-/numpy/1.13.0_linux-aarch64/py36. In the above proposal, toolVersion is used to represent the build (string) in conda. Points to consider:
One possible alternative is to mirror the package search standard specification in the CD's coordinates. The mapping would be as follows:
Neither version nor build may contain "-" (see https://conda.io/projects/conda/en/latest/user-guide/concepts/pkg-specs.html#info-index-json), so using "-" as the separator works here. Additionally, when the architecture platform is not specified, should 'noarch' be considered the default value (https://docs.anaconda.com/anaconda-repository/user-guide/tasks/pkgs/use-noarch-pkgs/)? This would likely produce more predictable results. @elrayle @capfei @jeffwilcox @mpcen @pombredanne @bduranc I am not particularly knowledgeable in the conda ecosystem, and would very much appreciate other experts' input.
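To make the separator argument concrete, here is a minimal sketch of splitting and joining a revision on "-". The helper names `split_revision` and `join_revision` are hypothetical and not part of the crawler; the sketch only assumes, as stated above, that neither the version nor the build contains "-".

```python
from typing import Optional, Tuple


def split_revision(revision: str) -> Tuple[str, Optional[str]]:
    """Split 'version-build' into (version, build); build is None when absent.

    Assumes neither part contains '-', so the first '-' is the separator.
    """
    version, sep, build = revision.partition("-")
    if not version:
        raise ValueError(f"revision has no version part: {revision!r}")
    return version, (build if sep and build else None)


def join_revision(version: str, build: Optional[str] = None) -> str:
    """Inverse of split_revision; rejects parts that would break the split."""
    if "-" in version or (build is not None and "-" in build):
        raise ValueError("version and build must not contain '-'")
    return version if build is None else f"{version}-{build}"
```

Because the separator cannot occur inside either part, splitting and joining round-trip without any escaping.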
I'm okay with both proposals. Qing's is in line with Conda's own standards, which I assume are applicable to all three channels. I also assume the type coordinate in both proposals would still be "conda" or "condasource". Also, if there is a GH or GitLab source location present in the channel / package metadata, is the intent also to populate the same in the definition? In the past, I believe we've tried our best to keep new provider coordinate formats consistent with the others (for example, when we added Debian/debsrc support a few years ago, I believe that took some influence from the Maven implementation). It would be best to follow that practice as much as we can here too.
I had the impression that toolVersion referred to the tool the package was built with, not to the tooling that scans the licensing info. I feel it would have been less ambiguous if it were at the beginning of the package coordinate.
Agreed, I'll make it an optional parameter appended to the revision instead.
Agreed, but I couldn't find a better delimiter to use. I ran a regex search on some of the channels and none of them had that kind of versioning (with '_' in the version); it's always semantic versioning (numbers and hyphens only, with alpha/beta, https://semver.org/). The new revision can be since architecture can be linux-64. I feel noarch isn't the right default. noarch is for platform-agnostic packages, which may or may not be present. For example, 7zip isn't platform-agnostic; it depends on the architecture and OS, so it is not on the noarch list, and fetching it without specifying the architecture would fail. Besides, subdir isn't really a namespace; it's just a folder grouping of packages by architecture and OS (e.g. linux x64 packages -> /linux-64, windows x64 packages -> /win-64).
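The noarch point can be illustrated with a small sketch. It assumes the channel metadata lists, per package, the subdirs the package is published under (the `subdirs` field as it appears in channeldata.json); the sample data and the `pick_subdir` helper are hypothetical and only for illustration.

```python
from typing import Optional

# Hypothetical excerpt shaped like a channel's channeldata.json.
CHANNELDATA = {
    "packages": {
        "7zip": {"subdirs": ["linux-64", "win-64"]},
        "some-pure-python-pkg": {"subdirs": ["noarch"]},
    }
}


def pick_subdir(channeldata: dict, name: str, requested: Optional[str] = None) -> str:
    """Return the subdir to fetch from, honoring an explicit request if given."""
    package = channeldata["packages"].get(name)
    if package is None:
        raise KeyError(f"package not found in channel: {name}")
    subdirs = package.get("subdirs", [])
    if requested is not None:
        if requested not in subdirs:
            raise ValueError(f"{name} is not published under {requested}")
        return requested
    if not subdirs:
        raise ValueError(f"{name} lists no subdirs")
    # No architecture requested: take any listed subdir (first in sorted
    # order, to stay deterministic) rather than defaulting to 'noarch'.
    return sorted(subdirs)[0]
```

With this data, defaulting to noarch would fail for 7zip, while picking from the listed subdirs always resolves to a folder that actually contains the package.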
I don't get you, what definition?
Basically, a "definition" == component in ClearlyDefined. Using Maven as an example: and its corresponding "source" definition (Maven sourcearchive): https://clearlydefined.io/definitions/sourcearchive/mavencentral/com.googlecode.openbox/maventools/2.0.1 Or another example that has a GitHub repo maintained as its source location field instead of the Maven sourcearchive: https://clearlydefined.io/definitions/maven/mavencentral/io.eliez/mavenJava/2.0.1
fixed test fixed lint
I have now changed the delimiters and coordinate specification to:
Seems the file indexer encodes the coordinates into files directly from the coordinate spec, meaning paths separated with
Thanks for making the changes as discussed!
I have reservations about treating condasrc as binary packages in recent commits (reasons below).
Cons for condasrc as coordinates for binary base package:
- heterogeneous, reference based, not systematic
- not unique. For example, different condasrc coordinates point to the same thing (source_url) original package
  condasrc/conda-forge/linux-64/pyarrow/11.0.0
  condasrc/conda-forge/linux-64/arrow-cpp/11.0.0
- overlap with other coordinate systems; a binary package can be described using its native coordinates.
  e.g. instead of using condasrc/anaconda-main/linux-64/adal/1.2.7, use coordinates pypi/pypi/-/adal/1.2.7
- Based on stats from conda-forge channeldata.json, close to 88% comes from pypi and github. Introducing another binary package describing packages from different systems, e.g. ruby gem, does not seem worthwhile
Suggestions on condaFetch, condaExtract, and top below.
fixed checksum removed redundant check fixed checksum fixed tests added more tests fix test improved tests update updated tests fixed test refactoring update
changed condafetch date to iso format fixed test update tests update
Thanks for your contribution and incorporating the comments!
fixed tests fixed tests update [conda] fixed version selection [conda] changed versioning [conda] removed unused variable
I don't have much knowledge of the crawler or conda, but I was able to get it running locally. This looks good to me, and I'm fine with the naming convention.
Hi |
closes #535
This Merge request is intended to track the work in implementing the crawler for conda source packages.
Background
Conda exposes packages in a different format from other Python repositories like PyPI. Conda is a Python environment locked to a specific Python version.
Conda deals with packages locked to a specific version for a given version of the channel. Because the packages are managed for compatibility, they do not break due to one incompatibility or another, similar to how you'd ship a Docker container.
The primary consumption point is the "packages" themselves. They are accompanied by scripts that modify the environment and set up the packages and their dependencies, and these scripts are consumed by the setup application. The packages may also contain DLLs, scripts, compiled Python binaries (.pyc), Python code, etc.
The structure of conda repositories and their indexing process is described here: https://docs.conda.io/projects/conda-build/en/stable/concepts/generating-index.html
Conda has three main channels: anaconda-main, anaconda-r, and conda-forge which is more geared toward business uses
We crawl both the packages and the source code (not always specified) for the licensing metadata and other metadata about the package.
The source from which the conda packages are created is often, but not always, provided via a URL that links to a compressed source file hosted externally, sometimes on GitHub or another website. Note that this is a file and not a git repository.
The main conda package is hosted on the conda channels themselves. It is compressed and contains the licensing information, compilers, environment configuration scripts, dependencies, etc. that are needed to make the package work.
The crawler uses the coordinates of the syntax:
i.e.
where
type (required): conda or condasource
namespace (optional): architecture and OS of the package to be crawled, e.g. win64 or linux-aarch64. If no architecture is specified, any architecture is chosen.
package name: name of the package
provider (required): channel on which the package will be crawled: conda-forge, anaconda-main, or anaconda-r
revision (optional): package version and optional build version, e.g. 0.3.0 or 0.3.0-py36hffe2fc. For a conda coordinate type, the build version is usually a conda-specific representation of the build tools, environment configuration, and build iteration of the package; e.g. for a Python 3.9 environment, this could be py39H443E.
If none is specified, the latest one will be selected using the package's timestamp.
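The fields above, together with the timestamp fallback, can be sketched roughly as follows. This is not the crawler's code. The segment order type/provider/namespace/name/revision and the "-" placeholder for an empty namespace are assumptions taken from the examples quoted in the review comments (e.g. pypi/pypi/-/adal/1.2.7). The helper names and the sample index entries are hypothetical. The index entries are assumed to be shaped like the "packages" mapping of a channel's repodata.json.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Coordinates:
    type: str
    provider: str
    namespace: Optional[str]  # architecture/OS subdir, None when "-"
    name: str
    revision: Optional[str]   # "version" or "version-build", None when omitted


def parse_coordinates(spec: str) -> Coordinates:
    """Parse 'type/provider/namespace/name[/revision]' (assumed order)."""
    parts = spec.strip("/").split("/")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 segments, got {len(parts)}: {spec!r}")
    type_, provider, namespace, name = parts[:4]
    revision = parts[4] if len(parts) == 5 else None
    return Coordinates(type_, provider, None if namespace == "-" else namespace, name, revision)


def latest_by_timestamp(packages: Dict[str, dict], name: str) -> str:
    """Return the file name of the newest entry for `name`, by timestamp."""
    candidates = {fn: meta for fn, meta in packages.items() if meta.get("name") == name}
    if not candidates:
        raise LookupError(f"no entries found for package {name!r}")
    # Entries without a timestamp sort as oldest.
    return max(candidates, key=lambda fn: candidates[fn].get("timestamp", 0))


# Hypothetical index entries, for illustration only.
SAMPLE_PACKAGES = {
    "pkg-0.3.0-py36_0.tar.bz2": {"name": "pkg", "version": "0.3.0", "build": "py36_0", "timestamp": 1500000000000},
    "pkg-0.3.1-py39_0.tar.bz2": {"name": "pkg", "version": "0.3.1", "build": "py39_0", "timestamp": 1600000000000},
    "other-1.0-0.tar.bz2": {"name": "other", "version": "1.0", "build": "0", "timestamp": 1700000000000},
}
```

When `revision` is None after parsing, a caller would fall back to `latest_by_timestamp` to choose which package file to fetch.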
Conda-forge is a community effort and packages are published by opening PRs on their GitHub repository as described here https://conda-forge.org/docs/maintainer/adding_pkgs.html