Fix how Warehouse stores metadata (per-file, not per-release) #8090

Open

Description

Task list:

  • Add a FileMetadata model and an optional 1-1 relationship to the File model. The fields of this model should correspond to the available Core Metadata fields.
  • Add a metadata_source field to FileMetadata that lets us determine if the metadata is "provided" (i.e. from the POST) or "extracted" (i.e. from the artifact itself).
  • Modify the upload endpoint to start creating FileMetadata objects on new uploads, discerning how we got the metadata (wheels should always be "extracted", source distributions will be "provided")
  • Write a task to backfill FileMetadata objects for wheels that have already been uploaded, using data from our metadata backfill files ("Remove metadata backfill task" #15526 will be helpful here)
  • Create a way to select the "metadata source" for a release (the oldest file for that release)
  • Drop all the various metadata fields on the Release model and update the UI to get them from the "metadata source" instead.
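
The first, second, and fifth tasks above can be sketched roughly as follows. This is only an illustration using plain dataclasses, not Warehouse's actual models: the names `MetadataSource`, `upload_time`, and `metadata_source_for_release` are assumptions made for the example, and only one Core Metadata field is shown.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class MetadataSource(Enum):
    PROVIDED = "provided"    # metadata came from the upload POST
    EXTRACTED = "extracted"  # metadata was read from the artifact itself


@dataclass
class FileMetadata:
    metadata_source: MetadataSource
    requires_dist: List[str]  # a single Core Metadata field, for brevity


@dataclass
class File:
    filename: str
    upload_time: datetime  # hypothetical field name for this sketch
    metadata: Optional[FileMetadata] = None  # optional 1-1 relationship


def metadata_source_for_release(files: List[File]) -> Optional[File]:
    """Return the oldest file of a release, or None if it has no files."""
    return min(files, key=lambda f: f.upload_time, default=None)
```

With this shape, the release-level UI would read its fields from `metadata_source_for_release(files).metadata` rather than from columns on the Release itself.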

Original issue

Describe the bug
Warehouse's API sometimes gives the user inaccurate dependency metadata for a project's release, because:

  • we only store the dependency information per Release, but it can vary per file.
  • the first file uploaded creates the Release object, which is also problematic: if the user uploads a source distribution first, no dependency information is encoded for the Release.

Expected behavior
As I understand it, we should change how we store and provide dependency information, recording and storing it per file instead of per release. I presume this means that the requires_dist field within the release endpoint would move from the "info" value to the individual "releases" values.
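
To make the proposed shape concrete, here is a rough sketch of a response in which requires_dist lives on each file entry under "releases" rather than under "info". The helper `build_response` and the exact set of keys are assumptions for illustration, not the current or final API.

```python
def build_response(name, version, files):
    """Build a JSON-like dict for one release.

    `files` is a list of (filename, requires_dist) pairs, so each file
    carries its own dependency information.
    """
    return {
        # No release-wide requires_dist here any more.
        "info": {"name": name, "version": version},
        "releases": {
            version: [
                {"filename": filename, "requires_dist": list(requires)}
                for filename, requires in files
            ]
        },
    }
```

A client resolving dependencies would then pick the file it intends to install and read that entry's requires_dist, instead of trusting a single value for the whole release.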

To Reproduce
Sorry, I don't have one to hand.

Additional context
Quoting a conversation in IRC today between @dstufft and @techalchemy (quotes are condensed and edited for grammar):

@dstufft said of the current Warehouse JSON API, "I don't think it's usable in its current form for resolving dependencies". Regarding the metadata available, which clients would otherwise need to download packages to acquire,
"the data is wrong is the main thing ... for dep info .... because warehouse (and originally PyPI's data model) is wrong. We only store the dependency information per Release, but it can vary per file."

@techalchemy asked: "so which file do you pick for parsing dependencies? the first wheel that gets uploaded? Or the last one?"
@dstufft: "first file uploaded creates the Release object. which is also problematic, if you upload a sdist first no dependency information is encoded.... At one point only twine worked to upload dependency information. If you uploaded with setuptools it didn't get sent no matter what."

Donald also noted, on parseability of that info, "We [Warehouse] do not currently parse anything inside of a wheel, in part because we never did, in part because upload already takes forever and the more stuff we do the longer it takes. I think our timeout on upload is multiple minutes, because that's how long it takes sometimes." (That's a reason for #7730 but we should not block on that.)
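
For reference, reading dependency information out of a wheel does not need much code. The sketch below uses only the standard library and assumes the wheel contains exactly one top-level `*.dist-info/METADATA` file; the function name `requires_dist_from_wheel` is made up for this example, and the sketch says nothing about how this would affect upload time.

```python
import zipfile
from email.parser import HeaderParser


def requires_dist_from_wheel(path_or_file):
    """Return the Requires-Dist entries found in a wheel's METADATA file."""
    with zipfile.ZipFile(path_or_file) as wheel:
        # METADATA lives in the single top-level "<name>-<version>.dist-info" dir.
        candidates = [
            name for name in wheel.namelist()
            if name.count("/") == 1 and name.endswith(".dist-info/METADATA")
        ]
        if len(candidates) != 1:
            raise ValueError("expected exactly one .dist-info/METADATA file")
        text = wheel.read(candidates[0]).decode("utf-8")

    # Core Metadata uses an email-header-like format, so the stdlib
    # header parser can read repeated fields such as Requires-Dist.
    message = HeaderParser().parsestr(text)
    return message.get_all("Requires-Dist") or []
```

The function accepts either a filesystem path or a file-like object, so it could be pointed at an uploaded file without writing it to disk first.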

"We might want to tweak the JSON API a bit just to make it suitable for the primary use case I think people want it for, and when I say tweak, I basically mean add a field or two to a dict inside of a list"
