Description
opened on Jun 10, 2020
Task list:
- Add a `FileMetadata` model and an optional 1-1 relationship to the `File` model. The fields of this model should correspond to the available Core Metadata fields.
- Add a `metadata_source` field to `FileMetadata` that lets us determine if the metadata is "provided" (i.e. from the `POST`) or "extracted" (i.e. from the artifact itself).
- Modify the upload endpoint to start creating `FileMetadata` objects on new uploads, discerning how we got the metadata (wheels should always be "extracted", source distributions will be "provided").
- Write a task to backfill `FileMetadata` files for wheels that have already been uploaded with data from our metadata backfill files (Remove metadata backfill task #15526 will be helpful here).
- Create a way to select the "metadata source" for a release (the oldest file for that release).
- Drop all the various metadata fields on the `Release` model and update the UI to get them from the "metadata source" instead.
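The task list above could be sketched roughly as follows. This is a minimal illustration using plain dataclasses rather than Warehouse's actual ORM models. The class layout, the `packagetype` values, and the helpers `source_for_upload` and `select_metadata_source` are assumptions made for the example, as is skipping files that have no metadata yet.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class MetadataSource(Enum):
    PROVIDED = "provided"    # metadata taken from the upload POST
    EXTRACTED = "extracted"  # metadata read from the artifact itself


@dataclass
class FileMetadata:
    metadata_source: MetadataSource
    requires_dist: list[str]
    # The remaining Core Metadata fields would be added here.


@dataclass
class File:
    filename: str
    packagetype: str  # assumed values: "bdist_wheel" or "sdist"
    upload_time: datetime
    metadata: Optional[FileMetadata] = None  # optional 1-1 relationship


def source_for_upload(packagetype: str) -> MetadataSource:
    """Hypothetical helper: wheels are "extracted", everything else "provided"."""
    if packagetype == "bdist_wheel":
        return MetadataSource.EXTRACTED
    return MetadataSource.PROVIDED


def select_metadata_source(files: list[File]) -> Optional[File]:
    """Hypothetical helper: pick the oldest file of a release.

    Assumption: files without a FileMetadata row (not yet backfilled)
    are skipped, so the result always carries metadata or is None.
    """
    candidates = [f for f in files if f.metadata is not None]
    return min(candidates, key=lambda f: f.upload_time, default=None)
```

With this shape, the `Release` no longer needs its own metadata columns: the UI would read from `select_metadata_source(release_files).metadata` instead.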
Original issue
Describe the bug
Warehouse's API sometimes gives users inaccurate dependency metadata for a project's release, because:
- We only store the dependency information per Release, but it can vary per file.
- The first file uploaded creates the Release object, which is also problematic: if the user uploads a source distribution first, no dependency information is encoded for the Release.
Expected behavior
As I understand it, we should change how we store and provide dependency information, recording and storing it per file instead of per release. I presume this means that the `requires_dist` field within the release endpoint would move from the "info" value to the individual "releases" values.
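To make the presumed change concrete, here is a rough sketch of what the response shape might look like afterwards. The function `release_json` and its input dicts are hypothetical, and the real endpoint returns many more keys. The only point is that `requires_dist` sits on each file entry rather than under "info".

```python
def release_json(version, files):
    """Build a pared-down, hypothetical release response.

    `files` is a list of dicts with "filename" and "requires_dist" keys;
    this input shape is an assumption for the example.
    """
    return {
        # No release-level requires_dist here any more.
        "info": {"version": version},
        "releases": {
            version: [
                {
                    "filename": f["filename"],
                    # Dependency information is reported per file.
                    "requires_dist": f["requires_dist"],
                }
                for f in files
            ]
        },
    }
```

A client resolving dependencies would then read `requires_dist` from the entry for the specific file it intends to install.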
To Reproduce
Sorry, I don't have one to hand.
Additional context
Quoting a conversation in IRC today between @dstufft and @techalchemy (quotes are condensed and edited for grammar):
@dstufft said of the current Warehouse JSON API, "I don't think it's usable in its current form for resolving dependencies". Regarding the metadata available, which clients would otherwise need to download packages to acquire,
"the data is wrong is the main thing ... for dep info .... because warehouse (and originally PyPI's data model) is wrong. We only store the dependency information per Release, but it can vary per file."
@techalchemy asked: "so which file do you pick for parsing dependencies? the first wheel that gets uploaded? Or the last one?"
@dstufft: "first file uploaded creates the Release object. which is also problematic, if you upload a sdist first no dependency information is encoded.... At one point only twine worked to upload dependency information. If you uploaded with setuptools it didn't get sent no matter what."
Donald also noted, on parseability of that info, "We [Warehouse] do not currently parse anything inside of a wheel, in part because we never did, in part because upload already takes forever and the more stuff we do the longer it takes. I think our timeout on upload is multiple minutes, because that's how long it takes sometimes." (That's a reason for #7730 but we should not block on that.)
"We might want to tweak the JSON API a bit just to make it suitable for the primary use case I think people want it for, and when I say tweak, I basically mean add a field or two to a dict inside of a list."