Fixes for repo url update on move detection #3391
Conversation
@MoralCode : Your root cause analysis sounds DEAD ON, and it explains why this appears not to be working in the case of automatic moving. This will fix that issue.
sgoggins
left a comment
LGTM!
Still waiting on testing this to 100% confirm that it will update the repo_git URL.

Will hold off on marking it ready and merging until you green-light it, then.
Current issues with this: there seems to be a dependence on using the repo URL for querying:

@MoralCode Would changing that query to be based on the repo_src_id fix the issue? Or does GitHub require the URL to get to the repo_src_id?

That's what I was thinking; I just have to do it. And I suspect the code for it is buried somewhere in Augur's various functions.

The above stack trace seems to be happening in the error handler for Augur's celery tasks. The fact that it is still repo_url-based is a separate tech-debt issue, but the trace itself comes from us throwing an exception to stop collection on repo move or delete, which is the subject of #3166. I'll likely try to solve both in this PR.
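The control flow described here can be sketched minimally. `RepoMovedError` and `handle_task_failure` are illustrative names under my assumptions, not Augur's actual classes or handlers:

```python
# Hedged sketch of the flow described above: a dedicated exception type lets
# the error handler tell "collection halted on purpose (repo moved/deleted)"
# apart from a genuine failure. All names here are illustrative.

class RepoMovedError(Exception):
    """Raised to deliberately stop collection when a repo move or delete is detected."""

def handle_task_failure(exc):
    # Inspect the exception type rather than treating every raised
    # exception as an unexpected crash.
    if isinstance(exc, RepoMovedError):
        return "halted"    # intentional stop, not an error to report
    return "errored"       # real failure, surface it as usual
```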
@MoralCode : This one appears ready.

It largely is. However, I would also like to include a new database table alongside this fix so that, when the repo URL gets updated, the old one gets saved in a new table. We can merge this, but data will be lost until that secondary change is in as well, and because that secondary change requires a database migration, it's largely blocked on some of the database sync/organizing PRs currently under review.
OK, so this was basically due to how the retry behavior in celery works. I think it was retrying the same task with the same URL, but now that the repo table has the new URL it wasn't finding it.
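The pitfall can be shown with a toy lookup; `repos`, `find_by_url`, and `find_by_id` are hypothetical stand-ins, not Augur's real query code:

```python
# Toy illustration of the retry pitfall described above (all names are
# hypothetical): once the repo row's URL is rewritten, a retried task that
# re-queries by the old URL finds nothing, while a lookup keyed on the
# stable repo_id still succeeds.

repos = {42: {"repo_id": 42, "repo_git": "https://github.com/new-org/proj"}}

def find_by_url(url):
    return next((r for r in repos.values() if r["repo_git"] == url), None)

def find_by_id(repo_id):
    return repos.get(repo_id)

stale = find_by_url("https://github.com/old-org/proj")  # None: URL was updated
fresh = find_by_id(42)                                   # still found
```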
OK, this change now contains the new table and the code to populate it on move, so it officially fixes #3129 🎉 I tested this with mild effort locally: I am seeing values populate the new table when the task runs, and I was able to fix a few issues with the code, but please someone else also test this. Here is a set of repos I have used that still have active redirects (test one at a time so you can iterate and not struggle to find new ones because you tested them all at once):
The maintainers' call brought up the concern that, while GitHub is redirecting the old URL for a moved repo, another new (and different) repo can be created at the old URL, and we would need a way to disambiguate. I suspect the way I presented it (i.e. that we would use this table for primary Augur operations) was probably wrong. After talking to Cali, it sounds like the best plan is to always use the repo source ID for operations, especially repo-uniqueness checking. Essentially, newly added repos can use the GitHub API for most cases (valid repo, moved repo, getting the src id), and if that fails (i.e. the repo URL is a 404) we fall back to a "best effort" strategy: check repo_aliases to see if there is anything, grab the most recent URL if there is, and fail if there isn't. @Ulincsys does that work as far as conflict resolution? The goal would be to treat this aliases table as more of a historic log for analysis and for not losing data.
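The fallback strategy described in this comment could be sketched roughly as follows; `lookup_src_id_via_api` stands in for a platform API call and `aliases` for rows of the repo_aliases table, and none of this is Augur's real code:

```python
# Hedged sketch of the proposed lookup order: API first (covers valid and
# moved repos), then the alias log as a best effort, then fail.

def resolve_repo(url, lookup_src_id_via_api, aliases):
    """Resolve a URL to a source ID, falling back to the alias log.

    aliases: iterable of (url, src_id, collected_at) tuples.
    """
    src_id = lookup_src_id_via_api(url)   # valid repo or followed move
    if src_id is not None:
        return src_id
    # URL 404ed: best effort -- take the most recently collected alias match.
    matches = [a for a in aliases if a[0] == url]
    if matches:
        return max(matches, key=lambda a: a[2])[1]
    raise LookupError(f"no record of repo url: {url}")
```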
I'm sorry if that had not been clear earlier! Absolutely still using the repo src id for operations (my personal agenda is to push for everything that can be based on the src id to be), but the prior URLs are stored for historical reference. In the case of 8knot, that info would be integrated into the search bar at some point. Happy to discuss more; I've thought about this issue a lot.
Part of me is a little worried about "use the src id for everything", since it is fundamentally a Git-dependent value. I think it makes a ton of sense to use it when interfacing with the outside world (i.e. someone gives us a git URL and we need to check whether we have it, so we should ping the API and check the GitHub ID for it), but internal things (i.e. JOIN queries for data analysis, or querying the list of all previously known URLs for a repo) should maybe still JOIN on the repo_id. At this point I'm not sure whether it makes more sense to also include the src_id in this new aliases table or not. I'm leaning no, because I think it makes sense to treat this table as essentially an internal log of historical names for analysis/informational purposes (and a last-ditch effort to resolve a user's URL to a repo before showing an error), but not as a primary form of deduplicating repos.
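The internal-JOIN idea here can be illustrated with an in-memory SQLite database; the table and column names echo the discussion (repo, repo_aliases, repo_id) but are assumptions, not Augur's actual schema:

```python
# Illustrative sketch: internal queries join on the stable repo_id, never on
# a platform-specific src_id or on a URL that may have changed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repo (repo_id INTEGER PRIMARY KEY, repo_git TEXT);
    CREATE TABLE repo_aliases (repo_id INTEGER, alias_url TEXT);
    INSERT INTO repo VALUES (1, 'https://github.com/new-org/proj');
    INSERT INTO repo_aliases VALUES (1, 'https://github.com/old-org/proj');
""")

# All previously known URLs for a repo, keyed on the internal repo_id.
rows = conn.execute("""
    SELECT r.repo_git, a.alias_url
    FROM repo r JOIN repo_aliases a ON r.repo_id = a.repo_id
""").fetchall()
```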
On repo_id: completely agree, I should have been more specific. I meant checking for uniqueness.

The src_id is not Git-dependent: each platform has its own integer identifier that follows a repository even if you change its URL.

Agreed: we need to keep the URLs; they just won't be our primary identifier.

Do we know whether this applies to all the forges we plan to support (i.e. Forgejo, cgit/generic Git)?
Also, where does this conversation leave us as far as this PR? If we essentially write to the aliases table as if it is a log of each time a repo URL changes, does that reframing of its purpose prevent the issues that would arise (a new, different repo reusing an old URL that was previously a redirect) if we used it for operational deduplication? CC @Ulincsys
@MoralCode that's how I understand it.
Thinking about this again, I think either framing has the same issue; the difference is essentially who deals with it. It sounds like the possibility of the aliases table having two entries (two repo_ids) for the same URL is rare enough that it can probably be dealt with at data-analysis time, using the collection date to differentiate. @cdolfi does the collection date seem sufficient for distinguishing possible duplicates for analysis?
@MoralCode yes, I'm personally not concerned about how to handle the situation where two repos had the same URL at different points in time. That's already much less difficult than navigating the current situation.
As far as aliases for moved repos are concerned, I think there is no reason to suspect that every platform would allow such a feature to exist. Support for it would need to be implemented on a per-platform basis in Augur, possibly with a Factory or Builder design-pattern approach. Here is my line of thinking: there is no reason to suspect that a platform which implements URL redirects must also provide a unique global identifier separate from the URL.

In the event that a repo alias for a supported platform returns a conflicting source ID, that entry can simply be deleted. Though I do consider myself a tremendous data hoarder, I see no use in maintaining an infinite changelog of repo location histories. Please let me know if there are any questions I can answer; @cdolfi, @MoralCode
```python
url = to_insert['repo_git']
logger.info(f"Updated repo for {url}\n")
logger.info(f"Updated repo {old_url} to {url} and set alias\n")
```
In the event that an IntegrityError occurs in the above try/catch, this log statement becomes untrue.
Either the conflict must be resolved above before continuing, or a separate log statement must be issued when setting an alias fails.
Ah yep. Given how small this table is, I suspect the most likely integrity error is that the url+repo_id combination already exists in the table, so IMO logging a message and continuing is probably ideal, especially given the best-effort nature of this table.
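The log-and-continue approach suggested here might look roughly like this. `IntegrityError` is a local stand-in for the database layer's exception (e.g. SQLAlchemy's), and `insert_alias` is a hypothetical insert helper, not Augur's real code:

```python
# Sketch of a best-effort alias insert that tolerates the url+repo_id pair
# already existing, logging honestly instead of claiming the alias was set.
import logging

logger = logging.getLogger(__name__)

class IntegrityError(Exception):
    """Stand-in for the DB layer's integrity-violation exception."""

def record_alias(insert_alias, repo_id, old_url):
    """Try to save the old URL; on a duplicate-row conflict, log and continue."""
    try:
        insert_alias(repo_id, old_url)
        return True
    except IntegrityError:
        # Most likely the alias row already exists; given the best-effort
        # nature of the table, this is not worth failing the task over.
        logger.info("alias %s for repo %s already recorded; continuing",
                    old_url, repo_id)
        return False
```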
I think the existence of this aliases table is more of a best-effort/nice-to-have/convenience-tier solution anyway. If we wanted to be thorough about URL history, we would need a way to query that history from somewhere like GitHub, since the aliases are only sourced from the URLs people have attempted to load into Augur. I think the best-effort nature of this generally lines up with John's point that not every platform is likely to even support it. It helps us not actively lose data when we detect a repo move, but I don't think the goal is to be perfectly comprehensive about every URL move, just to provide a basic list of other URLs we have previously seen a particular repo at. @Ulincsys I guess my core question is: is there anything fundamentally problematic that would prevent us merging this? As Cali mentioned, this will improve the experience of managing duplicate URLs a lot, even if it's a stepping stone to a better solution later.
@Ulincsys So personally I do see the value in keeping the historical log of the repo URL. Repos can change name or org location but still be known by their prior identifier. It is also helpful when doing data analysis around repo donations to foundations and things like that; foundations like Apache change the repo name as projects progress through graduation. Having the up-to-date repo URL is definitely the biggest priority, and 8knot has had user issues with it for months now, but the historical record is incredibly useful from an analysis standpoint.
I'd also say I'd think about it like the contributor alias table: to my knowledge it does not/will not work for every source, but it is still useful in the cases where we can get that information.
Description
@cdolfi has reported an issue (#3129) where repos that have moved and are redirecting when visited in a browser are not having their URLs updated to reflect the move.
Using an AI tool to look over the relevant task and identify issues for review, it identified that the `hit_api` function being used for the API calls was internally passing `follow_redirects=True` to the underlying HTTP library. This explains why the repo URLs weren't being updated: because all GitHub calls were automatically following the redirect, the check for `response_code == 301` later in the code would practically never fire. This is related to #3129 (but also slightly exacerbates it, because it doesn't yet store the old URL).
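A minimal sketch of the logic this fix enables, under the description above: once redirects are no longer followed automatically, a moved repo yields a 301 whose Location header (in any casing) carries the new URL. `detect_moved_repo` is an illustrative helper, not Augur's actual function:

```python
# Hypothetical helper: given a raw (unfollowed) response's status and
# headers, return the redirect target for a moved repo, else None.

def detect_moved_repo(status_code, headers):
    """Return the new URL if the response indicates a repo move, else None."""
    if status_code == 301:
        # Casing-agnostic Location lookup, since header casing can vary.
        for key, value in headers.items():
            if key.lower() == "location":
                return value
    return None
```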
Notes for Reviewers
I have yet to test this locally. Trying to see if I can replicate the issue, although I have a lot of repos in my local instance now that I should probably clear out....
Signed commits
Generative AI disclosure
`301` response comes back without a `Location` header. This code was reviewed and built upon by me (to make it casing-agnostic) before submitting.