Automatic .dedupe.once.title, sometimes #322
Comments
Hello @lemon24! 👋🏼 I'd like to help with this issue if possible. I could use a bit of help though :) Taking a look at the problematic feed, I don't see content/summary fields, but you mentioned they had probably changed. Maybe they are gone now? Am I missing something? Beyond that, I'm thinking about what the solution would look like:
What do you think? Thanks 🙏🏼 and great project 💯
Hi @davidag, thank you for your interest!
I checked a backup and the old entries didn't have content/summary either, so the pairs were not deduped because the body of these for loops never got a chance to run (and wouldn't have, unless both entries in a pair had content). This is partly by design: the current code tries very hard not to delete data ("when in doubt, keep both").
Indeed, most of the logic should happen in after_feed_update (the stuff in after_entry_update should probably have been there from the start). Here's what I believe the complete logic may look like; it matches your outline (with one difference noted below):

    def after_entry_update_hook:
        tag new entries with '.dedupe._new'

    def after_feed_update_hook:
        # optimization, not possible at the moment;
        # would require the hook to receive the UpdatedFeed,
        # or get_entries(tags='.dedupe._new') (filtering by entry tags)
        if there are no new entries:
            return

        collect all entry ids and titles
        group collected entries by title
        exclude groups with no more than 1 entry

        if feed does not have any '.dedupe.once*' tag:
            exclude groups that do not have new entries

        # optimization
        if there are no groups:
            clear '.dedupe._new' tag from entries
            return

        # select how strict we are about what we consider duplicates
        if feed has '.dedupe.once.title' tag:
            # user said so
            is_duplicate = _is_duplicate_title
        elif (
            none of the old entries have duplicate titles
            and none of the new entries have duplicate titles
            and most new entries have old entries with the same title
        ):
            # reasonably safe to dedupe by title alone
            is_duplicate = _is_duplicate_title
        else:
            # similarity dedupe
            is_duplicate = _is_duplicate_full

        run _dedupe_entries for each group (original logic)
        clear '.dedupe._new' tag from entries

Some notes:
Once again, thank you, and don't hesitate to ask any follow-up questions if needed.
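For illustration only, the grouping steps of the outline (group by title, drop single-entry groups, drop groups without new entries unless a '.dedupe.once*' tag is set) could be sketched in plain Python roughly as below. `EntryInfo`, `group_candidates` and the `dedupe_once` flag are hypothetical names, not part of the plugin, and titles are compared by exact match here, which is an assumption; the real code may normalize them first.

```python
from collections import defaultdict
from typing import NamedTuple


class EntryInfo(NamedTuple):
    # Hypothetical stand-in for the data the hook would collect per entry.
    id: str
    title: str
    is_new: bool  # True if the entry carries the '.dedupe._new' tag


def group_candidates(entries, dedupe_once=False):
    """Group entries by title and keep only the groups worth deduping."""
    by_title = defaultdict(list)
    for entry in entries:
        by_title[entry.title].append(entry)

    groups = []
    for group in by_title.values():
        if len(group) <= 1:
            continue  # a single entry has nothing to be a duplicate of
        if not dedupe_once and not any(e.is_new for e in group):
            continue  # old entries only; skipped unless a '.dedupe.once*' tag is set
        groups.append(group)
    return groups
```

With `dedupe_once=True`, groups made up only of old entries are kept as well, which corresponds to the feed having a '.dedupe.once*' tag in the outline.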
I got a feed with duplicate entries because the ids for all the entries changed; content dedupe didn't work for (most of?) them, likely because the content formatting/suffixes changed (todo: check).
I fixed it with .dedupe.once.title, checking beforehand that:
There's no reason the plugin can't do these checks in code.
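As a rough sketch of what such checks might look like in code, the function below mirrors the three conditions from the outline in the comments: no duplicate titles among old entries, none among new entries, and most new entries having an old entry with the same title. The function name, its signature and the 0.5 threshold used for "most" are assumptions made for illustration; none of them come from the plugin.

```python
from collections import Counter


def safe_to_dedupe_by_title(old_titles, new_titles, threshold=0.5):
    """Return True if deduping by title alone looks reasonably safe.

    `threshold` is an assumed reading of "most": the fraction of new
    entries that must have an old entry with the same title.
    """
    if not new_titles:
        return False  # nothing new, so nothing to decide

    old_counts = Counter(old_titles)
    new_counts = Counter(new_titles)

    # None of the old entries may share a title with another old entry.
    if any(count > 1 for count in old_counts.values()):
        return False
    # Same requirement for the new entries.
    if any(count > 1 for count in new_counts.values()):
        return False

    # Most new entries must have an old entry with the same title.
    matched = sum(1 for title in new_titles if title in old_counts)
    return matched / len(new_titles) > threshold
```

When this returns False, the outline above would fall back to the stricter similarity-based comparison instead of deduping by title alone.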