[DeviantArt] Duplicates in database #1874

Closed
kattjevfel opened this issue Sep 20, 2021 · 7 comments

Comments

@kattjevfel
Contributor

kattjevfel commented Sep 20, 2021

When downloading a deviation directly, the entry in gallery-dl's database looks like deviantart415470071.png, but when downloading via a gallery, for example, you get deviantartg_fluffytheneko_415470071.png. This easily leads to duplicates being downloaded and completely defeats the purpose of the database. My solution would be to keep only the first format, as it would also skip re-downloading images when accounts change name.

[Screenshot: sqlitebrowser_2021-09-20_18-12-02]

I also have countless entries missing the filename, so really I think the format should just be deviantart{index}
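To make the mismatch concrete, here is a minimal sketch in plain Python. It assumes the archive entry is simply the category name followed by a per-extractor format string filled in with the file's metadata; the two format strings are inferred from the entries quoted above, and `ARCHIVE_FORMATS` and `archive_id` are hypothetical names for illustration, not gallery-dl's actual internals.

```python
# Hypothetical re-creation of the two schemes implied by the entries above.
ARCHIVE_FORMATS = {
    "deviation": "{index}.{extension}",
    "gallery": "g_{_username}_{index}.{extension}",
}


def archive_id(subcategory, metadata, category="deviantart"):
    """Build an archive entry: category name plus the formatted scheme."""
    return category + ARCHIVE_FORMATS[subcategory].format(**metadata)


# Same deviation, same metadata, reached by two different routes.
metadata = {"_username": "fluffytheneko", "index": 415470071, "extension": "png"}

direct_id = archive_id("deviation", metadata)
gallery_id = archive_id("gallery", metadata)

# The file was first fetched via its direct link, so only that key is stored.
archive = {direct_id}

# The gallery run looks up a different key, misses, and downloads again.
already_seen = gallery_id in archive
print(direct_id)
print(gallery_id)
print(already_seen)
```

Because the two keys differ for the same `index`, a lookup under one scheme can never hit an entry stored under the other.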

@mikf
Owner

mikf commented Sep 20, 2021

Each deviantart sub-extractor uses its own archive id scheme for reasons: g_{_username}_{index}.{extension} for galleries, {index}.{extension} for single posts, etc (it kind of makes sense for favorites and such).
You are absolutely right that {index}.{extension} in general would be the best, but changing the internal defaults would invalidate someone else's archive entries and I'd rather avoid that if possible.
For your own setup, you can use the archive-format option to change the default value to something more reasonable.
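As a rough sketch of what that override could look like, the snippet below builds a config fragment and prints it as JSON. It assumes the usual gallery-dl layout where per-site options sit under `extractor` and then the site name; check the configuration docs for your version before relying on it.

```python
import json

# Assumption: per-site options live under "extractor" -> "<site name>".
config = {
    "extractor": {
        "deviantart": {
            # One scheme for every DeviantArt sub-extractor, so galleries,
            # favorites and direct links all produce the same archive entry.
            "archive-format": "{index}.{extension}",
        }
    }
}

# Serialize so the fragment can be merged into an existing config file.
config_text = json.dumps(config, indent=4)
print(config_text)
```

Entries already stored under the old per-extractor schemes will not match the new format, so files recorded that way may be downloaded once more after the switch. Using just `{index}` instead would drop the extension from the entry, which is what the opening post asks for.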

@Hrxn
Contributor

Hrxn commented Sep 20, 2021

I take a part of the blame here, because I've argued in favor of those reasons here in the past 😄

The rationale here is basically the principle of least surprise, but this is somewhat subjective and ultimately a question of personal priorities (in effect, avoiding duplicates is not the highest priority). I'll admit it's debatable whether this is really the most reasonable choice.

DeviantArt is kind of extreme here, with 10 different archive_fmt used as defaults for the different subcategories.

But this also depends on the cooperation of the site, to some extent. The identifier used here is apparently the {index} field. For it to be a good reference for the user-side archive, we depend on the site to actually have working, globally unique IDs for its hosted content. I think DeviantArt passes this test, as far as I know, but some sites had issues with this in ye olden days.

@kattjevfel
Contributor Author

I wasn't aware of archive-format, I've set that now and I guess that'll be good enough for now, but it sure has caused a lot of frustration in the past, especially since I have a post-processor that converts pngs to webp, but the tool won't overwrite, so several times I've ended up with duplicates that are hard to deal with :p

Anyway, can we at least set the same for galleries and direct link to posts? Aka set g_{_username}_{index}.{extension} for that one too, this would get rid of duplicate entries and I don't think it'd mess anything up.

@rautamiekka
Contributor

g_{_username}_{index}.{extension}

This leads to the problem where an upload that has been replaced with a modified version, possibly a better one, is skipped.

Conversely, having the upload's name leads to potential dupes, but pretty sure you can't upload anything so massive it'll be a problem if downloaded multiple times, and it's always better to have a potential dupe than replace with a potentially damaged, let alone inferior, one.

I'd recommend:

{category}_{author[username]}_{index}_{date:%Y-%m-%d_%H_%M_%S}_{title}.{extension}

^ Example:

deviantart_Olivergriffiths_892443878_2021-09-20_13_34_31_MLP_ Watercolor commission.jpg

^ That's a 24h time cuz I didn't find a direct formatting code in Python docs.
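For anyone who wants to check how that format string expands, here is a small sketch using Python's built-in `str.format`. gallery-dl has its own formatter, so treat this only as an approximation of the same replacement-field syntax; the metadata values are taken from the example above, and the dictionary layout is an assumption for illustration.

```python
from datetime import datetime

# The recommended format string from the comment above.
FORMAT = (
    "{category}_{author[username]}_{index}_"
    "{date:%Y-%m-%d_%H_%M_%S}_{title}.{extension}"
)

# Hypothetical metadata shaped to match the fields used in the format string.
metadata = {
    "category": "deviantart",
    "author": {"username": "Olivergriffiths"},
    "index": 892443878,
    "date": datetime(2021, 9, 20, 13, 34, 31),
    "title": "MLP_ Watercolor commission",
    "extension": "jpg",
}

# {author[username]} does a dictionary lookup; the part after the colon in
# {date:...} is passed to the datetime object as a strftime pattern.
name = FORMAT.format(**metadata)
print(name)
```

In strftime patterns, `%H` is the 24-hour clock; `%I` gives the 12-hour hour and is normally paired with `%p` for AM/PM.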

@Hrxn
Contributor

Hrxn commented Sep 21, 2021

[..] especially since I have a post-processor that converts pngs to webp, but the tool won't overwrite, so several times I've ended up with duplicates that are hard to deal with :p

Well, there are tools specifically made to deal with duplicate files. These here should even support similar image search, which might help with manually converted images etc.

Anyway, can we at least set the same for galleries and direct link to posts? Aka set g_{_username}_{index}.{extension} for that one too, this would get rid of duplicate entries and I don't think it'd mess anything up.

Fine with me...

@kattjevfel
Contributor Author

Well, there are tools specifically made to deal with duplicate files. These here should even support similar image search, which might help with manually converted images etc.

I think we misunderstood each other. What ends up happening is that the file is downloaded by gallery-dl and then converted to another format. Then gallery-dl downloads that file again, and suddenly I have one version in png and one in webp. But the safety measure to not overwrite files kicks in and I am stuck with both files.

All in all, external problem, but it gets triggered by the database not doing its job :P

mikf added a commit that referenced this issue Sep 23, 2021
@mikf
Owner

mikf commented Sep 23, 2021

Anyway, can we at least set the same for galleries and direct link to posts? Aka set g_{_username}_{index}.{extension} for that one too, this would get rid of duplicate entries and I don't think it'd mess anything up.

Done (ada36c2)

mikf closed this as completed Sep 23, 2021