deviantart literature items are no longer being downloaded, the html ends up nearly empty #6207

Closed
left1000 opened this issue Sep 18, 2024 · 28 comments

Comments

@left1000

deviantart literature items are no longer being downloaded, the html ends up nearly empty

There's a tiny bit of HTML for the page layout, but the entire content of the literature post is missing. It worked for years and broke sometime in the past 1-2 months. I kept running rip-updates without noticing, since the files weren't totally blank, so I'm not sure when or what broke it.

Anyone know what I'm talking about?

@mikf
Owner

mikf commented Sep 19, 2024

Duplicate of #6196

@left1000
Author

This is not quite a duplicate. I meant literature items in the gallery, not statuses or journals, like https://www.deviantart.com/tag/literature. If you open something random from that search result, you get something like this (NSFW warning): https://www.deviantart.com/milflover5335/gallery/93420151/stories-literature

They're gallery items, not journals or posts. But yeah, if they're broken, they're likely broken in the same manner, I guess? Still, my issue is a slight expansion of the other comments.

@left1000
Author

also it should totally be fixable: gallery literature items can be downloaded manually in Chrome by using "Save as" on the entire page and saving it as HTML. Maybe posts and journals can't be fixed because they're not inside the gallery at all, but literature deviations are?

@mikf
Owner

mikf commented Sep 21, 2024

I didn't realize DA has "literature items", which are somehow not the same as journals. Nonetheless, they do internally get processed the same way as journals do, meaning they use the same API endpoint to get their full text content, which is currently broken.

also it should totally be fixable

I'll probably find some workaround, but this wouldn't be necessary in the first place if DA wouldn't break their site ...

@AtomicTEM

When this gets fixed, would it be possible to add an option to overwrite journals, statuses and literature that were empty when they were downloaded?

@left1000
Author

Testing for emptiness is probably just a waste of effort. You could just delete all of them from the past few months and/or overwrite all of them; they're small files, being just text HTML files.

@AtomicTEM

Okay, so when I re-scrape an account, how do I have it skip all submissions that aren't text/html?

@mikf
Owner

mikf commented Sep 28, 2024

In general

--filter "extension == 'htm'" --no-skip -o original=false

but it might be better to directly use a user's /posts URL (or -o include=journal) where you'd only need --no-skip.
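
For readers unfamiliar with `--filter`: as far as I understand it, the option takes a Python expression that is evaluated against each file's metadata, and only files for which it is true get downloaded. The sketch below imitates that idea in plain Python. It is not gallery-dl's code; `select_files` and the sample metadata dicts are hypothetical names made up for illustration.

```python
def select_files(expression, files):
    """Return the metadata dicts for which the filter expression is true.

    Hypothetical helper: it only mimics the idea behind --filter and is
    not gallery-dl's actual implementation.
    """
    code = compile(expression, "<filter>", "eval")
    # Each file's metadata fields become the names visible to the expression.
    return [f for f in files if eval(code, {"__builtins__": {}}, dict(f))]


# Made-up metadata for two files of one gallery.
files = [
    {"filename": "story", "extension": "htm"},
    {"filename": "picture", "extension": "jpg"},
]

for selected in select_files("extension == 'htm'", files):
    print(selected["filename"])
```

With the made-up metadata above, only the `htm` entry passes the filter, which is the effect the command is after.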

@mikf mikf closed this as completed Sep 28, 2024
@a-washing-machine

For informational purposes to anyone having to delete broken html-downloads -

I don't know when the site-change took place exactly, but it must've been after August 20th, as I still have proper journal downloads for that day.

If anybody has a later date for reference, feel free to post it here for the benefit of those having to delete broken html files for re-download.
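
A rough sketch of how such broken files could be located before deleting them by hand. It assumes that a file's modification time reflects when it was downloaded and that a broken file is small; the 8 KiB threshold and the function name `find_suspect_htm` are assumptions for illustration (the thread mentions broken files of about 1 KB and 4 KB versus an expected 24 KB). Short but complete stories would be flagged too, so the list is only a starting point for review.

```python
import os
from datetime import datetime


def find_suspect_htm(root, cutoff, max_bytes=8192):
    """List .htm/.html files under root that were modified on or after
    cutoff and are no larger than max_bytes.

    Hypothetical helper; both criteria are guesses, not gallery-dl logic.
    """
    cutoff_ts = cutoff.timestamp()
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".htm", ".html")):
                continue
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            if info.st_mtime >= cutoff_ts and info.st_size <= max_bytes:
                suspects.append(path)
    return sorted(suspects)


if __name__ == "__main__":
    # August 21st, 2024 is the day after the last known-good download
    # mentioned above; adjust the directory and date as needed.
    for path in find_suspect_htm(".", datetime(2024, 8, 21)):
        print(path)
```

The script only prints paths and deletes nothing.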

@left1000
Author

left1000 commented Sep 29, 2024

This issue is not fixed, it might be fixed for posts, but NOT for literature deviations. The fix has changed the error though from a blank story to like 2 paragraphs of story out of 12 paragraphs that said literature deviation might have... I cannot find any examples of lengthy literature deviations that work with this new fix.

In essence this fix took me up from grabbing 1kilobyte .htm files to grabbing 4kilobyte .htm files when what was needed was a 24 kilobyte .htm file... if that makes any sense..

The newest release might have fixed the issue that this issue is a duplicate of (journal posts) but it didn't fix literature deviation posts (possibly because journal posts are typically far shorter?)

@AtomicTEM

This issue is not fixed, it might be fixed for posts, but NOT for literature deviations. The fix has changed the error though from a blank story to like 2 paragraphs of story out of 12 paragraphs that said literature deviation might have... I cannot find any examples of lengthy literature deviations that work with this new fix.

In essence this fix took me up from grabbing 1kilobyte .htm files to grabbing 4kilobyte .htm files when what was needed was a 24 kilobyte .htm file... if that makes any sense..

The newest release might have fixed the issue that this issue is a duplicate of (journal posts) but it didn't fix literature deviation posts (possibly because journal posts are typically far shorter?)

Yes, I can corroborate this. The thing is, this issue also happens to me when I just browse DA normally, so I have to reload the page. I think it's because the workaround is done in a way that mimics normal browsing and does not go through the API, so you might have to download at non-peak hours of the site to mitigate the literature being only 2 paragraphs.

@mikf mikf reopened this Sep 29, 2024
@a-washing-machine

Literature submissions in stash will also cause an error message and not be downloaded at all.

This may have been overlooked since it isn't possible anymore to make new literature uploads into stash since the Eclipse update, but that doesn't mean those don't still exist, example here: https://sta.sh/09z3557z648

@Hrxn
Contributor

Hrxn commented Sep 29, 2024

Yes, I can corroborate this. The thing is, this issue also happens to me when I just browse DA normally, so I have to reload the page. I think it's because the workaround is done in a way that mimics normal browsing and does not go through the API, so you might have to download at non-peak hours of the site to mitigate the literature being only 2 paragraphs.

I mean, if this even happens for you on DA in the browser..

The site has just been a broken mess for a year or so.

@mikf mikf added this to v1.27.6 Sep 30, 2024
@mikf mikf moved this to DeviantArt Journals/Literature in v1.27.6 Sep 30, 2024
@mikf mikf closed this as completed by moving to DeviantArt Journals/Literature in v1.27.6 Sep 30, 2024
@mikf mikf reopened this Sep 30, 2024
mikf added a commit that referenced this issue Oct 1, 2024
fetch text from HTML __INITIAL_STATE__,
since the API doesn't reliably work and is unusable for sta.sh journals
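
A minimal sketch of what pulling data out of an embedded `__INITIAL_STATE__` can look like. It assumes the page contains a script of the form `window.__INITIAL_STATE__ = JSON.parse("...")`, which is an assumption about the markup and not something stated in this thread, and it is not the code from the commit. `extract_initial_state` is a hypothetical name. Real JavaScript string literals may contain escapes that JSON rejects (for example `\'`); this sketch does not handle those.

```python
import json

# Assumed marker; the real page markup may differ.
MARKER = "window.__INITIAL_STATE__ = JSON.parse("


def extract_initial_state(page):
    """Return the decoded state object embedded in page, or None."""
    start = page.find(MARKER)
    if start < 0:
        return None
    literal_start = start + len(MARKER)
    # raw_decode() parses exactly one JSON value starting at the given
    # index, here the quoted string literal passed to JSON.parse().
    inner, _end = json.JSONDecoder().raw_decode(page, literal_start)
    # That string is itself JSON text, so it is decoded a second time.
    return json.loads(inner) if isinstance(inner, str) else inner
```

Letting `raw_decode()` find the end of the string literal avoids searching for a closing `");`, which could also occur inside the story text.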
@mikf mikf closed this as completed Oct 2, 2024
@geoffk777
Copy link

I downloaded the nightly Windows build, but this problem is still not completely fixed on DA. Some HTML literature files in galleries download completely, but some still download as incomplete 4k files. This happens in the same gallery, and there doesn't seem to be any pattern. There are also a lot of errors about:
[deviantart][warning] 918377197: Failed to extract journal HTML from webpage. Falling back to INITIAL_STATE markup.
[deviantart][warning] 918377197: Unsupported 'tiptap' markup.
Again this happens to some files but not others. Also, this error occurs on completely downloaded files, but not on some incomplete ones.
In general, this still isn't completely right.

@mikf
Owner

mikf commented Oct 3, 2024

918377197

"This deviation has been labeled as containing themes not suitable for all deviants."

Guess I really do have to process tiptap markup.

In general, this still isn't completely right.

This is a workaround and wouldn't be necessary at all if DA didn't break its website yet again.

@geoffk777

"This is a workaround and wouldn't be necessary at all if DA didn't break its website yet again."

I totally sympathize. Gallery-dl is a great tool and I totally appreciate the work that you're putting into it. Sincerely, thanks!

And yes, DA sux donkey balls.

@mikf
Owner

mikf commented Oct 7, 2024

Generating HTML from tiptap markup is now supported (a9671f1), so even "mature" journals/literature can now be downloaded again.

The generated HTML is not 100% accurate (some whitespace is somehow different, maybe a \n and \r\n mismatch; deviation embeds don't have all metadata entries), but text and layout should match DA's HTML.
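
To give a rough idea of what "generating HTML from tiptap markup" involves: tiptap documents are nested JSON nodes, each with a `type`, optional `content` children, and, for text nodes, optional `marks`. The sketch below converts a tiny subset of that structure. It is an illustration under those assumptions, not the code from a9671f1; `tiptap_to_html` and `MARK_TAGS` are hypothetical names, and the tag mapping is my own choice.

```python
from html import escape

# Assumed mapping from mark types to HTML tags; illustrative only.
MARK_TAGS = {"bold": "strong", "italic": "em", "underline": "u"}


def tiptap_to_html(node):
    """Convert a small subset of tiptap-style JSON to an HTML string."""
    node_type = node.get("type")
    if node_type == "text":
        out = escape(node.get("text", ""))
        # Wrap the text once for every mark that has a known tag.
        for mark in node.get("marks", ()):
            tag = MARK_TAGS.get(mark.get("type"))
            if tag:
                out = f"<{tag}>{out}</{tag}>"
        return out
    children = "".join(tiptap_to_html(c) for c in node.get("content", ()))
    if node_type == "paragraph":
        return f"<p>{children}</p>"
    # Unknown node types (and the root "doc") just pass their children on.
    return children


doc = {
    "type": "doc",
    "content": [
        {"type": "paragraph", "content": [
            {"type": "text", "text": "Hi & "},
            {"type": "text", "text": "bye", "marks": [{"type": "bold"}]},
        ]},
    ],
}
print(tiptap_to_html(doc))
```

Embeds, links, headings and the site's CSS classes are left out here, so a real converter needs considerably more cases.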

@geoffk777

Thanks for the fix. However, when I tried to run it, I immediately got an error:
PS C:\Gallery-dl> .\gallery-dl.exe -d "d:\Emule" --verbose "https://www.deviantart.com/Springbokkx/gallery/all"
[gallery-dl][debug] Version 1.27.6-dev:2024.10.07 - Executable (dev/windows)
[gallery-dl][debug] Python 3.12.6 - Windows-10-10.0.17763-SP0
[gallery-dl][debug] requests 2.32.3 - urllib3 2.2.3
[gallery-dl][debug] Configuration Files ['%USERPROFILE%\gallery-dl.conf']
[gallery-dl][debug] Starting DownloadJob for 'https://www.deviantart.com/Springbokkx/gallery/all'
[deviantart][debug] Using DeviantartGalleryExtractor for 'https://www.deviantart.com/Springbokkx/gallery/all'
[deviantart][debug] Using custom API credentials (client-id 18783)
[deviantart][debug] Sleeping 2.00 seconds (api)
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.deviantart.com:443
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/user/profile/springbokkx HTTP/11" 200 1321
[deviantart][debug] Sleeping 2.00 seconds (api)
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/gallery/all?username=Springbokkx&offset=0&limit=24&mature_content=true HTTP/11" 200 None
[deviantart][debug] Switching to private access token
[deviantart][debug] Sleeping 2.00 seconds (api)
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/gallery/all?username=Springbokkx&offset=0&limit=24&mature_content=true HTTP/11" 200 None
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /springbokkx/art/Penny-A-Terrifying-Life-of-Luxury-Force-Feeding-1104345126 HTTP/11" 200 None
[deviantart][warning] 1104345126: Failed to extract journal HTML from webpage. Falling back to INITIAL_STATE markup.
[deviantart][warning] Unsupported content type 'horizontalRule'
[deviantart][error] An unexpected error occurred: KeyError - 'content'. Please run gallery-dl again with the --verbose flag, copy its output and report this issue on https://github.com/mikf/gallery-dl/issues .
[deviantart][debug]
Traceback (most recent call last):
File "gallery_dl\job.py", line 151, in run
File "gallery_dl\extractor\deviantart.py", line 180, in items
File "gallery_dl\extractor\deviantart.py", line 391, in _extract_journal
File "gallery_dl\extractor\deviantart.py", line 408, in _textcontent_to_html
File "gallery_dl\extractor\deviantart.py", line 420, in _tiptap_to_html
File "gallery_dl\extractor\deviantart.py", line 438, in _tiptap_process_content
KeyError: 'content'
PS C:\Gallery-dl>
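
One plausible reading of this traceback (a guess, not a confirmed diagnosis): leaf nodes such as `horizontalRule` carry no `content` key, so code that indexes `node["content"]` unconditionally raises `KeyError: 'content'`. The sketch below reproduces that failure mode on made-up data and shows a tolerant variant; both function names are hypothetical and neither is gallery-dl's code.

```python
def collect_text_strict(node):
    """Walk the tree assuming every non-text node has a 'content' list."""
    if node["type"] == "text":
        return node["text"]
    return "".join(collect_text_strict(c) for c in node["content"])


def collect_text_tolerant(node):
    """Same walk, but treat a missing 'content' key as 'no children'."""
    if node.get("type") == "text":
        return node.get("text", "")
    return "".join(collect_text_tolerant(c) for c in node.get("content", ()))


doc = {
    "type": "doc",
    "content": [
        {"type": "paragraph", "content": [{"type": "text", "text": "a"}]},
        {"type": "horizontalRule"},  # leaf node without 'content'
        {"type": "paragraph", "content": [{"type": "text", "text": "b"}]},
    ],
}

try:
    collect_text_strict(doc)
except KeyError as exc:
    print("strict walk failed with KeyError:", exc)

print(collect_text_tolerant(doc))
```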

mikf added a commit that referenced this issue Oct 9, 2024
- support literature link embeds
- support @ mentions
- support more text styles
@mikf
Owner

mikf commented Oct 9, 2024

@geoffk777 fixed in cfb7b3d.

All literature of https://www.deviantart.com/Springbokkx is now downloadable without errors.

@left1000
Author

left1000 commented Oct 9, 2024

Is there a way to download an exe that uses commit cfb7b3d or do I have to wait a week for the next release? I swear there used to be a way to download builds from github via some obscure link to click on that I can no longer remember or find in the UI (I don't have the tools to build myself at this time.)

@mikf
Owner

mikf commented Oct 9, 2024

@left1000 https://github.com/gdl-org/builds/releases

If you are already using an exe with version 1.27.0 or higher, you can use

gallery-dl --update-to dev

@geoffk777

@mikf Thanks!! I downloaded a number of different DA literature galleries and confirmed that the current build seems to fix all of the problems. So this issue can finally be closed. Until DA screws it up again....

@left1000
Author

remind me please what is the flag to ignore archive.sqlite3 and recheck all files and not redownload files that already exist?

@left1000
Author

left1000 commented Oct 10, 2024

actually this fix not only fixes the past 55 or so days it was totally broken but downloads a superior .htm file to the ones it's been downloading for years (has more formatting data or some such? looks a bit better, isn't taking up 70% of the screen weirdly)

so actually uh what's the flag to uh download all .htm files regardless of if they're repeats?

--filter "extension in ('htm','html')" --skip=false ? or is it something else that would do that?

or is it --skip=none or is none the same as false?

@mikf
Owner

mikf commented Oct 10, 2024

remind me please what is the flag to ignore archive.sqlite3 and recheck all files and not redownload files that already exist?

#6207 (comment)

but downloads a superior .htm file to the ones it's been downloading for years (has more formatting data or some such?

It now includes DA's current .css to make literature embeds work.

<link rel="stylesheet" href="https://static.parastorage.com/services\
/da-deviation/2bfd1ff7a9d6bf10d27b98dd8504c0399c3f9974a015785114b7dc6b\
/app.min.css"/>

@left1000
Author

left1000 commented Oct 10, 2024

In general

--filter "extension == 'htm'" --no-skip -o original=false

but it might be better to directly use a user's /posts URL (or -o include=journal) where you'd only need --no-skip.

how is that different from --filter "extension in ('htm')" --skip=false I had thought --skip=false was similar to --no-skip but maybe they're totally unrelated?

edit: my command is going to take a lifetime though, it seems to be attempting to download every file before realizing it only wants .htm files.... but maybe there is no way to ask if the target file is a .htm file without the same api load as a full download?
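
A side note on the two filter spellings, assuming the filter is evaluated as an ordinary Python expression: `('htm')` is not a tuple but just the string `'htm'`, so `extension in ('htm')` is a substring test. It still matches `htm`, but it would also match `h` or `tm`, and it does not match `html`. The check below is plain Python and independent of gallery-dl; `matches` is a hypothetical helper.

```python
def matches(expression, extension):
    """Evaluate a filter-style expression for a single extension value."""
    return bool(eval(expression, {"__builtins__": {}}, {"extension": extension}))


# Parentheses alone do not make a tuple, so this is a substring test.
print(matches("extension in ('htm')", "htm"))
print(matches("extension in ('htm')", "tm"))
print(matches("extension in ('htm')", "html"))

# A trailing comma or a second element makes a real tuple.
print(matches("extension in ('htm',)", "tm"))
print(matches("extension in ('htm', 'html')", "html"))
```

For files that really have the extension `htm`, both spellings select the same files, so any difference in behavior between the two commands would have to come from the other options.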


@mikf
Owner

mikf commented Oct 10, 2024

but maybe there is no way to ask if the target file is a .htm file without the same api load as a full download?

-o original=false

@left1000
Author

Ahh, -o original=false speeds things up a lot, BUT it also results in not overwriting .htm files I already had (which I now want to do to gain the .css). It's still a good idea, though: I can run with original=false first and then run without it, AFK, for a week.
