deviantart literature items are no longer being downloaded, the html ends up nearly empty #6207

left1000 · 2024-09-18T22:45:23Z

There's a tiny bit of HTML for the page layout, but the entire content of the literature post is missing. It worked for years and broke sometime in the past 1-2 months. I kept running rip-updates without noticing, since the files weren't totally blank, so I'm not sure when or what broke it.

Anyone know what I'm talking about?

mikf · 2024-09-19T08:47:25Z

Duplicate of #6196

left1000 · 2024-09-20T12:37:21Z

This is not quite a duplicate. I meant literature items in the gallery, not statuses or journals. See https://www.deviantart.com/tag/literature for examples: if you open something random from that search result, you get something like this (NSFW warning): https://www.deviantart.com/milflover5335/gallery/93420151/stories-literature

They're gallery items, not journals or posts. But yeah, they're likely broken in the same manner, I guess? Still, my issue is a slight expansion of the other comments.

left1000 · 2024-09-20T12:39:01Z

Also, it should totally be fixable: gallery literature items can be downloaded manually in Chrome by using "Save as" on the entire page and saving it as HTML. Maybe posts and journals can't be fixed because they're not inside the gallery at all, but literature deviations are?

mikf · 2024-09-21T07:00:20Z

I didn't realize DA has "literature items", which are somehow not the same as journals. Nonetheless, they do internally get processed the same way as journals do, meaning they use the same API endpoint to get their full text content, which is currently broken.

"also it should totally be fixable"

I'll probably find some workaround, but this wouldn't be necessary in the first place if DA wouldn't break their site ...

AtomicTEM · 2024-09-23T22:21:11Z

When this gets fixed, would it be possible to add an option to overwrite journals, statuses and literature that were empty when they were downloaded?

(#6196, #6207, #5916)

left1000 · 2024-09-27T11:49:26Z

Testing for emptiness is probably just a waste of effort. You could just delete all of them from the past few months and/or overwrite all of them; they're small files, being just text HTML files.

AtomicTEM · 2024-09-27T23:24:29Z

Okay, so how do I re-scrape an account and have it skip all submissions that aren't text/HTML?

mikf · 2024-09-28T09:54:59Z

In general

--filter "extension == 'htm'" --no-skip -o original=false

but it might be better to directly use a user's /posts URL (or -o include=journal) where you'd only need --no-skip.
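For anyone scripting this, here is a minimal sketch that assembles the two variants described above as argument lists. The helper name `build_refetch_command` and the placeholder URL are hypothetical; only the flags quoted in this comment are used, and `gallery-dl` is assumed to be on the PATH.

```python
import shlex


def build_refetch_command(url, html_only=True):
    """Assemble a gallery-dl argument list for re-fetching text posts.

    Hypothetical helper: it only combines the flags quoted in this thread.
    """
    cmd = ["gallery-dl", "--no-skip"]
    if html_only:
        # For a full gallery URL: keep only .htm files and set original=false.
        cmd += ["--filter", "extension == 'htm'", "-o", "original=false"]
    cmd.append(url)
    return cmd


# Placeholder URL; substitute a real gallery URL.
command = build_refetch_command("https://www.deviantart.com/USER/gallery/all")
print(shlex.join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

With a user's /posts URL, `build_refetch_command(url, html_only=False)` yields just `gallery-dl --no-skip <url>`, matching the simpler form suggested above.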

a-washing-machine · 2024-09-28T12:36:10Z

For informational purposes to anyone having to delete broken html-downloads -

I don't know when the site-change took place exactly, but it must've been after August 20th, as I still have proper journal downloads for that day.

If anybody has a later date for reference, feel free to post it here for the benefit of those having to delete broken html files for re-download.
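A minimal sketch for locating candidate files before deleting anything, assuming the file modification time reflects the download time (this may not hold if your configuration sets mtimes from post dates). The function name, the cutoff date, and the optional size limit are illustrative assumptions, not part of gallery-dl.

```python
from datetime import datetime
from pathlib import Path


def find_suspect_html(root, cutoff, max_bytes=None):
    """Yield .htm/.html files under root modified at or after cutoff.

    If max_bytes is given, only files of at most that size are yielded.
    """
    cutoff_ts = cutoff.timestamp()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in (".htm", ".html"):
            continue
        stat = path.stat()
        if stat.st_mtime < cutoff_ts:
            continue  # downloaded before the assumed breakage date
        if max_bytes is not None and stat.st_size > max_bytes:
            continue  # too large to be a nearly empty page
        yield path


# Example (prints only, deletes nothing); the directory name is a placeholder:
# for path in find_suspect_html("gallery-dl/deviantart", datetime(2024, 8, 21)):
#     print(path)
```

Listing the files first and reviewing them is the safer choice here, since short journals are legitimately small and a size limit alone cannot tell them apart from broken downloads.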

left1000 · 2024-09-29T01:59:34Z

This issue is not fixed. It might be fixed for posts, but NOT for literature deviations. The fix has changed the error, though, from a blank story to something like 2 paragraphs out of the 12 that the literature deviation might have. I cannot find any examples of lengthy literature deviations that work with this new fix.

In essence, this fix took me from grabbing 1 kilobyte .htm files to grabbing 4 kilobyte .htm files, when what was needed was a 24 kilobyte .htm file, if that makes any sense.

The newest release might have fixed the issue that this issue is a duplicate of (journal posts), but it didn't fix literature deviation posts (possibly because journal posts are typically far shorter?).

AtomicTEM · 2024-09-29T05:10:42Z

"This issue is not fixed. It might be fixed for posts, but NOT for literature deviations. The fix has changed the error, though, from a blank story to something like 2 paragraphs out of the 12 that the literature deviation might have. I cannot find any examples of lengthy literature deviations that work with this new fix.

"In essence, this fix took me from grabbing 1 kilobyte .htm files to grabbing 4 kilobyte .htm files, when what was needed was a 24 kilobyte .htm file, if that makes any sense.

"The newest release might have fixed the issue that this issue is a duplicate of (journal posts), but it didn't fix literature deviation posts (possibly because journal posts are typically far shorter?)."

Yes, I can corroborate this. The thing is, this issue also happens to me when I just browse DA normally, so I have to reload the page. I think it's because the workaround mimics normal browsing and does not go through the API, so you might have to download at non-peak hours of the site to mitigate the literature being only 2 paragraphs.

a-washing-machine · 2024-09-29T13:33:29Z

Literature submissions in stash will also cause an error message and not be downloaded at all.

This may have been overlooked since it isn't possible anymore to make new literature uploads into stash since the Eclipse update, but that doesn't mean those don't still exist, example here: https://sta.sh/09z3557z648

Hrxn · 2024-09-29T17:21:15Z

"Yes, I can corroborate this. The thing is, this issue also happens to me when I just browse DA normally, so I have to reload the page. I think it's because the workaround mimics normal browsing and does not go through the API, so you might have to download at non-peak hours of the site to mitigate the literature being only 2 paragraphs."

I mean, if this even happens for you on DA in the browser..

The site has just been a broken mess for a year or so.

geoffk777 · 2024-10-03T13:37:26Z

I downloaded the nightly Windows build, but this problem is still not completely fixed on DA. Some HTML literature files in galleries download completely, but some still download as incomplete 4k files. This happens in the same gallery, and there doesn't seem to be any pattern. There are also a lot of warnings like:
[deviantart][warning] 918377197: Failed to extract journal HTML from webpage. Falling back to INITIAL_STATE markup.
[deviantart][warning] 918377197: Unsupported 'tiptap' markup.
Again, this happens to some files but not others. Also, this warning occurs on completely downloaded files, but not on some incomplete ones.
In general, this still isn't completely right.

mikf · 2024-10-03T14:56:28Z

918377197

"This deviation has been labeled as containing themes not suitable for all deviants."

Guess I really do have to process tiptap markup.

"In general, this still isn't completely right."

This is a workaround and wouldn't be necessary at all if DA didn't break its website yet again.

geoffk777 · 2024-10-03T17:23:11Z

"This is a workaround and wouldn't be necessary at all if DA didn't break its website yet again."

I totally sympathize. Gallery-dl is a great tool and I totally appreciate the work that you're putting into it. Sincerely, thanks!

And yes, DA sux donkey balls.

mikf · 2024-10-07T06:29:10Z

Generating HTML from tiptap markup is now supported (a9671f1), so even "mature" journals/literature can now be downloaded again.

The generated HTML is not 100% accurate (some whitespace is somehow different, maybe \n and \n\r mismatch; deviation embeds don't have all metadata entries), but text and layout should match DA's HTML.
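To illustrate what such a conversion involves, here is a minimal sketch of rendering tiptap (ProseMirror-style) JSON to HTML. It is not the code from a9671f1: the tag mappings are an assumed, illustrative subset, and DeviantArt's markup may use node types that are not covered here. Note that leaf nodes such as `horizontalRule` carry no `content` key, so children are read with a default.

```python
from html import escape

# Assumed mapping from tiptap mark types to HTML tags (illustrative subset).
MARK_TAGS = {"bold": "strong", "italic": "em", "underline": "u"}
# Assumed mapping from block node types to HTML tags (illustrative subset).
BLOCK_TAGS = {"paragraph": "p", "blockquote": "blockquote"}


def render_node(node):
    """Render one tiptap JSON node (and its children) to an HTML string."""
    kind = node.get("type")
    if kind == "text":
        html = escape(node.get("text", ""))
        for mark in node.get("marks", ()):
            tag = MARK_TAGS.get(mark.get("type"))
            if tag:
                html = f"<{tag}>{html}</{tag}>"
        return html
    if kind == "horizontalRule":
        return "<hr/>"
    if kind == "hardBreak":
        return "<br/>"
    # Nodes may lack a 'content' key, so default to an empty sequence.
    inner = "".join(render_node(child) for child in node.get("content", ()))
    if kind == "heading":
        level = node.get("attrs", {}).get("level", 1)
        return f"<h{level}>{inner}</h{level}>"
    tag = BLOCK_TAGS.get(kind)
    # Unknown node types (including the root 'doc') keep children unwrapped.
    return f"<{tag}>{inner}</{tag}>" if tag else inner
```

Passing unknown node types through unwrapped keeps the text even when the layout cannot be reproduced, which seems preferable to failing on the first unsupported type.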

geoffk777 · 2024-10-07T14:58:47Z

Thanks for the fix. However, when I tried to run it, I immediately got an error:
PS C:\Gallery-dl> .\gallery-dl.exe -d "d:\Emule" --verbose "https://www.deviantart.com/Springbokkx/gallery/all"
[gallery-dl][debug] Version 1.27.6-dev:2024.10.07 - Executable (dev/windows)
[gallery-dl][debug] Python 3.12.6 - Windows-10-10.0.17763-SP0
[gallery-dl][debug] requests 2.32.3 - urllib3 2.2.3
[gallery-dl][debug] Configuration Files ['%USERPROFILE%\gallery-dl.conf']
[gallery-dl][debug] Starting DownloadJob for 'https://www.deviantart.com/Springbokkx/gallery/all'
[deviantart][debug] Using DeviantartGalleryExtractor for 'https://www.deviantart.com/Springbokkx/gallery/all'
[deviantart][debug] Using custom API credentials (client-id 18783)
[deviantart][debug] Sleeping 2.00 seconds (api)
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.deviantart.com:443
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/user/profile/springbokkx HTTP/11" 200 1321
[deviantart][debug] Sleeping 2.00 seconds (api)
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/gallery/all?username=Springbokkx&offset=0&limit=24&mature_content=true HTTP/11" 200 None
[deviantart][debug] Switching to private access token
[deviantart][debug] Sleeping 2.00 seconds (api)
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/gallery/all?username=Springbokkx&offset=0&limit=24&mature_content=true HTTP/11" 200 None
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /springbokkx/art/Penny-A-Terrifying-Life-of-Luxury-Force-Feeding-1104345126 HTTP/11" 200 None
[deviantart][warning] 1104345126: Failed to extract journal HTML from webpage. Falling back to INITIAL_STATE markup.
[deviantart][warning] Unsupported content type 'horizontalRule'
[deviantart][error] An unexpected error occurred: KeyError - 'content'. Please run gallery-dl again with the --verbose flag, copy its output and report this issue on https://github.com/mikf/gallery-dl/issues .
[deviantart][debug]
Traceback (most recent call last):
File "gallery_dl\job.py", line 151, in run
File "gallery_dl\extractor\deviantart.py", line 180, in items
File "gallery_dl\extractor\deviantart.py", line 391, in _extract_journal
File "gallery_dl\extractor\deviantart.py", line 408, in _textcontent_to_html
File "gallery_dl\extractor\deviantart.py", line 420, in _tiptap_to_html
File "gallery_dl\extractor\deviantart.py", line 438, in _tiptap_process_content
KeyError: 'content'
PS C:\Gallery-dl>

mikf · 2024-10-09T19:04:58Z

@geoffk777 fixed in cfb7b3d.

All literature of https://www.deviantart.com/Springbokkx is now downloadable without errors.

left1000 · 2024-10-09T19:35:02Z

Is there a way to download an exe that uses commit cfb7b3d or do I have to wait a week for the next release? I swear there used to be a way to download builds from github via some obscure link to click on that I can no longer remember or find in the UI (I don't have the tools to build myself at this time.)

mikf · 2024-10-09T19:43:38Z

@left1000 https://github.com/gdl-org/builds/releases

If you are already using an exe with version 1.27.0 or higher, you can use

gallery-dl --update-to dev

geoffk777 · 2024-10-09T22:50:45Z

@mikf Thanks!! I downloaded a number of different DA literature galleries and confirmed that the current build seems to fix all of the problems. So this issue can finally be closed. Until DA screws it up again....

left1000 · 2024-10-10T00:44:03Z

remind me please what is the flag to ignore archive.sqlite3 and recheck all files and not redownload files that already exist?

left1000 · 2024-10-10T03:55:55Z

Actually, this fix not only covers the past 55 or so days when it was totally broken, but also downloads a better .htm file than the ones it's been downloading for years (it has more formatting data or some such? It looks a bit better and isn't weirdly taking up 70% of the screen).

So, what's the flag to download all .htm files regardless of whether they're repeats?

--filter "extension in ('htm','html')" --skip=false ? Or is it something else that would do that?

Or is it --skip=none, or is none the same as false?

mikf · 2024-10-10T06:19:28Z

"remind me please what is the flag to ignore archive.sqlite3 and recheck all files and not redownload files that already exist?"

#6207 (comment)

"but downloads a superior .htm file to the ones it's been downloading for years (has more formatting data or some such?"

It now includes DA's current .css to make literature embeds work.

gallery-dl/gallery_dl/extractor/deviantart.py, lines 2040 to 2042 in cfb7b3d:

    <link rel="stylesheet" href="https://static.parastorage.com/services\
    /da-deviation/2bfd1ff7a9d6bf10d27b98dd8504c0399c3f9974a015785114b7dc6b\
    /app.min.css"/>

left1000 · 2024-10-10T08:15:09Z

In general
--filter "extension == 'htm'" --no-skip -o original=false
but it might be better to directly use a user's /posts URL (or -o include=journal) where you'd only need --no-skip.

How is that different from --filter "extension in ('htm')" --skip=false ? I had thought --skip=false was similar to --no-skip, but maybe they're totally unrelated?

Edit: my command is going to take a lifetime, though. It seems to be attempting to download every file before realizing it only wants .htm files. But maybe there is no way to ask if the target file is a .htm file without the same API load as a full download?

mikf · 2024-10-10T08:23:58Z

"but maybe there is no way to ask if the target file is a .htm file without the same api load as a full download?"

-o original=false

left1000 · 2024-10-10T10:15:08Z

Ahh, -o original=false speeds things up enormously, BUT it also results in not overwriting .htm files I already had (which I now want to do to gain the .css). It's still a good idea, though: I can run with original=false first and then run without it AFK for a week.

mikf marked this as a duplicate of #6196 Sep 19, 2024

mikf added duplicate external-issue site:bug labels Sep 19, 2024

mikf added a commit that referenced this issue Sep 27, 2024

[deviantart] work around OAuth API returning empty journal texts

928e170

(#6196, #6207, #5916)

mikf closed this as completed Sep 28, 2024

mikf reopened this Sep 29, 2024

mikf mentioned this issue Sep 30, 2024

[Deviantart] An unexpected error occurred: KeyError - 'deviation'. #6254

Closed

mikf added this to v1.27.6 Sep 30, 2024

mikf moved this to DeviantArt Journals/Literature in v1.27.6 Sep 30, 2024

mikf closed this as completed by moving to DeviantArt Journals/Literature in v1.27.6 Sep 30, 2024

mikf reopened this Sep 30, 2024

mikf added a commit that referenced this issue Oct 1, 2024

[deviantart] fix & improve journal/literature extraction (#6254, #6207)

ed859f0

fetch text from HTML __INITIAL_STATE__, since the API doesn't reliably work and is unusable for sta.sh journals

mikf added a commit that referenced this issue Oct 2, 2024

[deviantart] extract journal HTML from webpage (#6254, #6207, #6196)

7dbd53e

mikf closed this as completed Oct 2, 2024

mikf added a commit that referenced this issue Oct 7, 2024

[deviantart] support converting 'tiptap' markup to HTML (#6207)

a9671f1

mikf added a commit that referenced this issue Oct 9, 2024

[deviantart] improve 'tiptap' conversion (#6207)

cfb7b3d

- support literature link embeds - support @ mentions - support more text styles
