Skip to content

BL 15848 DeDuplicateMediaFiles#7814

Open
nabalone wants to merge 2 commits intomasterfrom
BL-15848-DeDuplicateMediaFiles
Open

BL 15848 DeDuplicateMediaFiles#7814
nabalone wants to merge 2 commits intomasterfrom
BL-15848-DeDuplicateMediaFiles

Conversation

@nabalone
Copy link
Copy Markdown
Contributor

@nabalone nabalone commented Apr 6, 2026

  • WIP (BL-15848)
  • BL-15848 Deduplicate Bloom App media files

This change is Reviewable


Open with Devin

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 3 files

@nabalone nabalone force-pushed the BL-15848-DeDuplicateMediaFiles branch from 4fd5416 to 2c6b785 Compare April 8, 2026 13:11
Comment on lines +173 to +175
// BL-15848 Deduplication to save space. After publish-time media transforms so we compare the actual bloompub payload,
// not source files that may still be resized, trimmed, or rewritten later in the pipeline.
PublishHelper.DeDuplicateMediaFiles(modifiedBook.RawDom, modifiedBook.FolderPath);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 DeDuplicateMediaFiles only in BloomPub path, not ePub

DeDuplicateMediaFiles is called in CreateBloomPub (BloomPubMaker.cs:175), but the ePub path (EpubMaker.cs) only gets DeDuplicateGifs via MakeDeviceXmatterTempBook. This means ePub publishing doesn't benefit from full media deduplication (images, videos, non-talking audio). This may be intentional — ePub has different packaging requirements — but it's worth confirming whether ePub output could also benefit from this optimization.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The card did not include epubs

DeDuplicateReferencedMedia(GetImageMediaReferences(dom), folderPath);
DeDuplicateReferencedMedia(GetVideoMediaReferences(dom), folderPath);
// Narration files are tied to specific spans, so keep those one-to-one file names stable.
var talkingBookAudioFileNames = GetTalkingBookAudioFileNames(dom);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot made this talkingBookAudioFileNames blacklist and it's generally redundant, since in GetNonTalkingAudioMediaReferences() we don't pick up TBT recordings anyway. But I guess it guards against the case where someone records TBT audio and then selects that same audio file (from within the book folder) as music within the same book multiple times. But that sounds to me like a diabolical corner case... should I remove this blacklist?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that the blacklist hurts anything.

}

foreach (
var elementWithCorrectSound in dom.SafeSelectNodes(".//*[@data-correct-sound]")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 6.4 reuses correct-sound and wrong-sound files but 6.3 doesn't?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6.3 also shared the correct and wrong sound files from the built-in list. Users can choose their own sound files, and those aren't shared. This behavior has not changed in 6.4.

@nabalone nabalone force-pushed the BL-15848-DeDuplicateMediaFiles branch from 2c6b785 to 473e75a Compare April 8, 2026 14:18
Copy link
Copy Markdown
Contributor

@StephenMcConnel StephenMcConnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks okay. One real fix and one suggested syntax change. Adding the deduplication to epubs probably wouldn't be too hard, would it? but would it be worth it? We could always wait until someone asks for it specifically.

@StephenMcConnel reviewed 2 files and all commit messages, made 5 comments, and resolved 2 discussions.
Reviewable status: 2 of 3 files reviewed, 3 unresolved discussions (waiting on nabalone).


src/BloomExe/Publish/PublishHelper.cs line 1226 at r2 (raw file):

                {
                    RelativePath = relativePath,
                    RewriteReference = canonicalRelativePath =>

data-book="coverImage" has both the innerText and a src attribute set to the filename. data-book="licenseImage" has only the innerText set. We need to preserve / revise the src attribute for the cover image to keep in sync with the innerText.


src/BloomExe/Publish/PublishHelper.cs line 1372 at r2 (raw file):

        {
            using var stream = RobustFile.OpenRead(filePath);
            return Convert.ToHexString(SHA256.HashData(stream));

Does this syntax work? I assume it must, but I prefer using the braces { } to make the scope of the using more obvious. But you don't necessarily need to change this if it's working.

DeDuplicateReferencedMedia(GetImageMediaReferences(dom), folderPath);
DeDuplicateReferencedMedia(GetVideoMediaReferences(dom), folderPath);
// Narration files are tied to specific spans, so keep those one-to-one file names stable.
var talkingBookAudioFileNames = GetTalkingBookAudioFileNames(dom);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that the blacklist hurts anything.

}

foreach (
var elementWithCorrectSound in dom.SafeSelectNodes(".//*[@data-correct-sound]")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6.3 also shared the correct and wrong sound files from the built-in list. Users can choose their own sound files, and those aren't shared. This behavior has not changed in 6.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants