Deleted page not "really" deleted #141

@packdat

While working on incremental updates (see #112) and adding support for deleted objects, I encountered a behavior that may or may not be intended.

When I delete a page from a document, it gets removed from the pages-array as expected.
However, when checking the output file, I observed that only the page reference was removed from the pages-array; the page itself and all objects it references (e.g. content streams) are still present in the file.

If I understand correctly, the method PdfCrossReferenceTable.Compact() is intended to clean up these objects. Is that correct?
At least it would remove the page and the objects referenced by that page if the pages-array were the only place where the page is referenced.
But a page can be referenced from multiple locations. Some places that come to mind:

  • Outlines
  • Named Destinations
  • Link-Annotations (and Annotations in general, /P entry)
  • GoTo Actions

In my case, the page that was not deleted was referenced by (at least) 3 different outlines.
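To illustrate why such a page survives, here is a minimal sketch of a reachability check on a toy object graph. It assumes that cleanup keeps every object still reachable from the catalog, which is only my reading of what Compact() does. The object IDs, the graph layout and the names ReachabilitySketch, Reachable and BuildSample are all hypothetical and not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: object IDs mapped to the IDs they reference.
// None of these types are PDFsharp APIs.
public static class ReachabilitySketch
{
    // Returns every object ID reachable from 'root' by following references.
    public static HashSet<int> Reachable(IReadOnlyDictionary<int, int[]> graph, int root)
    {
        var seen = new HashSet<int>();
        var stack = new Stack<int>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            int id = stack.Pop();
            if (!seen.Add(id)) continue;                        // already visited
            if (!graph.TryGetValue(id, out var refs)) continue; // dangling reference
            foreach (int next in refs) stack.Push(next);
        }
        return seen;
    }

    // Hypothetical document:
    // 1 = catalog, 2 = pages node, 3 = outline root,
    // 4 = page A, 5 = content stream of page A,
    // 6 = page B, 7 = outline item pointing at page A.
    // (The /Parent back-reference of the pages is left out for brevity.)
    public static Dictionary<int, int[]> BuildSample(bool pageAInKids) => new()
    {
        [1] = new[] { 2, 3 },
        [2] = pageAInKids ? new[] { 4, 6 } : new[] { 6 },
        [3] = new[] { 7 },
        [4] = new[] { 5 },
        [5] = Array.Empty<int>(),
        [6] = Array.Empty<int>(),
        [7] = new[] { 4 },
    };
}
```

Calling ReachabilitySketch.Reachable(ReachabilitySketch.BuildSample(pageAInKids: false), 1) still returns objects 4 and 5, because the outline item keeps page A alive even though it is no longer in the pages-array. Only after the reference held by object 7 is removed do the page and its content stream become unreachable.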

Simple test-case (add it to PdfSharp.Tests.IO.WriterTests):

[Fact]
public void Deleted_Page_Not_Really_Deleted()
{
    // Work on a copy so the original asset stays untouched.
    var sourceFile = IOUtility.GetAssetsPath("archives/grammar-by-example/GBE/ReferencePDFs/WPF 1.31/Table-Layout.pdf")!;
    var targetFile = Path.Combine(Path.GetTempPath(), "AA-Original.pdf");
    File.Copy(sourceFile, targetFile, true);

    // Open the copy for modification and remove the first page.
    using var fs = File.Open(targetFile, FileMode.Open, FileAccess.Read);
    using var doc = PdfReader.Open(fs, PdfDocumentOpenMode.Modify);
    doc.Pages.RemoveAt(0);

    // Save the result under a different name for manual inspection.
    targetFile = Path.Combine(Path.GetTempPath(), "AA-Deleted.pdf");
    doc.Save(targetFile);
}

Open the file AA-Deleted.pdf and observe that the page and its contents are still present.

Questions:
Is this the intended behavior?
Are there other clean-up methods I'm not aware of?

IMHO the methods to remove pages are "high level" methods and the library should take care of the "low level" stuff, including cleaning up after itself to maintain the integrity of the document.

I do understand, however, that this might not be an easy issue to solve.
In theory, the library would have to scan the whole document to find references to deleted pages and then decide, based on the context in which each reference is found, how to deal with it:

  • delete Outlines and re-link the remaining ones
  • delete Annotations
  • etc...
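As a rough sketch of that context-dependent decision, the following toy example maps each object that still points at a deleted page to a clean-up action. Everything in it is hypothetical: HolderKind, Holder and CleanupPlanSketch are illustration-only names rather than PDFsharp types, and the actions for named destinations and GoTo actions are my own assumptions, since only outlines and annotations are listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical kinds of objects that can hold a reference to a page.
public enum HolderKind { OutlineItem, Annotation, NamedDestination, GoToAction }

// Hypothetical record: an object ID, its kind, and the page it points at.
public sealed record Holder(int Id, HolderKind Kind, int TargetPageId);

public static class CleanupPlanSketch
{
    // Lists one planned action per object that still references the deleted page.
    public static List<string> Plan(IEnumerable<Holder> holders, int deletedPageId)
    {
        var plan = new List<string>();
        foreach (var h in holders.Where(h => h.TargetPageId == deletedPageId))
        {
            string action = h.Kind switch
            {
                HolderKind.OutlineItem => "delete outline and re-link the remaining ones",
                HolderKind.Annotation => "delete annotation",
                // The next two actions are assumptions, not taken from the text above.
                HolderKind.NamedDestination => "delete named destination",
                HolderKind.GoToAction => "delete or retarget action",
                _ => "leave untouched",
            };
            plan.Add($"{h.Id}: {action}");
        }
        return plan;
    }
}
```

A real implementation would first have to discover these holders by walking the document, which is the expensive part. The sketch only shows that the decision itself is a simple dispatch on where the reference was found.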
