html serializer for table is too slow #279

Open
@ta0a2000t

Description

Bug

HTMLTableSerializer reads the table inefficiently: inside its row/column loop it indexes into the grid for every cell.

                    cell: TableCell = item.data.grid[i][j]

However, the markdown serializer accesses cells differently, by iterating over the grid directly:

            rows = [
                [
                    # make sure that md tables are not broken
                    # due to newline chars in the text
                    col.text.replace("\n", " ")
                    for col in row
                ]
                for row in item.data.grid
            ]

Conclusion:

grid seems to be a 2D linked list, so each grid[i][j] lookup is not constant-time and the bracket access in the nested loop is O(n**2) overall.
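
The difference between the two access patterns can be sketched as follows. This is a minimal illustration, not the docling-core code: `Cell` and `FakeTableData` are hypothetical stand-ins, and the assumption that every access to `grid` is expensive (modeled here as a full rebuild) is only there to make the cost visible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    text: str


@dataclass
class FakeTableData:
    """Hypothetical stand-in for the table data object (not the real class)."""

    num_rows: int
    num_cols: int
    builds: int = 0  # counts how many times the grid was materialized

    @property
    def grid(self) -> List[List[Cell]]:
        # Assumption for illustration: each access to `grid` is costly.
        self.builds += 1
        return [
            [Cell(f"r{i}c{j}") for j in range(self.num_cols)]
            for i in range(self.num_rows)
        ]


def cells_by_index(data: FakeTableData) -> List[str]:
    """Pattern from the HTML serializer: touch `data.grid` once per cell."""
    out = []
    for i in range(data.num_rows):
        for j in range(data.num_cols):
            out.append(data.grid[i][j].text)
    return out


def cells_by_iteration(data: FakeTableData) -> List[str]:
    """Pattern from the markdown serializer: fetch the grid once, then iterate."""
    grid = data.grid
    return [cell.text for row in grid for cell in row]


slow = FakeTableData(num_rows=3, num_cols=4)
fast = FakeTableData(num_rows=3, num_cols=4)
slow_cells = cells_by_index(slow)
fast_cells = cells_by_iteration(fast)
```

Both functions return the same cell texts, but under this assumption the indexed version pays the access cost once per cell while the iterating version pays it once per table. If that matches the real behavior, fetching `item.data.grid` once before the loop in the HTML serializer would be a small change.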

...

Steps to reproduce

document.export_to_html() # slow; does not finish for xlsx documents with large tables

vs

document.export_to_markdown() # finishes in acceptable time

...

Docling version

Docling version: 2.31.0
Docling Core version: 2.28.1
Docling IBM Models version: 3.4.2
Docling Parse version: 4.0.1
Python: cpython-312 (3.12.0)

...

Python version

Python 3.12.0

...
