Open
Description
Bug
HTMLTableSerializer reads the table inefficiently, indexing the grid once per cell:
cell: TableCell = item.data.grid[i][j]
However, the Markdown serializer accesses the cells differently, iterating over the grid rows directly:
rows = [
    [
        # make sure that md tables are not broken
        # due to newline chars in the text
        col.text.replace("\n", " ")
        for col in row
    ]
    for row in item.data.grid
]
Conclusion:
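Accessing item.data.grid appears to be expensive on every use (my guess is that grid is not a plain list of lists, e.g. a 2D linked list or something rebuilt per access), so bracket access inside the nested loop ends up roughly O(n**2) in the number of cells.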
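A minimal sketch of the suspected pattern, not docling's actual code. `FakeTableData` is a hypothetical stand-in whose `grid` property rebuilds the whole grid on every access; that behavior is an assumption about why bracket access is slow. The `rebuilds` counter makes the difference between per-cell access and fetching the grid once visible.

```python
from typing import List


class FakeTableData:
    """Hypothetical stand-in for a table whose grid is rebuilt on each access."""

    def __init__(self, num_rows: int, num_cols: int) -> None:
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.rebuilds = 0  # counts how often the grid was materialized

    @property
    def grid(self) -> List[List[str]]:
        # Assumption: every access pays the full cost of building the grid.
        self.rebuilds += 1
        return [
            [f"r{i}c{j}" for j in range(self.num_cols)]
            for i in range(self.num_rows)
        ]


def read_per_cell(data: FakeTableData) -> List[str]:
    # Mirrors `item.data.grid[i][j]` inside a nested loop.
    return [
        data.grid[i][j]
        for i in range(data.num_rows)
        for j in range(data.num_cols)
    ]


def read_cached(data: FakeTableData) -> List[str]:
    # Fetch the grid once, then iterate rows and columns directly.
    grid = data.grid
    return [cell for row in grid for cell in row]
```

For a 3x4 table, `read_per_cell` materializes the grid 12 times while `read_cached` does it once. If the real grid behaves like this stand-in, hoisting the grid lookup out of the loop in the HTML serializer may be enough to close the gap.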
...
Steps to reproduce
document.export_to_html() # slow, cannot finish with large tables in an XLSX document
vs
document.export_to_markdown() # time is okay
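To put numbers on the comparison, a small timing helper along these lines could be used. `timed` is a hypothetical name introduced here, and the commented usage assumes a `document` object already obtained from a conversion.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Usage against a converted document (assumes `document` already exists):
# _, html_s = timed(document.export_to_html)
# _, md_s = timed(document.export_to_markdown)
# print(f"html: {html_s:.2f}s, markdown: {md_s:.2f}s")
```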
...
Docling version
Docling version: 2.31.0
Docling Core version: 2.28.1
Docling IBM Models version: 3.4.2
Docling Parse version: 4.0.1
Python: cpython-312 (3.12.0)
...
Python version
Python 3.12.0
...