@richardye101
Contributor

@richardye101 richardye101 commented Mar 7, 2025

Currently, markitdown parses pptx shapes in Z-order, i.e. the order in which shapes are stacked on top of each other, from back to front. That order often differs from the order in which a reader would go through the slide.

Other repos that parse pptx to markdown read the shapes in normal reading order (top to bottom, then left to right), for example https://github.com/ssine/pptx2md/blob/39bef65b312035baeade932aad8d221e37daae5f/pptx2md/parser.py#L249.

There are also Stack Overflow posts that explain how to implement this: https://stackoverflow.com/questions/51999656/how-to-extract-text-from-powerpoint-text-boxes-in-their-order-within-the-presen

I've simply copied over what @ssine has created in his repo, as it's the cleanest implementation.
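
For illustration, the reading-order sort can be sketched roughly as below. This is a minimal sketch, not the exact diff: the `Shape` dataclass and the `reading_order` helper are hypothetical stand-ins, and it assumes every shape exposes integer `top` and `left` attributes, as python-pptx shapes normally do.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Shape:
    """Hypothetical stand-in for a python-pptx shape (only the fields used here)."""
    name: str
    top: int
    left: int


def reading_order(shapes):
    """Return shapes sorted top-to-bottom, then left-to-right.

    attrgetter("top", "left") yields a (top, left) tuple per shape,
    so shapes at the same height are ordered by their left edge.
    Assumes neither attribute is None.
    """
    return sorted(shapes, key=attrgetter("top", "left"))


# Shapes listed in Z-order (back to front), as they might be stored in a slide.
z_order = [
    Shape("footer", top=900, left=0),
    Shape("right column", top=300, left=500),
    Shape("title", top=0, left=0),
    Shape("left column", top=300, left=0),
]

print([s.name for s in reading_order(z_order)])
```

Sorting on the `(top, left)` pair keeps the change to a single `sorted` call, at the cost of treating shapes whose tops differ by even one unit as being on different rows.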

@richardye101
Contributor Author

@microsoft-github-policy-service agree

@afourney
Member

afourney commented Mar 7, 2025

It appears that attrgetter is not imported.

@afourney
Member

afourney commented Mar 7, 2025

Thanks! Nice and simple fix.

@afourney afourney merged commit 0229ff6 into microsoft:main Mar 7, 2025
3 checks passed
@richardye101 richardye101 deleted the patch-2 branch March 8, 2025 00:49