Issues with bullet points in PDFs #81
Comments
Thanks for the feedback! Please always provide an example PDF page for problem reproduction.
Sure, I've attached it. I just created this simple doc using Google Docs to try it out. I have also been using this package for more complex cases, which include parsing different kinds of PDFs of documentation and wiki pages. Those may contain other types of bullet points, so this small set of bullet characters might not be sufficient.
Hey @JorjMcKie, any updates on this?
Yes: there is currently no support for multi-level bullet points, and this will not be implemented any time soon either. The more basic issue (of unrecognized bullets) is still under investigation. I am currently out of town, so bear with me for at least another week or maybe two.
Fixed in v0.0.17.
Hello there,
First of all, thanks for this amazing library 🙌
I am facing some issues with the bullet points in the generated markdown. I have tried several different kinds of bullet points to test if the markdown contains the bullet points and indented bullet points.
For example, I just created a Document file with some bullet points, which you can see below
I then exported this doc as a PDF and ran `to_markdown` on it, and got this as the output.
There are a few observations that I have made looking at this output:
`\n` but not `\t`
`-` before the text, and looking at the codebase I saw that there is this `bullet` list that is being compared against:
RAG/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py, lines 43 to 51 in 8c0f500
But digging a bit deeper into the code, I found that sometimes the bullet points are not even parsed into the text, so there is nothing to check against this bullet list:
RAG/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py, lines 505 to 506 in 8c0f500
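To make the comparison against a bullet list concrete, here is a minimal sketch of the kind of check I would expect, written as a standalone post-processing step. This is not the code in `pymupdf_rag.py`: the `BULLET_CHARS` tuple, the `normalize_bullets` function, and the `indent_width` parameter are all hypothetical names made up for illustration, and the sketch assumes the bullet glyph actually survives text extraction, which is not always the case here.

```python
# Hypothetical set of bullet glyphs, for illustration only.
# This is NOT the list used in pymupdf_rag.py.
BULLET_CHARS = ("•", "◦", "▪", "‣", "○", "●", "*", "-")


def normalize_bullets(text: str, indent_width: int = 4) -> str:
    """Rewrite lines that start with a known bullet glyph as Markdown list items.

    Leading whitespace is converted into a nesting level, assuming
    `indent_width` spaces (or one tab) per level.
    """
    out = []
    for line in text.splitlines():
        stripped = line.lstrip(" \t")
        is_bullet = (
            len(stripped) > 1
            and stripped[0] in BULLET_CHARS
            and stripped[1] in " \t"  # glyph must be followed by whitespace
        )
        if is_bullet:
            # Measure the indentation with tabs expanded to spaces.
            lead = line[: len(line) - len(stripped)].expandtabs(indent_width)
            level = len(lead) // indent_width
            # Markdown nests list items with two spaces per level here.
            out.append("  " * level + "- " + stripped[1:].strip())
        else:
            out.append(line)
    return "\n".join(out)
```

With something like this, a top-level `•` line and an indented `◦` line would both come out as `- ` items, with the second one nested by two spaces. Lines whose bullet glyph was dropped during extraction would still pass through unchanged, which is the second problem described above.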
Can anyone help me with this?