Issues with bullet points in PDFs #81

Closed
Jaimish00 opened this issue Jul 26, 2024 · 5 comments
Labels: bug (Something isn't working), fix developed

Comments

@Jaimish00

Jaimish00 commented Jul 26, 2024

Hello there,

First of all thanks for this amazing library 🙌

I am having trouble with bullet points in the generated markdown. I tried several kinds of bullet points to check whether the output preserves both top-level and indented bullets.

For example, I created a document with some bullet points, which you can see below:
image

I then exported this doc as a PDF and ran to_markdown on it, which gave me this output:

Hello There\n\nI am testing the bullet points here\n\n○ Just to see if markdown is generated properly\n\n■ And is it able to keep the formatting intact\n\n\n-----\n\n

I have a few observations from looking at this output:

  1. The indentation is lost: nested items are separated only by newlines (\n), with no \t added.
  2. The first-level bullet point is not rendered at all. As the first and second lines show, no "- " is prepended to the text. Looking at the codebase, I saw that the text is compared against this list of bullet characters:
    bullet = (
        "- ",
        "* ",
        chr(0xF0A7),
        chr(0xF0B7),
        chr(0xB7),
        chr(8226),
        chr(9679),
    )

Digging a bit deeper into the code, I found that sometimes the bullet characters do not even make it into the extracted text, so this check against the bullet list has nothing to match:

if text.startswith(bullet):
    text = "- " + text[1:]

Can anyone help me with this?

@JorjMcKie
Contributor

Thanks for the feedback!

Please always provide an example PDF page for problem reproduction.
In your specific situation you might want to suggest additional bullet point characters to add to that list.
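
A minimal sketch of what such an extension could look like, assuming the two unrecognized glyphs in the sample output are U+25CB (○) and U+25A0 (■). The first seven entries are copied from the tuple quoted in the report; the names `BULLETS` and `normalize_bullet` are hypothetical and this is not the library's actual code.

```python
# Bullet markers quoted in the report, plus two glyphs seen in the sample output.
BULLETS = (
    "- ",
    "* ",
    chr(0xF0A7),
    chr(0xF0B7),
    chr(0xB7),
    chr(8226),
    chr(9679),
    chr(0x25CB),  # white circle, assumed addition
    chr(0x25A0),  # black square, assumed addition
)


def normalize_bullet(text: str) -> str:
    """Rewrite a line that starts with a known bullet marker as a Markdown list item."""
    for marker in BULLETS:
        if text.startswith(marker):
            # Drop the marker and any whitespace after it, then add "- ".
            return "- " + text[len(marker):].lstrip()
    return text  # not a bullet line, leave unchanged
```

Unlike the quoted `text[1:]` slice, this version removes the full marker length, so two-character markers such as "- " do not leave a stray space behind.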

@Jaimish00
Author

LLM - Bullet Points test.pdf

Sure, I've attached it. I created this simple doc in Google Docs as a test.

Moreover, I have been using this package for more complex cases, such as parsing various documentation PDFs and wiki pages. Those may use other types of bullet points, so the current short list of bullet characters might not be sufficient.

@Jaimish00
Author

Hey @JorjMcKie

Any updates on this?

@JorjMcKie
Contributor

JorjMcKie commented Aug 5, 2024

Yes - there is currently no support for multi-level bullet points, and this will not be implemented any time soon.

The more basic issue (unrecognized bullets) is still under investigation. I am currently out of town, so bear with me for at least another week or maybe two.
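
While multi-level bullets remain unsupported, one possible stopgap is to post-process the generated text. The sketch below assumes each nesting level uses a distinct glyph (●, ○, ■, as in the Google Docs sample) and that the glyph survives extraction as the first character of the line. `GLYPH_DEPTH` and `nest_bullets` are hypothetical names, not part of the library.

```python
# Hypothetical glyph-to-depth table. The assignment is an assumption based on
# the sample document, not something the library defines.
GLYPH_DEPTH = {
    chr(9679): 0,    # black circle, first level
    chr(0x25CB): 1,  # white circle, second level
    chr(0x25A0): 2,  # black square, third level
}


def nest_bullets(markdown: str, indent: str = "  ") -> str:
    """Turn glyph-prefixed lines into indented Markdown list items."""
    out = []
    for line in markdown.split("\n"):
        depth = GLYPH_DEPTH.get(line[:1])  # line[:1] is "" for empty lines
        if depth is None:
            out.append(line)  # not a recognized bullet line
        else:
            out.append(indent * depth + "- " + line[1:].lstrip())
    return "\n".join(out)
```

This only helps when the glyphs are present in the text; it cannot recover levels for bullets that were dropped during extraction.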

@JorjMcKie
Contributor

Fixed in v0.0.17.
