enh(preprocessing): Add split_markdown_by_headings. #14
Conversation
@daavoo - tested this out.
For some of my PDFs, this change did improve the sections retrieved.
For a couple of them, I got worse results:
`["epoch 5","epoch 2","lora: low-rank adaptation of large lan### guage models","of trainable parameters = 18m"]`
EU AI Act.pdf (this is a very long doc):
`["2024_1689 12.7.2024","en oj l, 12.7.2024"]`
I think it could be worth investigating and adding some logic to deal with these cases. WDYT?
@stefanfrench
> it is worse than
I think the root cause for these is not the "splitter" but rather the PDF-to-markdown conversion. I have tried pymupdf4llm and markitdown, and they both fail to generate markdown headers for these PDFs. For the LoRA paper, they create a markdown where the header is represented like:

```
ABSTRACT
An important paradigm of natural language processing consists of large-scale pre-
training on general domain data and adaptation to particular tasks or domains. As
...
1
INTRODUCTION
Many applications in natural language processing rely on adapt-
ing one large-scale, pre-trained language model to multiple down-
...
```

Do you think it would be worth trying to extend the splitter with a regex pattern matching a line with all caps and nothing else?

For the EU AI Act, I don't really know why, but the markdown is even trickier, as it looks like:

```
properly collect, store and interpret the logs in accordance with Article 12.
Article 14
Human oversight
High-risk AI systems shall be designed and developed in such a way, including with appropriate human-machine
```

I don't really know how we could extract that without potentially messing up other inputs (and creating a PDF-specific split doesn't seem like a good solution).
@daavoo - Yes, I'm actually getting the same. I think it would be worth extending the splitter with a regex pattern matching a line with all caps and nothing else, so that it works for the LoRA-type PDF scenarios. I think this is probably common enough to warrant the effort. Definitely worth mentioning these challenges in the blog post, as I would have thought this would be more solved. We could also take a look at marker - do you think this would help at all, or would we run into the same issues?
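The all-caps heuristic discussed above could be sketched as a small preprocessing pass that promotes all-caps-only lines to markdown headings before splitting. This is a hypothetical illustration, not the implementation in this PR; the function name `promote_all_caps_headings`, the regex, and the heading level are all assumptions:

```python
import re

# Assumed heuristic: a line that starts with a capital letter and contains
# only uppercase letters, digits, spaces, and light punctuation is treated
# as a heading that the PDF-to-markdown conversion failed to mark.
ALL_CAPS_LINE = re.compile(r"^[A-Z][A-Z\s\d\.\-:]{2,}$")

def promote_all_caps_headings(markdown: str, level: int = 2) -> str:
    """Prefix all-caps-only lines with '#' so heading-based splitting can find them."""
    out = []
    for line in markdown.splitlines():
        stripped = line.strip()
        # Skip lines that are already markdown headings.
        if ALL_CAPS_LINE.fullmatch(stripped) and not stripped.startswith("#"):
            out.append(f"{'#' * level} {stripped}")
        else:
            out.append(line)
    return "\n".join(out)
```

Note that a bare page number like `1` is not promoted (the pattern requires a leading capital letter), which matters for output like the LoRA sample above, where a stray `1` precedes `INTRODUCTION`.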
It might have been, back when the PDF-to-markdown was done with docling. It was better for the LoRA paper (headings were detected) but worse for many of the rest.
Would give it a try.
Will do.
I will give it a try. I was trying to avoid "expensive" preprocessing, as marker and docling (optionally) use CV models to identify the layout. Might just include them as an option in the benchmark.
@daavoo Based on re-testing this after re-installing dependencies, this worked well for the LoRA paper. Let's proceed without marker for now.
Replaces https://pypi.org/project/langchain-text-splitters/
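For context, a heading-based splitter along these lines can be sketched in a few lines of regex-driven Python. This is a hedged illustration of the general technique only; the actual `split_markdown_by_headings` added in this PR may have a different signature and behavior:

```python
import re

# ATX headings: 1-6 '#' characters at the start of a line, then whitespace.
HEADING = re.compile(r"^#{1,6}\s+.*$", re.MULTILINE)

def split_markdown_by_headings(markdown: str) -> list[str]:
    """Split markdown into sections, each starting at a heading line.

    Text before the first heading becomes its own leading section.
    Sketch only; the PR's implementation may differ.
    """
    sections = []
    last = 0
    for match in HEADING.finditer(markdown):
        if match.start() > last:
            sections.append(markdown[last:match.start()].strip())
        last = match.start()
    sections.append(markdown[last:].strip())
    return [s for s in sections if s]
```

This also makes the failure mode discussed above concrete: if the PDF-to-markdown step never emits `#` lines (as with the LoRA and EU AI Act PDFs), the whole document collapses into a single section, and retrieval quality degrades.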