
enh(preprocessing): Add split_markdown_by_headings. #14


Merged
1 commit merged into main on Jan 27, 2025

Conversation

@daavoo (Contributor) commented Jan 22, 2025

@daavoo daavoo requested a review from a team January 22, 2025 10:32
@daavoo daavoo self-assigned this Jan 22, 2025
@daavoo daavoo linked an issue Jan 22, 2025 that may be closed by this pull request
@stefanfrench (Contributor) left a comment


@daavoo - tested this out.

For some of my PDFs, this change did improve the sections retrieved.

For a couple of them, I got worse results:

["epoch 5","epoch 2","lora: low-rank adaptation of large lan### guage models","of trainable parameters = 18m"]

EU AI Act.pdf (this is a very long doc):

["2024_1689 12.7.2024","en oj l, 12.7.2024"]

I think it could be worth investigating and adding some logic to deal with these cases. WDYT?

@daavoo (Contributor, Author) commented Jan 23, 2025

For a couple of them, I got worse results:

@stefanfrench is it worse than the main branch?
I get the same results with https://pypi.org/project/langchain-text-splitters/ .
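For reference, the general approach both splitters take can be sketched with the standard library alone (names here are illustrative, not this PR's actual API): cut the text at ATX headings and keep each heading with the section that follows.

```python
import re

# Minimal sketch of heading-based markdown splitting (assumed names,
# not the library's API): find ATX headings ("#" through "######") at
# line starts and slice the document at each one.
HEADING = re.compile(r"^(#{1,6})\s+.*$", re.MULTILINE)

def split_by_headings(markdown: str) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts:
        return [markdown]
    # Any text before the first heading becomes its own section.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(markdown)]
    return [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]

doc = "intro\n# One\nbody 1\n## Two\nbody 2\n"
print(split_by_headings(doc))
# → ['intro', '# One\nbody 1', '## Two\nbody 2']
```

Either way, a splitter like this can only do as well as the markdown it is given, which is the crux of the discussion below.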

I think it could be worth investigating and adding some logic to deal with these cases. WDYT?

I think the root cause for these is not the "splitter" but rather the PDF-to-markdown conversion.

I have tried pymupdf4llm and markitdown and they both fail to generate markdown headers for these PDFs.

For the LoRA paper, they produce markdown where a header is represented like:

ABSTRACT

An important paradigm of natural language processing consists of large-scale pre-
training on general domain data and adaptation to particular tasks or domains. As
...

1

INTRODUCTION

Many applications in natural language processing rely on adapt-
ing one large-scale, pre-trained language model to multiple down-
...

Do you think it would be worth trying to extend the splitter with a regex pattern matching a line with all caps and nothing else?
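As a sketch, such a heuristic might look like the following (names are illustrative, not the PR's API); note the extra guard so bare page numbers like "1" don't get promoted to headings:

```python
import re

# Hypothetical all-caps heading heuristic: a line counts as a heading
# if it contains only uppercase letters, digits, and basic punctuation,
# and no lowercase characters -- e.g. "ABSTRACT" or "1 INTRODUCTION".
ALL_CAPS_LINE = re.compile(r"^[A-Z0-9][A-Z0-9 .,:&()\-]*$")

def looks_like_heading(line: str) -> bool:
    line = line.strip()
    # Require at least one letter so stray page numbers don't match.
    return bool(ALL_CAPS_LINE.match(line)) and any(c.isalpha() for c in line)

print(looks_like_heading("ABSTRACT"))        # True
print(looks_like_heading("1 INTRODUCTION"))  # True
print(looks_like_heading("1"))               # False
print(looks_like_heading("Introduction"))    # False
```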

For the EU AI Act, I don't really know why, but the markdown is even trickier; it looks like:


properly  collect,  store  and  interpret  the  logs  in  accordance  with  Article  12.

Article  14

Human  oversight

High-risk  AI  systems  shall  be  designed  and  developed  in  such  a  way,  including  with  appropriate  human-machine

I don't really know how we could extract that without potentially messing up other inputs (and creating a PDF-specific split doesn't seem like a good solution).
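For illustration only (this pattern is hypothetical, and the concern above applies): a rule for these lines would have to be document-specific, and could wrongly split any unrelated input that happens to put "Article N" on its own line.

```python
import re

# Illustrative, document-specific pattern for the EU AI Act excerpt:
# a line consisting only of "Article" plus a number (tolerating the
# doubled spaces seen in the converted markdown). This is exactly the
# kind of PDF-specific rule the comment above argues against.
ARTICLE_LINE = re.compile(r"^Article\s+\d+\s*$")

print(bool(ARTICLE_LINE.match("Article  14")))     # True
print(bool(ARTICLE_LINE.match("see Article 14")))  # False
```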

@stefanfrench (Contributor) commented

@daavoo - Yes, I'm actually getting the same results with main; however, from memory I recall getting better results with these PDFs from a previous commit.

I think it would be worth extending the splitter with a regex pattern matching a line with all caps and nothing else, so that it works for LoRA-type PDF scenarios. I think this is probably common enough to warrant the effort.

Let's ignore the EU AI Act format, as I think it's an edge case. We could also open a 'needs help' issue for improving this pre-processing in general and see if the community has any ideas.

Definitely worth mentioning these challenges in the blog post, as I would have expected this to be a more solved problem.

We could also take a look at marker. Do you think this would help at all, or would we run into the same issues?

@daavoo (Contributor, Author) commented Jan 24, 2025

@daavoo - Yes, I'm actually getting the same results with main; however, from memory I recall getting better results with these PDFs from a previous commit.

It might have been back when the PDF-to-markdown conversion was done with docling. It was better for LoRA (headings were detected) but worse for many of the rest.

I think it would be worth extending the splitter with a regex pattern matching a line with all caps and nothing else, so that it works for LoRA-type PDF scenarios. I think this is probably common enough to warrant the effort.

Would give it a try.

Let's ignore the EU AI Act format, as I think it's an edge case. We could also open a 'needs help' issue for improving this pre-processing in general and see if the community has any ideas.

Definitely worth mentioning these challenges in the blog post, as I would have expected this to be a more solved problem.

Will do.

We could also take a look at marker, do you think this would help at all or would we run into the same issues?

I will give it a try. I was trying to avoid "expensive" preprocessing, as marker and docling (optionally) use CV models to identify the layout. I might just include them as an option in the benchmark.

@stefanfrench (Contributor) left a comment


@daavoo Based on re-testing this after re-installing dependencies, this worked well for the LoRA paper. Let's proceed without marker for now.

@daavoo daavoo merged commit d1675cf into main Jan 27, 2025
3 checks passed
@daavoo daavoo deleted the 13-implement-alternative-to-markdownheadertextsplitter branch January 27, 2025 12:30

Successfully merging this pull request may close these issues.

Implement alternative to MarkdownHeaderTextSplitter
2 participants