
enh(preprocessing): Add split_markdown_by_headings. #14


Merged
1 commit merged into main on Jan 27, 2025

Conversation

@daavoo (Contributor) commented Jan 22, 2025

@daavoo daavoo requested a review from a team January 22, 2025 10:32
@daavoo daavoo self-assigned this Jan 22, 2025
@daavoo daavoo linked an issue Jan 22, 2025 that may be closed by this pull request
@stefanfrench (Contributor) left a comment


@daavoo - tested this out.

For some of my PDFs, this change did improve the sections retrieved.

For a couple of them, I got worse results:

["epoch 5","epoch 2","lora: low-rank adaptation of large lan### guage models","of trainable parameters = 18m"]

EU AI Act.pdf (this is a very long doc):

["2024_1689 12.7.2024","en oj l, 12.7.2024"]

I think it could be worth investigating and adding some logic to deal with these cases. WDYT?

@daavoo (Contributor, Author) commented Jan 23, 2025

For a couple of them, I got worse results:

@stefanfrench is it worse than the main branch?
I get the same results with https://pypi.org/project/langchain-text-splitters/ .
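For reference, the general approach both splitters take can be sketched with the standard library alone (names here are illustrative, not this PR's actual API): cut the text at ATX headings and keep each heading with the section that follows.

```python
import re

# Minimal sketch of heading-based markdown splitting (assumed names,
# not the library's API): find ATX headings ("#" through "######") at
# line starts and slice the document at each one.
HEADING = re.compile(r"^(#{1,6})\s+.*$", re.MULTILINE)

def split_by_headings(markdown: str) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts:
        return [markdown]
    # Any text before the first heading becomes its own section.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(markdown)]
    return [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]

doc = "intro\n# One\nbody 1\n## Two\nbody 2\n"
print(split_by_headings(doc))
# → ['intro', '# One\nbody 1', '## Two\nbody 2']
```

Either way, a splitter like this can only do as well as the markdown it is given, which is the crux of the discussion below.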

I think it could be worth investigating and adding some logic to deal with these cases. WDYT?

I think the root cause for these is not the "splitter" but rather the PDF-to-markdown conversion.

I have tried pymupdf4llm and markitdown and they both fail to generate markdown headers for these PDFs.

For the LoRA paper, they produce markdown where a header is represented like:

ABSTRACT

An important paradigm of natural language processing consists of large-scale pre-
training on general domain data and adaptation to particular tasks or domains. As
...

1

INTRODUCTION

Many applications in natural language processing rely on adapt-
ing one large-scale, pre-trained language model to multiple down-
...

Do you think it would be worth trying to extend the splitter with a regex pattern matching a line with all caps and nothing else?
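As a sketch, such a heuristic might look like the following (names are illustrative, not the PR's API); note the extra guard so bare page numbers like "1" don't get promoted to headings:

```python
import re

# Hypothetical all-caps heading heuristic: a line counts as a heading
# if it contains only uppercase letters, digits, and basic punctuation,
# and no lowercase characters -- e.g. "ABSTRACT" or "1 INTRODUCTION".
ALL_CAPS_LINE = re.compile(r"^[A-Z0-9][A-Z0-9 .,:&()\-]*$")

def looks_like_heading(line: str) -> bool:
    line = line.strip()
    # Require at least one letter so stray page numbers don't match.
    return bool(ALL_CAPS_LINE.match(line)) and any(c.isalpha() for c in line)

print(looks_like_heading("ABSTRACT"))        # True
print(looks_like_heading("1 INTRODUCTION"))  # True
print(looks_like_heading("1"))               # False
print(looks_like_heading("Introduction"))    # False
```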

For the EU AI Act, I don't really know why, but the markdown is even trickier; it looks like:


properly  collect,  store  and  interpret  the  logs  in  accordance  with  Article  12.

Article  14

Human  oversight

High-risk  AI  systems  shall  be  designed  and  developed  in  such  a  way,  including  with  appropriate  human-machine

I don't really know how we could extract that without potentially messing up other inputs (and creating a PDF-specific split doesn't seem like a good solution).
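For illustration only (this pattern is hypothetical, and the concern above applies): a rule for these lines would have to be document-specific, and could wrongly split any unrelated input that happens to put "Article N" on its own line.

```python
import re

# Illustrative, document-specific pattern for the EU AI Act excerpt:
# a line consisting only of "Article" plus a number (tolerating the
# doubled spaces seen in the converted markdown). This is exactly the
# kind of PDF-specific rule the comment above argues against.
ARTICLE_LINE = re.compile(r"^Article\s+\d+\s*$")

print(bool(ARTICLE_LINE.match("Article  14")))     # True
print(bool(ARTICLE_LINE.match("see Article 14")))  # False
```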

@stefanfrench (Contributor) commented

@daavoo - Yes, I'm actually getting the same results with main; however, from memory I recall getting better results with these PDFs from a previous commit.

I think it would be worth extending the splitter with a regex pattern matching a line with all caps and nothing else, so that it works for LoRA-type PDF scenarios. I think this is probably common enough to warrant the effort.

Let's ignore the EU AI Act format, as I think it's an edge case. We could also open a 'needs help' issue for improving this pre-processing in general and see if the community has any ideas.

Definitely worth mentioning these challenges in the blog post, as I would have expected this to be a more solved problem.

We could also take a look at marker. Do you think this would help at all, or would we run into the same issues?

@daavoo (Contributor, Author) commented Jan 24, 2025

@daavoo - Yes, I'm actually getting the same results with main; however, from memory I recall getting better results with these PDFs from a previous commit.

It might have been back when the PDF-to-markdown conversion was done with docling. It was better for LoRA (headings were detected) but worse for many of the rest.

I think it would be worth extending the splitter with a regex pattern matching a line with all caps and nothing else, so that it works for LoRA-type PDF scenarios. I think this is probably common enough to warrant the effort.

Would give it a try.

Let's ignore the EU AI Act format, as I think it's an edge case. We could also open a 'needs help' issue for improving this pre-processing in general and see if the community has any ideas.

Definitely worth mentioning these challenges in the blog post, as I would have expected this to be a more solved problem.

Will do.

We could also take a look at marker, do you think this would help at all or would we run into the same issues?

I will give it a try. I was trying to avoid "expensive" preprocessing, as marker and docling (optionally) use CV models to identify the layout. I might just include them as an option in the benchmark.

@stefanfrench (Contributor) left a comment


@daavoo Based on re-testing this after re-installing dependencies, this worked well for the LoRA paper. Let's proceed without marker for now.

@daavoo daavoo merged commit d1675cf into main Jan 27, 2025
3 checks passed
@daavoo daavoo deleted the 13-implement-alternative-to-markdownheadertextsplitter branch January 27, 2025 12:30

Successfully merging this pull request may close these issues.

Implement alternative to MarkdownHeaderTextSplitter
2 participants