How to get nodes relationships with RecursiveCharacterTextSplitter #29899

Danana1 · 2025-02-20T07:44:14Z

I use RecursiveCharacterTextSplitter to split code files (Python, C++, Java, etc.):

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)
lc_nodes = python_splitter.split_documents(docs)

When I inspect the split nodes, the metadata contains only source, content_type, and language.
I also want the relationships between nodes: if a file was split into several nodes, I want to know their order and, if possible, each node's start and end character index.
How can I do that with LangChain's RecursiveCharacterTextSplitter? Or is there an alternative?

Answered by dosubot[bot]

Feb 20, 2025

@dosu · 2025-02-20T07:46:26Z

dosubot[bot]
bot Feb 20, 2025

I found a similar unsolved discussion on this topic: How to get nodes relationships with RecursiveCharacterTextSplitter, which was last updated on February 18, 2025 ^[1].

To obtain the order of nodes and their start and end character indices when using RecursiveCharacterTextSplitter, you can use the add_start_index parameter. When set to True, it adds a start_index entry (the chunk's character offset in the original text) to the metadata of each resulting Document. The chunks for each input are returned in their original order, so a chunk's position in the returned list gives its order. Here's an example:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Example text
text = "w1 w1 w1 w1 w1 w1 w1 w1 w1"

# Initialize the RecursiveCharacterTextSplitter with add_start_index=True
splitter = RecursiveCharacterTextSplitter(
    chunk_size=6,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
    add_start_index=True,
)

# Split the text
docs = splitter.create_documents([text])

# Output the documents with their start indices
for doc in docs:
    start_index = doc.metadata["start_index"]
    print(f"Content: {doc.page_content}, Start Index: {start_index}")

This prints the content of each chunk along with its start index in the original text. The end index (exclusive) can be calculated by adding the length of the chunk to the start index, i.e. start_index + len(doc.page_content) ^[2].
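Applied to the original question (several source files split with split_documents), the same idea can be sketched as a small post-processing step. This is only a sketch under stated assumptions: the helper name annotate_chunks, the demo function, the file name example.py, and the metadata keys chunk_index, end_index, prev_chunk_index and next_chunk_index are made up for illustration and are not part of LangChain. It also assumes the splitter was created with add_start_index=True and that each input document carries a "source" entry in its metadata.

```python
from collections import defaultdict


def annotate_chunks(chunks, source_key="source"):
    """Add order and end-offset information to each chunk's metadata.

    Assumes every chunk has .page_content and a .metadata dict that
    contains "start_index", and that chunks from the same source appear
    in the order the splitter produced them.
    """
    # Group chunks by the file they came from, preserving list order.
    by_source = defaultdict(list)
    for chunk in chunks:
        by_source[chunk.metadata.get(source_key)].append(chunk)

    for group in by_source.values():
        last = len(group) - 1
        for i, chunk in enumerate(group):
            start = chunk.metadata["start_index"]
            chunk.metadata["chunk_index"] = i  # position within its source
            chunk.metadata["end_index"] = start + len(chunk.page_content)  # exclusive
            chunk.metadata["prev_chunk_index"] = i - 1 if i > 0 else None
            chunk.metadata["next_chunk_index"] = i + 1 if i < last else None
    return chunks


def demo():
    # Imported here so annotate_chunks itself has no LangChain dependency.
    from langchain_core.documents import Document
    from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

    code = "def a():\n    return 1\n\n\ndef b():\n    return 2\n"
    docs = [Document(page_content=code, metadata={"source": "example.py"})]

    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON,
        chunk_size=30,
        chunk_overlap=0,
        add_start_index=True,  # required for the "start_index" metadata key
    )

    for chunk in annotate_chunks(splitter.split_documents(docs)):
        m = chunk.metadata
        print(m["source"], m["chunk_index"], m["start_index"], m["end_index"])
```

Calling demo() prints one line per chunk with its source, order, and character span. Grouping by source matters because split_documents returns one flat list for all input files, so a plain enumerate over that list would number chunks across files rather than within each file.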
