I use `RecursiveCharacterTextSplitter` to split code files (Python, C++, Java, etc.): `python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, ...`. When observing the split nodes, the metadata contains only `source`, `content_type`, and `language`.
Replies: 1 comment
I found a similar unsolved discussion on this topic: How to get nodes relationships with RecursiveCharacterTextSplitter, which was last updated on February 18, 2025 [1].

To obtain the order of nodes and their start and end character indices when using `RecursiveCharacterTextSplitter`, you can use the `add_start_index` parameter. When set to `True`, it adds the start index of each chunk to the metadata of the resulting `Document` objects. Here's an example:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Example text
text = "w1 w1 w1 w1 w1 w1 w1 w1 w1"

# Initialize the RecursiveCharacterTextSplitter with add_start_index=True
splitter = RecursiveCharacterTextSplitter(
    chunk_size=6,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
    add_start_index=True,
)

# Split the text
docs = splitter.create_documents([text])

# Output the documents with their start indices
for doc in docs:
    start_index = doc.metadata["start_index"]
    print(f"Content: {doc.page_content}, Start Index: {start_index}")
```

This gives you the content of each chunk along with its start index in the original text. The end index can be calculated by adding the length of the chunk to the start index [2].
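Applied to code files, the same idea can be sketched as follows. This is a minimal illustration, not the library's own API: the names `ChunkSpan`, `chunk_spans`, and `split_python_source` and the `chunk_size=200` default are hypothetical. It assumes that `from_language` forwards `add_start_index` to the splitter constructor, that chunks come back in the order they appear in the text, and that each `start_index` is relative to the text handed to the splitter.

```python
from dataclasses import dataclass


@dataclass
class ChunkSpan:
    order: int  # position of the chunk in the split sequence
    start: int  # offset of the chunk's first character in the original text
    end: int    # offset one past the chunk's last character (exclusive)
    text: str


def chunk_spans(docs):
    """Derive order and [start, end) offsets from chunks split with add_start_index=True."""
    spans = []
    for order, doc in enumerate(docs):
        start = doc.metadata["start_index"]
        # End index = start index + chunk length, as described above.
        end = start + len(doc.page_content)
        spans.append(ChunkSpan(order, start, end, doc.page_content))
    return spans


def split_python_source(source, chunk_size=200, chunk_overlap=0):
    """Hypothetical helper: split Python source and record each chunk's start index."""
    # Imported lazily so chunk_spans stays usable without LangChain installed.
    from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,  # assumed to be passed through to the constructor
    )
    return splitter.create_documents([source])
```

Calling `chunk_spans(split_python_source(source))` then yields one `ChunkSpan` per chunk, and `source[span.start:span.end]` can be compared against `span.text` as a sanity check. If the documents come from a loader (as the `source`, `content_type`, and `language` metadata suggest), `splitter.split_documents(docs)` should keep that metadata and add `start_index` next to it; in that case the offsets are relative to each loaded document's `page_content`, which may not match offsets in the file on disk.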