feat: limit batch size to 1!

amindadgar · amindadgar · commit 0185c7ff7ded · 2025-06-29T10:02:25.000+03:30
As a temporary fix for the llama-index issue where documents are loaded into the docstore before the vectorstore, we limit the batch size to 1.

The issue described:
When the llama-index pipeline loads documents into the vectorstore, it first loads them into the docstore and only then into the vectorstore.
If any problem is raised while loading a batch into the docstore, the data in that batch is never loaded into the vectorstore. So we limit the batch size to 1, meaning the documents are loaded one by one into docstore + vectorstore and a failure affects only a single document.
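
The effect of the batch size can be illustrated with a minimal sketch. It does not use llama-index; `load_batch` and `load` are hypothetical stand-ins that only simulate the "docstore first, then vectorstore" order described above, and documents whose text starts with "bad" are assumed to fail during the docstore step.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(documents, batch_size):
    # Same slicing as in etl.py
    return [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]


def load_batch(batch):
    # Hypothetical stand-in for the pipeline run on one batch:
    # write to the docstore first, then to the vectorstore.
    in_docstore = []
    for doc in batch:
        if doc.startswith("bad"):
            # Simulated docstore failure: the batch stops here and
            # nothing from this batch reaches the vectorstore.
            return in_docstore, []
        in_docstore.append(doc)
    return in_docstore, list(batch)


def load(documents, batch_size):
    docstore, vectorstore = set(), set()
    with ThreadPoolExecutor(max_workers=10) as executor:
        results = executor.map(load_batch, make_batches(documents, batch_size))
        # Merge results in the calling thread to avoid shared mutable state
        for stored, embedded in results:
            docstore.update(stored)
            vectorstore.update(embedded)
    return docstore, vectorstore


docs = ["a", "b", "bad-c", "d"]
big_docstore, big_vectorstore = load(docs, batch_size=1000)
one_docstore, one_vectorstore = load(docs, batch_size=1)
```

In this simulation, a single large batch leaves "a" and "b" in the docstore but nothing in the vectorstore, while a batch size of 1 loses only the failing document. The trade-off is one pipeline run per document instead of one per 1000.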
diff --git a/hivemind_etl/mediawiki/etl.py b/hivemind_etl/mediawiki/etl.py
@@ -103,7 +103,7 @@ def load(self, documents: list[Document]) -> None:
         )
         
         # Process batches in parallel using ThreadPoolExecutor
-        batch_size = 1000
+        batch_size = 1
         batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
         
         with ThreadPoolExecutor(max_workers=10) as executor:
