feat: trying to fix the no doc_ref_id error on loading documents! #24
Actionable comments posted: 0
🧹 Nitpick comments (1)
hivemind_etl/mediawiki/etl.py (1)
40-87: Consider adding validation for page.page_id.
While the current implementation should fix the immediate issue, it might be worth validating that page.page_id is not None or empty before using it as doc_id and ref_doc_id.
  # Generate a ref_doc_id if needed for newer llama-index versions
  doc_id = page.page_id
+ if not doc_id:
+     logging.warning(f"Missing page_id for page with title: {page.title}, generating fallback ID")
+     doc_id = f"fallback_{hash(page.title)}_{hash(page.revision.text[:100])}"
  documents.append(
      Document(
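One caveat with the suggested fallback: Python's built-in hash() is salted per process for strings, so the generated ID would likely differ between runs for the same page. A minimal sketch of a deterministic variant follows, assuming the page object exposes page_id, title, and revision.text as in the suggestion. The Page and Revision dataclasses and the resolve_doc_id helper are hypothetical stand-ins for illustration, not names from etl.py.

```python
import hashlib
import logging
from dataclasses import dataclass
from typing import Optional


@dataclass
class Revision:
    # Hypothetical stand-in for the revision object referenced in the review.
    text: str


@dataclass
class Page:
    # Hypothetical stand-in for the page objects handled in etl.py.
    page_id: Optional[str]
    title: str
    revision: Revision


def resolve_doc_id(page: Page) -> str:
    """Return page.page_id, or a deterministic fallback ID when it is missing."""
    if page.page_id:
        return str(page.page_id)

    logging.warning(
        "Missing page_id for page with title: %s, generating fallback ID",
        page.title,
    )
    # sha256 yields the same digest in every process, unlike hash() on str.
    seed = f"{page.title}\n{page.revision.text[:100]}"
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()
    return f"fallback_{digest[:16]}"
```

With this shape, re-running the ETL on a page that lacks a page_id would produce the same fallback ID each time, which matters if the ID is used to detect already-ingested documents.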
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- hivemind_etl/mediawiki/etl.py (2 hunks)
- requirements.txt (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / lint / Lint
- GitHub Check: ci / test / Test
🔇 Additional comments (4)
requirements.txt (1)
2-2: Appropriate version update to resolve the dependency issue.
Upgrading tc-hivemind-backend from 1.4.0 to 1.4.2.post2 aligns with the changes made in the ETL code to support proper document reference ID handling. This should address the "no doc_ref_id error" mentioned in the PR title.
hivemind_etl/mediawiki/etl.py (3)
45-47: Good introduction of doc_id variable with clear comment.
The added comment clearly explains the purpose of this change: to support newer llama-index versions that require a reference document ID. Creating a local variable improves code readability.
49-50: LGTM: Proper usage of the doc_id variable.
Using the local variable instead of directly referencing page.page_id improves consistency and readability.
62-63: Key fix: Added ref_doc_id to metadata.
This change directly addresses the PR objective by adding the reference document ID to the document metadata, which will prevent the "no doc_ref_id error" during document loading.
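For readers without the diff at hand, the three comments above describe a pattern that can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from etl.py: the Document dataclass is a local stand-in for the llama-index class, and build_documents and the dict-shaped pages are hypothetical. Only the doc_id variable and the ref_doc_id metadata key come from the review.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    # Local stand-in for the llama-index Document class (assumption).
    doc_id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_documents(pages: List[Dict[str, Any]]) -> List[Document]:
    """Build one Document per page, reusing a single doc_id value."""
    documents: List[Document] = []
    for page in pages:
        # One local variable feeds both the document ID and the metadata,
        # so the two cannot drift apart.
        doc_id = page["page_id"]
        documents.append(
            Document(
                doc_id=doc_id,
                text=page["text"],
                metadata={"title": page["title"], "ref_doc_id": doc_id},
            )
        )
    return documents
```

Keeping the value in one variable means a later change to how the ID is derived only has to be made in one place.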
Summary by CodeRabbit
Chores
- Updated tc-hivemind-backend to 1.4.2.post2.
Refactor