# Add GroundX doc processing pipeline files #182
## Conversation
## Walkthrough

Introduces a new Streamlit-based Ground X document analysis app with chat, supporting utilities for Ground X ingestion and X-Ray retrieval, an evaluation pipeline using GEval and GPT-4o comparison, a CLI to run evaluations, packaging and configuration files, environment variable examples, and a Beam-based server launcher. No existing code is modified.
## Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Streamlit App
    participant GroundX Utils
    participant Ground X API
    participant Ollama LLM
    User->>Streamlit App: Upload/select document
    Streamlit App->>GroundX Utils: process_document(file)
    GroundX Utils->>Ground X API: ingest + poll status
    Ground X API-->>GroundX Utils: processing complete + xray_url
    GroundX Utils->>Ground X API: fetch X-Ray JSON
    Ground X API-->>GroundX Utils: xray_data
    GroundX Utils-->>Streamlit App: xray_data, used_existing
    Streamlit App->>User: Show X-Ray tabs and summaries
    User->>Streamlit App: Ask question in Chat
    Streamlit App->>Streamlit App: prepare_chat_context(xray_data, prompt)
    Streamlit App->>Ollama LLM: generate(prompt, context)
    Ollama LLM-->>Streamlit App: response
    Streamlit App-->>User: Chat answer
```

```mermaid
sequenceDiagram
    autonumber
    participant CLI
    participant EvaluatorGEval
    participant Ground X API
    participant OpenAI API
    participant GEval (Opik)
    CLI->>EvaluatorGEval: run(invoices, questions)
    loop per invoice
        EvaluatorGEval->>Ground X API: ingest + poll + fetch X-Ray
        Ground X API-->>EvaluatorGEval: xray_data
        EvaluatorGEval->>OpenAI API: reasoning (image/PDF as needed)
        OpenAI API-->>EvaluatorGEval: model_output
        EvaluatorGEval->>GEval (Opik): score(model_output vs expected)
        GEval (Opik)-->>EvaluatorGEval: score, reason
    end
    EvaluatorGEval-->>CLI: results and averages
    CLI-->>CLI: print breakdown
```
## Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 18
🧹 Nitpick comments (18)
groundX-doc-pipeline/.env.example (1)
1-3: Add SAMPLE_FILE_NAME and address a linter nit (ending newline). The app reads SAMPLE_FILE_NAME (line 213 in app.py), but it's not present here. Also, add a trailing newline to satisfy dotenv-linter.
Apply this diff:
```diff
 GROUNDX_API_KEY=your-groundx-api-key-here
 OPENAI_API_KEY=your-openai-api-key-here
 COMET_API_KEY=your-comet-api-key-here
+SAMPLE_FILE_NAME=sample-file.pdf
```

Note: keep this file as an example and never commit real secrets.
groundX-doc-pipeline/server.py (1)
3-17: Environment variables are not passed to the Pod (GROUNDX_API_KEY, OPENAI_API_KEY). The app loads API keys from the environment, but the Pod doesn't receive them. Without passing env vars, the app will likely fail at client creation.
If Pod supports env passthrough, add something like:
```python
# Example (adjust to your Beam API):
streamlit_server = Pod(
    image=Image().add_python_packages([...]),
    ports=[8501],
    cpu=4,
    memory="2Gi",
    entrypoint=["streamlit", "run", "groundX-doc-pipeline/app.py"],
    env={
        "GROUNDX_API_KEY": os.environ.get("GROUNDX_API_KEY", ""),
        "OPENAI_API_KEY": os.environ.get("OPENAI_API_KEY", ""),
        "COMET_API_KEY": os.environ.get("COMET_API_KEY", ""),
        "SAMPLE_FILE_NAME": os.environ.get("SAMPLE_FILE_NAME", ""),
    },
)
```

If Beam has a first-class Secrets mechanism, prefer that over env to avoid accidental leakage.
groundX-doc-pipeline/groundx_utils.py (1)
2-2: Remove unused import.
`tempfile` is imported but unused. Apply this diff:

```diff
-import tempfile
```

groundX-doc-pipeline/app.py (4)
1-5: Remove unused typing imports.
`Any` and `Dict` are not used. Apply this diff:

```diff
-import os
-import tempfile
-import requests
-from typing import Any, Dict
+import os
+import tempfile
+import requests
```
414-418: Collapse nested context managers for clarity.Use a single with statement.
Apply this diff:
```diff
-    with st.chat_message("assistant"):
-        with st.spinner("Thinking..."):
-            context = prepare_chat_context(xray, prompt)
-            response = generate_chat_response(prompt, context)
-            st.session_state.chat_history.append({"role": "assistant", "content": response})
-            st.markdown(response)
+    with st.chat_message("assistant"), st.spinner("Thinking..."):
+        context = prepare_chat_context(xray, prompt)
+        response = generate_chat_response(prompt, context)
+        st.session_state.chat_history.append({"role": "assistant", "content": response})
+        st.markdown(response)
```
114-140: Ollama-only chat backend may fail in hosted environments; add a fallback. The Beam Pod likely won't have an Ollama daemon on localhost. Provide a fallback to OpenAI if OPENAI_API_KEY is set; otherwise, retain Ollama.
I can prepare a small helper to detect availability and route accordingly. Do you want me to add that with minimal config toggles?
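As a rough sketch of what that helper could look like (the probe URL, function names, and the routing logic are assumptions for illustration, not the app's actual API):

```python
import os
from urllib.request import urlopen
from urllib.error import URLError

OLLAMA_URL = "http://localhost:11434"  # default Ollama daemon address (assumption)

def ollama_available(timeout: float = 0.5) -> bool:
    """Probe the local Ollama daemon; any connection error means 'not available'."""
    try:
        with urlopen(OLLAMA_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def pick_backend() -> str:
    """Prefer local Ollama; fall back to OpenAI when an API key is configured."""
    if ollama_available():
        return "ollama"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError("No chat backend available: start Ollama or set OPENAI_API_KEY")
```

The app could call `pick_backend()` once at startup and route its response-generation call accordingly.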
18-52: CSS overrides depend on private Streamlit classes and may break across versions. These class names are not part of the public API. Consider keeping the overrides minimal or feature-gating them.
If you’ve verified against your target Streamlit version, feel free to ignore. Otherwise, guard with a comment and be prepared to adjust on upgrades.
groundX-doc-pipeline/README.md (3)
19-21: Clarify dependency installation (uv vs pip) and prerequisites. "uv sync" assumes uv is installed. Add a quick note or a fallback to pip so users aren't blocked.
Proposed wording (no diff since it adds lines):
- Add a line above the code fence: “If you don’t have uv installed, see https://docs.astral.sh/uv/ or run pip install -U uv.”
- Or add an alternative block: “pip install -r requirements.txt (or pip install -e . if using pyproject).”
37-41: Add a cd step to avoid path confusion. Make it explicit that you should run from the project directory.
````diff
-```bash
-streamlit run app.py
-```
+```bash
+cd groundX-doc-pipeline
+streamlit run app.py
+```
````
45-56: Fix MD040: add a language to the fenced block. The project structure fence should specify a language (e.g., text).

````diff
-```
+```text
 groundX-doc-pipeline/
 ├── app.py                 # Main Streamlit application (uses groundx_utils.py)
 ├── groundx_utils.py       # Utility functions for Ground X operations
 ├── .env                   # Environment variables (create from .env.example)
 ├── file/                  # Folder containing files for running evaluation
 └── README.md              # This file

 📁 Evaluation Tools:
 ├── evaluation_geval.py    # GEval framework evaluation
 └── run_evaluation_cli.py  # CLI evaluation runner
-```
+```
````

groundX-doc-pipeline/run_evaluation_cli.py (4)
16-19: Deterministic file ordering for reproducible evaluations. iterdir order is not guaranteed. Sort by filename for consistent runs.

```diff
-    files = [p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in ALLOWED_EXTENSIONS]
+    files = sorted(
+        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in ALLOWED_EXTENSIONS),
+        key=lambda p: p.name.lower()
+    )
```
141-151: Simplify exact/partial matching branches (SIM114). Combine the two branches to reduce duplication.

```diff
-                # Exact match validation
-                if q_lower == expected_q_lower:
-                    inv["expected_outputs"][q] = expected_a
-                    matched = True
-                    break
-
-                # Partial match validation using key identifying words
-                elif _smart_partial_match(q_lower, expected_q_lower):
-                    inv["expected_outputs"][q] = expected_a
-                    matched = True
-                    break
+                # Exact or partial match
+                if q_lower == expected_q_lower or _smart_partial_match(q_lower, expected_q_lower):
+                    inv["expected_outputs"][q] = expected_a
+                    matched = True
+                    break
```
178-180: Remove extraneous f-prefix in f-string (F541). No placeholders are present.

```diff
-    print(f"\nDetailed Results:")
+    print("\nDetailed Results:")
```
129-156: Brittle dataset-key detection by exact filename base. Matching only when Path(...).stem equals "electricity" or "energy-plus" is fragile. Consider a contains-based match so "electricity-invoice-1.pdf" still maps.

```diff
-        base = Path(inv["name"]).stem.lower()
-        if base in EXPECTED_ANSWERS:
+        base = Path(inv["name"]).stem.lower()
+        dataset_key = next((k for k in EXPECTED_ANSWERS if k in base), None)
+        if dataset_key:
-            for q in questions:
+            for q in questions:
@@
-                for expected_q, expected_a in EXPECTED_ANSWERS[base].items():
+                for expected_q, expected_a in EXPECTED_ANSWERS[dataset_key].items():
```

groundX-doc-pipeline/evaluation_geval.py (4)
17-38: Hard fail on empty expected_output may be too strict. The evaluator raises if expected_output is empty. The CLI fills defaults, but if this function is reused elsewhere it will hit an error path. Consider treating an empty expected_output as "data not available" to avoid hard failures.

```diff
-    if not model_output or not expected_output:
-        raise ValueError("Model output and expected output cannot be empty")
+    if not model_output:
+        raise ValueError("Model output cannot be empty")
+    if not expected_output:
+        expected_output = "data not available"
```

If you intended the strict behavior, ignore this suggestion.
245-245: Avoid double dataset clear/insert; insert once after enriching samples. You clear/insert before and after evaluation. Keep a single insert at the end to reduce churn (and potential race conditions if datasets are viewed live).

```diff
-        dataset.clear(); dataset.insert(samples)
+        # Defer dataset insert until after evaluation so samples include scores and outputs.
```
261-261: Rename unused loop variable (B007). The index isn't used in this loop.

```diff
-        for i, sample in enumerate(samples):
+        for _i, sample in enumerate(samples):
```
179-179: Remove extraneous f-prefix in debug print (F541). Minor cleanup.

```diff
-            print(f"DEBUG: Using image file directly")
+            print("DEBUG: Using image file directly")
```

Note: the broader fix in the PDF block comment also addresses similar prints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
⛔ Files ignored due to path filters (4)
- `groundX-doc-pipeline/assets/groundx.png` is excluded by `!**/*.png`
- `groundX-doc-pipeline/file/electricity.pdf` is excluded by `!**/*.pdf`
- `groundX-doc-pipeline/file/energy-plus.pdf` is excluded by `!**/*.pdf`
- `groundX-doc-pipeline/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (8)
- `groundX-doc-pipeline/.env.example` (1 hunks)
- `groundX-doc-pipeline/README.md` (1 hunks)
- `groundX-doc-pipeline/app.py` (1 hunks)
- `groundX-doc-pipeline/evaluation_geval.py` (1 hunks)
- `groundX-doc-pipeline/groundx_utils.py` (1 hunks)
- `groundX-doc-pipeline/pyproject.toml` (1 hunks)
- `groundX-doc-pipeline/run_evaluation_cli.py` (1 hunks)
- `groundX-doc-pipeline/server.py` (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
groundX-doc-pipeline/run_evaluation_cli.py (1)
groundX-doc-pipeline/evaluation_geval.py (3)
`create_evaluator_geval` (288-295), `process_invoice` (117-127), `run` (229-285)
groundX-doc-pipeline/app.py (1)
groundX-doc-pipeline/groundx_utils.py (5)
`create_client` (18-22), `ensure_bucket` (25-34), `check_file_exists` (115-121), `get_xray_for_existing_document` (123-136), `process_document` (138-159)
groundX-doc-pipeline/server.py (2)
imagegen-janus-pro/app.py (1)
`main` (198-264)

audio-analysis-toolkit/app.py (1)

`main` (662-810)
groundX-doc-pipeline/evaluation_geval.py (2)
gpt-oss-vs-qwen3/code_evaluation_opik.py (1)
`evaluate_reasoning` (4-212)

code-model-comparison/code_evaluation_opik.py (1)
evaluate_code(4-172)
🪛 Ruff (0.12.2)
groundX-doc-pipeline/run_evaluation_cli.py
4-4: typing.Any imported but unused
Remove unused import: typing.Any
(F401)
142-151: Combine if branches using logical or operator
Combine if branches
(SIM114)
179-179: f-string without any placeholders
Remove extraneous f prefix
(F541)
225-227: Use a single if statement instead of nested if statements
(SIM102)
groundX-doc-pipeline/groundx_utils.py
2-2: tempfile imported but unused
Remove unused import: tempfile
(F401)
groundX-doc-pipeline/app.py
4-4: typing.Any imported but unused
Remove unused import
(F401)
4-4: typing.Dict imported but unused
Remove unused import
(F401)
236-236: Use a context manager for opening files
(SIM115)
414-415: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
groundX-doc-pipeline/evaluation_geval.py
12-12: opik.evaluation.evaluate imported but unused
Remove unused import: opik.evaluation.evaluate
(F401)
13-13: opik.evaluation.metrics.BaseMetric imported but unused
Remove unused import: opik.evaluation.metrics.BaseMetric
(F401)
179-179: f-string without any placeholders
Remove extraneous f prefix
(F541)
186-186: f-string without any placeholders
Remove extraneous f prefix
(F541)
215-215: f-string without any placeholders
Remove extraneous f prefix
(F541)
217-217: f-string without any placeholders
Remove extraneous f prefix
(F541)
245-245: Multiple statements on one line (semicolon)
(E702)
261-261: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
🪛 LanguageTool
groundX-doc-pipeline/README.md
[grammar] ~7-~7: There might be a mistake here.
Context: ...A document processing and X-Ray analysis - Streamlit for the UI - Ollama for servin...
(QB_NEW_EN)
[grammar] ~8-~8: There might be a mistake here.
Context: ...nd X-Ray analysis - Streamlit for the UI - Ollama for serving LLM locally --- ## ...
(QB_NEW_EN)
[grammar] ~61-~61: There might be a mistake here.
Context: ...the document to be processed by Ground X 3. Explore the X-Ray analysis results in di...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...-Ray analysis results in different tabs: - JSON Output: Raw analysis data - Narr...
(QB_NEW_EN)
[grammar] ~63-~63: There might be a mistake here.
Context: ...abs: - JSON Output: Raw analysis data - Narrative Summary: Extracted narratives ...
(QB_NEW_EN)
[grammar] ~64-~64: There might be a mistake here.
Context: ... Narrative Summary: Extracted narratives - File Summary: Document overview - Sug...
(QB_NEW_EN)
[grammar] ~65-~65: There might be a mistake here.
Context: ...ves - File Summary: Document overview - Suggested Text: AI-suggested content ...
(QB_NEW_EN)
[grammar] ~66-~66: There might be a mistake here.
Context: ... - Suggested Text: AI-suggested content - Extracted Text: Raw text extraction -...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ... - Extracted Text: Raw text extraction - Keywords: Document keywords 4. Use the c...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ...raction - Keywords: Document keywords 4. Use the chat interface to ask questions ...
(QB_NEW_EN)
[grammar] ~75-~75: There might be a mistake here.
Context: ...ucket creation and document organization - Document Ingestion: Support for PDF, W...
(QB_NEW_EN)
[grammar] ~76-~76: There might be a mistake here.
Context: ...ort for PDF, Word docs, images, and more - X-Ray Analysis: Rich structured data w...
(QB_NEW_EN)
[grammar] ~77-~77: There might be a mistake here.
Context: ...ies, page chunks, keywords, and metadata - Context Engineering: Intelligent conte...
(QB_NEW_EN)
[grammar] ~78-~78: There might be a mistake here.
Context: ...gent context preparation for LLM queries - AI Chat Interface: Interactive Q&A pow...
(QB_NEW_EN)
[grammar] ~83-~83: There might be a mistake here.
Context: ... ## 📬 Stay Updated with Our Newsletter! Get a FREE Data Science eBook 📖 with 1...
(QB_NEW_EN)
[grammar] ~85-~85: There might be a mistake here.
Context: ...📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! St...
(QB_NEW_EN)
[style] ~93-~93: Using many exclamation marks might seem excessive (in this case: 4 exclamation marks for a text that’s 2081 characters long)
Context: ... Contribution Contributions are welcome! Please fork the repository and submit a...
(EN_EXCESSIVE_EXCLAMATION)
🪛 markdownlint-cli2 (0.17.2)
groundX-doc-pipeline/README.md
25-25: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
45-45: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 dotenv-linter (3.3.0)
groundX-doc-pipeline/.env.example
[warning] 3-3: [EndingBlankLine] No blank line at the end of the file
(EndingBlankLine)
[warning] 3-3: [UnorderedKey] The COMET_API_KEY key should go before the GROUNDX_API_KEY key
(UnorderedKey)
🔇 Additional comments (5)
groundX-doc-pipeline/server.py (1)
4-12: Confirm Beam pip package name alignment with pyproject.Code imports from beam, while pyproject declares beam-client. This mismatch will cause ImportError at runtime.
Ensure the dependency name matches the import. See comment in pyproject.toml for the fix.
groundX-doc-pipeline/pyproject.toml (1)
11-19: Incorrect — beam vs beam-client and the OpenAI version are valid in this repo; do not apply the suggested diff. Short check summary:
- The repo declares and locks the packages the code uses — the original suggestions are not required.
Files/locations to note:
- groundX-doc-pipeline/pyproject.toml — contains "openai>=1.99.9" and "beam-client>=0.1.0".
- groundX-doc-pipeline/server.py — imports "from beam import Image, Pod" (line 1).
- groundX-doc-pipeline/uv.lock — shows beam_client distribution (sdist/wheel, e.g. beam_client-0.2.175) and openai-1.99.9 (sdist/wheel, upload-time entries present).
Conclusion:
- The dependency name mismatch flagged in the original comment is not actionable here (the lockfile contains the beam-client artifact and the code imports beam).
- The OpenAI version (1.99.9) appears in the lockfile (sdist/wheel entries), so it is not an invalid/unreleased reference in this repo.
Action: ignore the proposed diff. If you intended a stricter OpenAI constraint or a different Beam package, explain the target (PyPI package vs import name) and I’ll re-check for a safe change.
Likely an incorrect or invalid review comment.
groundX-doc-pipeline/groundx_utils.py (1)
36-51: Validate `file_type` derivation for non-PDFs (docx MIME). For DOCX, `file_to_process.type` is typically `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, so `file_type` becomes a long vendor string. Confirm Ground X expects that value vs. `docx`. If Ground X requires extensions, map common MIME types to expected short types before calling `Document(...)`. I can provide a mapper if needed.

groundX-doc-pipeline/app.py (1)
210-228: Default SAMPLE_FILE_NAME looks auto-generated and not documented. The default `tmpivkf8qf8_sample-file.pdf` is likely a placeholder. It's better to rely on an env var and document it in `.env.example` (see the suggested addition there). If you want a more discoverable default, consider `sample-file.pdf` and ship a sample in assets/.

groundX-doc-pipeline/README.md (1)
29-35: LGTM on Ollama setup steps. The steps to pull and serve the model are clear and accurate.
```python
# Initialize Streamlit session state
for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file"]:
    if key not in st.session_state:
        st.session_state[key] = None if key == "xray_data" else False
```
🛠️ Refactor suggestion
Initialize session state with correct default types per key.
Using False for string-typed keys (e.g., uploaded_file_name) is confusing. Initialize explicitly.
Apply this diff:
```diff
-for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file"]:
-    if key not in st.session_state:
-        st.session_state[key] = None if key == "xray_data" else False
+st.session_state.setdefault("xray_data", None)
+st.session_state.setdefault("uploaded_file_path", None)
+st.session_state.setdefault("uploaded_file_name", None)
+st.session_state.setdefault("uploaded_file_type", None)
+st.session_state.setdefault("processing_complete", False)
+st.session_state.setdefault("used_existing_file", False)
+st.session_state.setdefault("auto_loaded_file", False)
+st.session_state.setdefault("chat_history", [])
```
```python
        tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=f"_{uploaded.name}")
        tmp_file.write(uploaded.getbuffer())
        tmp_file.close()
        st.session_state.uploaded_file_path = tmp_file.name
    elif st.session_state.uploaded_file_path and not st.session_state.processing_complete:
```
🛠️ Refactor suggestion
Use context manager when writing the uploaded file to disk.
Avoid leaving file handles open and ensure proper cleanup.
Apply this diff:
```diff
-        tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=f"_{uploaded.name}")
-        tmp_file.write(uploaded.getbuffer())
-        tmp_file.close()
-        st.session_state.uploaded_file_path = tmp_file.name
+        with tempfile.NamedTemporaryFile(delete=False, suffix=f"_{uploaded.name}") as tmp_file:
+            tmp_file.write(uploaded.getbuffer())
+        st.session_state.uploaded_file_path = tmp_file.name
```
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import BaseMetric, GEval
```
🛠️ Refactor suggestion
Remove unused imports (Ruff F401)
evaluate and BaseMetric are imported but unused.
```diff
-from opik.evaluation import evaluate
-from opik.evaluation.metrics import BaseMetric, GEval
+from opik.evaluation.metrics import GEval
```
```python
    # ---------------- ingestion helpers --------------
    def process_invoice(self, file_path: str, file_name: str, mime_type: str) -> Dict[str, Any]:
        ingest = self.gx.ingest(documents=[
            Document(
                bucket_id=int(self.bucket_id),
                file_name=file_name,
                file_path=file_path,
                file_type=mime_type.split("/")[-1],
            )
        ])
```
Use file extension for Document.file_type (fixes .docx and similar)
Using MIME subtype (e.g., vnd.openxmlformats-officedocument...) as file_type will likely break ingestion for docx. Use the filename extension instead.
```diff
-        ingest = self.gx.ingest(documents=[
-            Document(
-                bucket_id=int(self.bucket_id),
-                file_name=file_name,
-                file_path=file_path,
-                file_type=mime_type.split("/")[-1],
-            )
-        ])
+        file_ext = os.path.splitext(file_name)[1].lstrip(".").lower() or (mime_type.split("/")[-1] if mime_type else "")
+        ingest = self.gx.ingest(
+            documents=[
+                Document(
+                    bucket_id=int(self.bucket_id),
+                    file_name=file_name,
+                    file_path=file_path,
+                    file_type=file_ext,
+                )
+            ]
+        )
```

Committable suggestion skipped: line range outside the PR's diff.
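To make the extension-vs-MIME reasoning concrete, here is a minimal standalone sketch of a mapper; the mapping table is an assumption about values Ground X might accept, not a documented contract, and `derive_file_type` is a hypothetical helper name:

```python
import os

# Common MIME types mapped to short file-type strings (assumed Ground X-friendly values).
MIME_TO_FILE_TYPE = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/msword": "doc",
    "image/png": "png",
    "image/jpeg": "jpg",
    "text/plain": "txt",
}

def derive_file_type(file_name: str, mime_type: str = "") -> str:
    """Prefer the filename extension; fall back to a MIME lookup, then the raw subtype."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    if ext:
        return ext
    if mime_type in MIME_TO_FILE_TYPE:
        return MIME_TO_FILE_TYPE[mime_type]
    return mime_type.split("/")[-1] if mime_type else ""
```

With this, `derive_file_type("report.DOCX")` yields `docx` rather than the long vendor subtype.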
```python
    def _fetch_xray(self, expected_name: str):
        docs = self.gx.documents.lookup(id=self.bucket_id).documents
        doc = next((d for d in docs if getattr(d, "file_name", None) == expected_name), docs[0])
        if getattr(doc, "xray_url", None):
            r = requests.get(doc.xray_url)
            r.raise_for_status()
            return r.json()
        raise RuntimeError("X-Ray URL missing")
```
Avoid wrong-document fallback and add HTTP timeout
Defaulting to the first document if the exact name isn’t found can fetch the wrong X-Ray. Also, requests without a timeout can hang.
```diff
-        docs = self.gx.documents.lookup(id=self.bucket_id).documents
-        doc = next((d for d in docs if getattr(d, "file_name", None) == expected_name), docs[0])
-        if getattr(doc, "xray_url", None):
-            r = requests.get(doc.xray_url)
+        docs = self.gx.documents.lookup(id=self.bucket_id).documents
+        doc = next((d for d in docs if getattr(d, "file_name", None) == expected_name), None)
+        if not doc:
+            raise RuntimeError(f"Document '{expected_name}' not found in bucket {self.bucket_id}")
+        if getattr(doc, "xray_url", None):
+            r = requests.get(doc.xray_url, timeout=30)
             r.raise_for_status()
             return r.json()
-        raise RuntimeError("X-Ray URL missing")
+        raise RuntimeError(f"X-Ray URL missing for '{expected_name}'")
```
```toml
    {name = "Your Name", email = "your.email@example.com"}
]
readme = "README.md"
requires-python = ">=3.8.1"
```
💡 Verification agent
🧩 Analysis chain
Python version vs. type annotations mismatch.
Project requires Python >=3.8.1 but groundx_utils.py uses tuple[...] generics (Python 3.9+). Either:
- Bump requires-python to >=3.9, or
- Make the annotations 3.8-compatible (recommended for now).
If you want to keep 3.8 support, see the suggested fix in groundx_utils.py to switch to typing.Tuple[...].
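As context for that choice, the two 3.8-compatible options can be compared in a minimal, standard-library-only sketch (the `process_*` function names are made up for illustration):

```python
# Option 1: defer annotation evaluation (PEP 563), so PEP 585 generics like
# tuple[...] parse even on Python 3.8. Must be the first import in the file.
from __future__ import annotations

from typing import Any, Dict, Tuple


def process_new_style(data: Any) -> tuple[Dict[str, Any], bool]:
    # Legal on 3.8 only because of the __future__ import above.
    return {"data": data}, True


# Option 2: typing.Tuple works on every supported version with no extra import.
def process_old_style(data: Any) -> Tuple[Dict[str, Any], bool]:
    return {"data": data}, True


print(process_new_style("x") == process_old_style("x"))  # → True
```

Both options avoid bumping `requires-python`; `typing.Tuple` is the smaller diff, which is why the suggestion below prefers it.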
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== pyproject.toml (groundX-doc-pipeline) =="
if [ -f groundX-doc-pipeline/pyproject.toml ]; then
  sed -n '1,120p' groundX-doc-pipeline/pyproject.toml || true
else
  echo "groundX-doc-pipeline/pyproject.toml not found"
fi
echo
echo "== locate groundx_utils.py =="
file=$(find . -type f -name 'groundx_utils.py' -print -quit || true)
if [ -n "$file" ]; then
  echo "FOUND: $file"
  echo "---- start file ----"
  sed -n '1,300p' "$file" || true
  echo "---- end file ----"
else
  echo "groundx_utils.py not found via find. Searching repo for matches..."
  file=$(rg --files 2>/dev/null | rg 'groundx_utils.py$' || true)
  if [ -n "$file" ]; then
    echo "FOUND (rg): $file"
    sed -n '1,300p' "$file" || true
  else
    echo "groundx_utils.py not found in repo"
  fi
fi
echo
echo "== occurrences of 'tuple[' across repo =="
rg -n --hidden --line-number "tuple\[" -S || true
echo
echo "== occurrences of 'from __future__ import annotations' across repo =="
rg -n --hidden --line-number "from __future__ import annotations" -S || true
```

Length of output: 8751
Action required — Python version mismatch: pyproject specifies >=3.8.1 but code uses PEP 585 annotations (tuple[...])
Either bump requires-python to >=3.9 or make annotations 3.8-compatible. Recommended: change the annotation to typing.Tuple[...] for immediate 3.8 support.
- Files to update:
- groundX-doc-pipeline/pyproject.toml — current: requires-python = ">=3.8.1"
- groundX-doc-pipeline/groundx_utils.py — return annotation uses tuple[Dict[str, Any], bool]
Suggested minimal fix (change typing import + return annotation):
```diff
--- a/groundX-doc-pipeline/groundx_utils.py
+++ b/groundX-doc-pipeline/groundx_utils.py
@@
-from typing import Any, Dict, Optional
+from typing import Any, Dict, Optional, Tuple
@@
-def process_document(gx: GroundX, bucket_id: str, file_to_process: Any, file_path: str) -> tuple[Dict[str, Any], bool]:
+def process_document(gx: GroundX, bucket_id: str, file_to_process: Any, file_path: str) -> Tuple[Dict[str, Any], bool]:
```

Notes: I found tuple[...] only in groundX-doc-pipeline/groundx_utils.py; other files either already use typing.Tuple or use PEP 563 elsewhere. Applying the patch above resolves the mismatch without raising the project Python minimum.
📝 Committable suggestion
In `pyproject.toml` (unchanged):

```toml
requires-python = ">=3.8.1"
```

In `groundx_utils.py`:

```python
from typing import Any, Dict, Optional, Tuple

def process_document(gx: GroundX, bucket_id: str, file_to_process: Any, file_path: str) -> Tuple[Dict[str, Any], bool]:
```
Copy `.env.example` to `.env` and configure the following environment variables:

```
GROUNDX_API_KEY=your_groundx_api_key_here
```
🛠️ Refactor suggestion
Document all required environment variables and specify fence language
The README only lists GROUNDX_API_KEY, but the evaluation flow requires OPENAI_API_KEY and optionally COMET_API_KEY. Also, markdownlint flags MD040 for this fence (missing language).
Apply this diff to fix both:
````diff
-Copy `.env.example` to `.env` and configure the following environment variables:
-
-```
-GROUNDX_API_KEY=your_groundx_api_key_here
-```
+Copy `.env.example` to `.env` and configure the following environment variables:
+
+```dotenv
+GROUNDX_API_KEY=your_groundx_api_key_here
+# Required for GPT-4o comparisons in evaluation_geval.py
+OPENAI_API_KEY=your_openai_api_key_here
+# Optional: needed only if you want Opik/Comet dataset logging
+COMET_API_KEY=your_comet_opik_api_key_here
+```
````
+```🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
25-25: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In groundX-doc-pipeline/README.md around lines 23 to 27, the env instructions
only list GROUNDX_API_KEY and use an unlabelled code fence; update the section
to use a fenced block with language "dotenv" and document required/optional
variables by adding OPENAI_API_KEY (required for GPT-4o comparisons in
evaluation_geval.py) and COMET_API_KEY (optional, for Opik/Comet dataset
logging), and include brief comments for each variable so readers know which are
required vs optional.
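A startup check along these lines makes the required/optional split explicit. The variable names mirror the README, while the helper function itself is a hypothetical sketch:

```python
# Names mirror the README; which are required vs optional follows this review.
REQUIRED = ("GROUNDX_API_KEY", "OPENAI_API_KEY")
OPTIONAL = ("COMET_API_KEY",)


def missing_required(env):
    """Return the required variable names absent (or empty) in the mapping."""
    return [name for name in REQUIRED if not env.get(name)]


def present_optional(env):
    """Return the optional variable names that are actually set."""
    return [name for name in OPTIONAL if env.get(name)]


# Normally you would pass os.environ; a literal mapping keeps the sketch testable.
print(missing_required({"GROUNDX_API_KEY": "gx-123"}))  # → ['OPENAI_API_KEY']
```

Failing fast on the result of `missing_required(os.environ)` gives users a clearer error than a mid-run authentication failure.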
```python
import argparse
import mimetypes
from pathlib import Path
from typing import List, Dict, Any
```
🛠️ Refactor suggestion
Use Any in type hints or remove the import to satisfy Ruff F401
Currently, Any is imported but unused. Either remove it or use it to strengthen typings for invoice_data.
```diff
-from typing import List, Dict, Any
+from typing import List, Dict, Any
@@
-    invoice_data: List[Dict] = []
+    invoice_data: List[Dict[str, Any]] = []
```

This keeps Any and improves type specificity.
📝 Committable suggestion
```python
from typing import List, Dict, Any

invoice_data: List[Dict[str, Any]] = []
```
🧰 Tools
🪛 Ruff (0.12.2)
4-4: typing.Any imported but unused
Remove unused import: typing.Any
(F401)
🤖 Prompt for AI Agents
In groundX-doc-pipeline/run_evaluation_cli.py around line 4, the import "Any" is
currently unused which triggers Ruff F401; either remove "Any" from the import
list or apply it to the invoice_data typing. To fix, annotate invoice_data where
it's defined/used (e.g., use Dict[str, Any] or List[Dict[str, Any]] as
appropriate) so Any is used, or delete ", Any" from the typing import if you
prefer no changes to annotations.
```python
        except Exception as e:
            print(f"Failed to process {path.name}: {e}")
            return
```
💡 Verification agent
🧩 Analysis chain
Don’t abort the entire run on a single file failure
Failing one invoice currently returns early, aborting the run. Prefer continuing and reporting what was processed.
```diff
-        except Exception as e:
-            print(f"Failed to process {path.name}: {e}")
-            return
+        except Exception as e:
+            print(f"Failed to process {path.name}: {e}")
+            continue
+
+    if not invoice_data:
+        print("No invoices were processed successfully. Exiting.")
+        return
```

Please verify downstream expects at least one invoice; the averages computation already guards against empty lists.
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
FILE="groundX-doc-pipeline/run_evaluation_cli.py"
echo "=== Showing file content (if exists): $FILE ==="
if [ -f "$FILE" ]; then
  nl -ba "$FILE" | sed -n '1,400p'
else
  echo "File not found: $FILE"
fi
echo
echo "=== Searching for invoice-related symbols in repository under groundX-doc-pipeline ==="
rg -n --hidden --no-ignore-vcs "invoice_data|invoice|averag|average|mean" groundX-doc-pipeline || true
echo
echo "=== Showing 'return' and surrounding lines in the file ==="
if [ -f "$FILE" ]; then
  rg -n --context 3 "return" "$FILE" || true
fi
```

Length of output: 465
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
FILE="groundX-doc-pipeline/run_evaluation_cli.py"
echo "=== File exists? ==="
if [ -f "$FILE" ]; then
  echo "Yes"
else
  echo "No"
fi
echo
echo "=== Printing file with line numbers (up to 400 lines) ==="
if [ -f "$FILE" ]; then
  awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '1,400p' || true
else
  echo "File not found: $FILE"
fi
echo
echo "=== Showing lines 60-100 with numbers ==="
if [ -f "$FILE" ]; then
  awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '60,100p' || true
fi
echo
echo "=== Searching for invoice-related symbols under groundX-doc-pipeline ==="
if command -v rg >/dev/null 2>&1; then
  rg -n --hidden --no-ignore-vcs "invoice_data|invoice|averag|average|mean" groundX-doc-pipeline || true
else
  grep -RIn -E "invoice_data|invoice|averag|average|mean" groundX-doc-pipeline || true
fi
echo
echo "=== Showing 'return' occurrences with 3 lines of context in the file ==="
if [ -f "$FILE" ]; then
  grep -n -C3 "return" "$FILE" || true
fi
```

Length of output: 18699
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
FILE="groundX-doc-pipeline/evaluation_geval.py"
echo "=== File exists? ==="
if [ -f "$FILE" ]; then
  echo "Yes"
else
  echo "No"
  exit 0
fi
echo
echo "=== Printing file with line numbers (1-400) ==="
awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '1,400p'
echo
echo "=== Showing 'def run' with context (lines 200-340) ==="
awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '200,340p'
```

Length of output: 20242
Don't abort the whole run on a single file failure — continue and handle the zero-success case
Current code returns inside the per-file except handler, which stops processing remaining invoices. Change the except to continue and exit only if no invoices were processed successfully.
- File to change: groundX-doc-pipeline/run_evaluation_cli.py — except block around lines 79–81
Suggested patch:
```diff
-        except Exception as e:
-            print(f"Failed to process {path.name}: {e}")
-            return
+        except Exception as e:
+            print(f"Failed to process {path.name}: {e}")
+            continue
+
+    if not invoice_data:
+        print("No invoices were processed successfully. Exiting.")
+        return
```
+ returnNote: evaluation_geval.run and the average-score calculations already handle empty result lists safely, but it's clearer to bail out with a message if no invoices succeeded.
📝 Committable suggestion
```python
        except Exception as e:
            print(f"Failed to process {path.name}: {e}")
            continue

    if not invoice_data:
        print("No invoices were processed successfully. Exiting.")
        return
```
🤖 Prompt for AI Agents
In groundX-doc-pipeline/run_evaluation_cli.py around lines 79–81, the per-file
except handler currently returns on error which aborts the entire run; change
the except block to continue so a single file failure doesn't stop processing
remaining invoices, and after the processing loop check if the count/list of
successful invoice results is empty — if so, print an explanatory message and
exit with a non-zero status; otherwise proceed with the existing evaluation and
averaging logic.
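The loop shape the prompt describes can be sketched in isolation. Here `run` and the fake per-item work are stand-ins, not the real CLI code:

```python
# Skip-and-continue pattern: one bad item is logged and skipped, and the run
# bails out only when nothing succeeded.
def run(paths):
    results = []
    for path in paths:
        try:
            if "bad" in path:
                raise ValueError("unreadable file")  # simulated per-file failure
            results.append(path.upper())
        except Exception as e:
            print(f"Failed to process {path}: {e}")
            continue  # keep going instead of returning early
    if not results:
        print("No items were processed successfully. Exiting.")
        return None
    return results


print(run(["a.pdf", "bad.pdf", "c.pdf"]))  # → ['A.PDF', 'C.PDF']
```

The single failure no longer discards the work already done on the other items, which is the behavior the review asks for.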
```python
    ports=[8501],  # Default port for streamlit
    cpu=4,
    memory="2Gi",
    entrypoint=["streamlit", "run", "app.py"],
```
💡 Verification agent
🧩 Analysis chain
Entrypoint may not resolve app.py depending on container working directory.
If the container’s working directory isn’t the same folder as app.py, Streamlit won’t find the file. Either set a working directory on the Pod or reference the full/relative path explicitly.
Apply this diff to use an explicit path (since server.py is in groundX-doc-pipeline/, app.py is in the same directory):
```diff
-    entrypoint=["streamlit", "run", "app.py"],
+    entrypoint=["streamlit", "run", "groundX-doc-pipeline/app.py"],
```

Alternatively (if supported), set `Pod(working_dir="groundX-doc-pipeline")` and keep `"app.py"`.
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "=== rg: search for 'entrypoint' occurrences ==="
rg --hidden --glob '!.git' -n -S "entrypoint" || true
echo; echo "=== rg: search for 'streamlit' occurrences ==="
rg --hidden --glob '!.git' -n -S "streamlit" || true
echo; echo "=== rg: search for 'app.py' occurrences ==="
rg --hidden --glob '!.git' -n -S "app.py" || true
echo; echo "=== rg: search for 'server.py' occurrences ==="
rg --hidden --glob '!.git' -n -S "server.py" || true
# If any server.py files were found, print their contents
echo; echo "=== Showing candidate server.py files (up to 300 lines each) ==="
for f in $(rg --hidden --glob '!.git' -S --files-with-matches "server.py" || true); do
  echo "---- $f ----"
  sed -n '1,300p' "$f" || true
done
# Show any files that contain an entrypoint line (with surrounding 3 lines)
echo; echo "=== Files with 'entrypoint' (showing 3 lines of context) ==="
rg --hidden --glob '!.git' -n -S -C 3 "entrypoint" || true
# Show locations of app.py files (print up to 200 lines of each)
echo; echo "=== Showing candidate app.py files (up to 200 lines each) ==="
for f in $(rg --hidden --glob '!.git' -S --files-with-matches "app.py" || true); do
  echo "---- $f ----"
  sed -n '1,200p' "$f" || true
done
```

Length of output: 209939
Fix: Streamlit entrypoint may not resolve app.py in the Pod
streamlit run app.py is a relative path; if the Pod's working directory isn't the folder containing app.py, Streamlit will fail to find the file. Update groundX-doc-pipeline/server.py to either use an explicit path or set the Pod working directory.
- File to change:
- groundX-doc-pipeline/server.py — line with entrypoint (currently line 16)
Suggested diffs (choose one):
Option A — explicit path:

```diff
-    entrypoint=["streamlit", "run", "app.py"],
+    entrypoint=["streamlit", "run", "groundX-doc-pipeline/app.py"],
```

Option B — set Pod working directory (preferred if Pod supports it):

```diff
     memory="2Gi",
+    working_dir="groundX-doc-pipeline",
     entrypoint=["streamlit", "run", "app.py"],
```

Prefer Option B when supported; otherwise apply Option A.
🤖 Prompt for AI Agents
In groundX-doc-pipeline/server.py around line 16, the entrypoint uses a relative
path ("streamlit run app.py") which may fail if the Pod's working directory is
different; update the file to either (preferred) set
working_dir="groundX-doc-pipeline" immediately above the entrypoint so Streamlit
can resolve "app.py" or (if the Pod runtime doesn't support working_dir) replace
the entrypoint with an explicit path "streamlit run
groundX-doc-pipeline/app.py"; make only one of these changes and keep the rest
of the config unchanged.
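Beyond the two options above, a third, hedged alternative is to resolve `app.py` relative to `server.py` itself inside Python, which works regardless of the container's cwd (this assumes the Beam `entrypoint` accepts an absolute path; the variable names here are illustrative):

```python
# Build the entrypoint from the location of this file rather than the cwd,
# so "app.py" resolves no matter where the container process is started.
from pathlib import Path

app_path = Path(__file__).resolve().parent / "app.py"
entrypoint = ["streamlit", "run", str(app_path)]
print(entrypoint[-1].endswith("app.py"))  # → True
```

This keeps the launcher self-locating, at the cost of computing the path at import time instead of relying on deployment configuration.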