# Add GroundX doc processing pipeline files #182
## Conversation
## Walkthrough

Introduces a new Streamlit-based Ground X document analysis app with chat, supporting utilities for Ground X ingestion and X-Ray retrieval, an evaluation pipeline using GEval and GPT-4o comparison, a CLI to run evaluations, packaging and configuration files, environment variable examples, and a Beam-based server launcher. No existing code is modified.
## Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Streamlit App
    participant GroundX Utils
    participant Ground X API
    participant Ollama LLM
    User->>Streamlit App: Upload/select document
    Streamlit App->>GroundX Utils: process_document(file)
    GroundX Utils->>Ground X API: ingest + poll status
    Ground X API-->>GroundX Utils: processing complete + xray_url
    GroundX Utils->>Ground X API: fetch X-Ray JSON
    Ground X API-->>GroundX Utils: xray_data
    GroundX Utils-->>Streamlit App: xray_data, used_existing
    Streamlit App->>User: Show X-Ray tabs and summaries
    User->>Streamlit App: Ask question in Chat
    Streamlit App->>Streamlit App: prepare_chat_context(xray_data, prompt)
    Streamlit App->>Ollama LLM: generate(prompt, context)
    Ollama LLM-->>Streamlit App: response
    Streamlit App-->>User: Chat answer
```

```mermaid
sequenceDiagram
    autonumber
    participant CLI
    participant EvaluatorGEval
    participant Ground X API
    participant OpenAI API
    participant GEval (Opik)
    CLI->>EvaluatorGEval: run(invoices, questions)
    loop per invoice
        EvaluatorGEval->>Ground X API: ingest + poll + fetch X-Ray
        Ground X API-->>EvaluatorGEval: xray_data
        EvaluatorGEval->>OpenAI API: reasoning (image/PDF as needed)
        OpenAI API-->>EvaluatorGEval: model_output
        EvaluatorGEval->>GEval (Opik): score(model_output vs expected)
        GEval (Opik)-->>EvaluatorGEval: score, reason
    end
    EvaluatorGEval-->>CLI: results and averages
    CLI-->>CLI: print breakdown
```
## Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 18
🧹 Nitpick comments (18)
groundX-doc-pipeline/.env.example (1)
1-3: Add SAMPLE_FILE_NAME and address a linter nit (ending newline). The app reads SAMPLE_FILE_NAME (line 213 in app.py), but it's not present here. Also, add a trailing newline to satisfy dotenv-linter.
Apply this diff:
```diff
 GROUNDX_API_KEY=your-groundx-api-key-here
 OPENAI_API_KEY=your-openai-api-key-here
 COMET_API_KEY=your-comet-api-key-here
+SAMPLE_FILE_NAME=sample-file.pdf
```

Note: keep this file as an example and never commit real secrets.
groundX-doc-pipeline/server.py (1)
3-17: Environment variables are not passed to the Pod (GROUNDX_API_KEY, OPENAI_API_KEY). The app loads API keys from the environment, but the Pod doesn't receive them. Without passing env vars, the app will likely fail at client creation.
If Pod supports env passthrough, add something like:
```python
# Example (adjust to your Beam API):
streamlit_server = Pod(
    image=Image().add_python_packages([...]),
    ports=[8501],
    cpu=4,
    memory="2Gi",
    entrypoint=["streamlit", "run", "groundX-doc-pipeline/app.py"],
    env={
        "GROUNDX_API_KEY": os.environ.get("GROUNDX_API_KEY", ""),
        "OPENAI_API_KEY": os.environ.get("OPENAI_API_KEY", ""),
        "COMET_API_KEY": os.environ.get("COMET_API_KEY", ""),
        "SAMPLE_FILE_NAME": os.environ.get("SAMPLE_FILE_NAME", ""),
    },
)
```

If Beam has a first-class Secrets mechanism, prefer that over env to avoid accidental leakage.
groundX-doc-pipeline/groundx_utils.py (1)
2-2: Remove unused import.
`tempfile` is imported but unused. Apply this diff:

```diff
-import tempfile
```

groundX-doc-pipeline/app.py (4)
1-5: Remove unused typing imports.
`Any` and `Dict` are not used. Apply this diff:

```diff
-import os
-import tempfile
-import requests
-from typing import Any, Dict
+import os
+import tempfile
+import requests
```
414-418: Collapse nested context managers for clarity.Use a single with statement.
Apply this diff:
```diff
-    with st.chat_message("assistant"):
-        with st.spinner("Thinking..."):
-            context = prepare_chat_context(xray, prompt)
-            response = generate_chat_response(prompt, context)
-            st.session_state.chat_history.append({"role": "assistant", "content": response})
-            st.markdown(response)
+    with st.chat_message("assistant"), st.spinner("Thinking..."):
+        context = prepare_chat_context(xray, prompt)
+        response = generate_chat_response(prompt, context)
+        st.session_state.chat_history.append({"role": "assistant", "content": response})
+        st.markdown(response)
```
114-140: Ollama-only chat backend may fail in hosted environments; add a fallback. The Beam Pod likely won't have an Ollama daemon on localhost. Provide a fallback to OpenAI if OPENAI_API_KEY is set; otherwise, retain Ollama.
I can prepare a small helper to detect availability and route accordingly. Do you want me to add that with minimal config toggles?
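As a rough sketch of what that helper could look like (the probe URL, function names, and the routing logic are assumptions for illustration, not the app's actual API):

```python
import os
from urllib.request import urlopen
from urllib.error import URLError

OLLAMA_URL = "http://localhost:11434"  # default Ollama daemon address (assumption)

def ollama_available(timeout: float = 0.5) -> bool:
    """Probe the local Ollama daemon; any connection error means 'not available'."""
    try:
        with urlopen(OLLAMA_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def pick_backend() -> str:
    """Prefer local Ollama; fall back to OpenAI when an API key is configured."""
    if ollama_available():
        return "ollama"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError("No chat backend available: start Ollama or set OPENAI_API_KEY")
```

The app could call `pick_backend()` once at startup and route its response-generation call accordingly.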
18-52: CSS overrides depend on private Streamlit classes and may break across versions. These class names are not part of the public API. Consider keeping the overrides minimal or feature-gating them.
If you’ve verified against your target Streamlit version, feel free to ignore. Otherwise, guard with a comment and be prepared to adjust on upgrades.
groundX-doc-pipeline/README.md (3)
19-21: Clarify dependency installation (uv vs pip) and prerequisites. "uv sync" assumes uv is installed. Add a quick note or a fallback to pip so users aren't blocked.
Proposed wording (no diff since it adds lines):
- Add a line above the code fence: “If you don’t have uv installed, see https://docs.astral.sh/uv/ or run pip install -U uv.”
- Or add an alternative block: “pip install -r requirements.txt (or pip install -e . if using pyproject).”
37-41: Add a cd step to avoid path confusion. Make it explicit that you should run from the project directory.
````diff
-```bash
-streamlit run app.py
-```
+```bash
+cd groundX-doc-pipeline
+streamlit run app.py
+```
````
45-56: Fix MD040: add a language to the fenced block. The project structure fence should specify a language (e.g., text).

````diff
-```
+```text
 groundX-doc-pipeline/
 ├── app.py                 # Main Streamlit application (uses groundx_utils.py)
 ├── groundx_utils.py       # Utility functions for Ground X operations
 ├── .env                   # Environment variables (create from .env.example)
 ├── file/                  # Folder containing files for running evaluation
 └── README.md              # This file

 📁 Evaluation Tools:
 ├── evaluation_geval.py    # GEval framework evaluation
 └── run_evaluation_cli.py  # CLI evaluation runner
-```
+```
````

groundX-doc-pipeline/run_evaluation_cli.py (4)
16-19: Deterministic file ordering for reproducible evaluations. iterdir order is not guaranteed. Sort by filename for consistent runs.

```diff
-    files = [p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in ALLOWED_EXTENSIONS]
+    files = sorted(
+        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in ALLOWED_EXTENSIONS),
+        key=lambda p: p.name.lower()
+    )
```
141-151: Simplify exact/partial matching branches (SIM114). Combine the two branches to reduce duplication.

```diff
-                # Exact match validation
-                if q_lower == expected_q_lower:
-                    inv["expected_outputs"][q] = expected_a
-                    matched = True
-                    break
-
-                # Partial match validation using key identifying words
-                elif _smart_partial_match(q_lower, expected_q_lower):
-                    inv["expected_outputs"][q] = expected_a
-                    matched = True
-                    break
+                # Exact or partial match
+                if q_lower == expected_q_lower or _smart_partial_match(q_lower, expected_q_lower):
+                    inv["expected_outputs"][q] = expected_a
+                    matched = True
+                    break
```
178-180: Remove extraneous f-prefix in f-string (F541). No placeholders are present.

```diff
-    print(f"\nDetailed Results:")
+    print("\nDetailed Results:")
```
129-156: Brittle dataset-key detection by exact filename base. Matching only when Path(...).stem equals "electricity" or "energy-plus" is fragile. Consider a contains-based match so "electricity-invoice-1.pdf" still maps.

```diff
-        base = Path(inv["name"]).stem.lower()
-        if base in EXPECTED_ANSWERS:
+        base = Path(inv["name"]).stem.lower()
+        dataset_key = next((k for k in EXPECTED_ANSWERS if k in base), None)
+        if dataset_key:
-            for q in questions:
+            for q in questions:
@@
-                for expected_q, expected_a in EXPECTED_ANSWERS[base].items():
+                for expected_q, expected_a in EXPECTED_ANSWERS[dataset_key].items():
```

groundX-doc-pipeline/evaluation_geval.py (4)
17-38: Hard fail on empty expected_output may be too strict. The evaluator raises if expected_output is empty. The CLI fills defaults, but if this function is reused elsewhere it will hit an error path. Consider treating an empty expected_output as "data not available" to avoid hard failures.

```diff
-    if not model_output or not expected_output:
-        raise ValueError("Model output and expected output cannot be empty")
+    if not model_output:
+        raise ValueError("Model output cannot be empty")
+    if not expected_output:
+        expected_output = "data not available"
```

If you intended the strict behavior, ignore this suggestion.
245-245: Avoid double dataset clear/insert; insert once after enriching samples. You clear/insert before and after evaluation. Keep a single insert at the end to reduce churn (and potential race conditions if datasets are viewed live).

```diff
-        dataset.clear(); dataset.insert(samples)
+        # Defer dataset insert until after evaluation so samples include scores and outputs.
```
261-261: Rename unused loop variable (B007). The index isn't used in this loop.

```diff
-        for i, sample in enumerate(samples):
+        for _i, sample in enumerate(samples):
```
179-179: Remove extraneous f-prefix in debug print (F541). Minor cleanup.

```diff
-            print(f"DEBUG: Using image file directly")
+            print("DEBUG: Using image file directly")
```

Note: the broader fix in the PDF block comment also addresses similar prints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
⛔ Files ignored due to path filters (4)
- `groundX-doc-pipeline/assets/groundx.png` is excluded by `!**/*.png`
- `groundX-doc-pipeline/file/electricity.pdf` is excluded by `!**/*.pdf`
- `groundX-doc-pipeline/file/energy-plus.pdf` is excluded by `!**/*.pdf`
- `groundX-doc-pipeline/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (8)
- `groundX-doc-pipeline/.env.example` (1 hunks)
- `groundX-doc-pipeline/README.md` (1 hunks)
- `groundX-doc-pipeline/app.py` (1 hunks)
- `groundX-doc-pipeline/evaluation_geval.py` (1 hunks)
- `groundX-doc-pipeline/groundx_utils.py` (1 hunks)
- `groundX-doc-pipeline/pyproject.toml` (1 hunks)
- `groundX-doc-pipeline/run_evaluation_cli.py` (1 hunks)
- `groundX-doc-pipeline/server.py` (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
groundX-doc-pipeline/run_evaluation_cli.py (1)
groundX-doc-pipeline/evaluation_geval.py (3)
`create_evaluator_geval` (288-295), `process_invoice` (117-127), `run` (229-285)
groundX-doc-pipeline/app.py (1)
groundX-doc-pipeline/groundx_utils.py (5)
`create_client` (18-22), `ensure_bucket` (25-34), `check_file_exists` (115-121), `get_xray_for_existing_document` (123-136), `process_document` (138-159)
groundX-doc-pipeline/server.py (2)
imagegen-janus-pro/app.py (1)
`main` (198-264)

audio-analysis-toolkit/app.py (1)

`main` (662-810)
groundX-doc-pipeline/evaluation_geval.py (2)
gpt-oss-vs-qwen3/code_evaluation_opik.py (1)
`evaluate_reasoning` (4-212)

code-model-comparison/code_evaluation_opik.py (1)
evaluate_code(4-172)
🪛 Ruff (0.12.2)
groundX-doc-pipeline/run_evaluation_cli.py
4-4: typing.Any imported but unused
Remove unused import: typing.Any
(F401)
142-151: Combine if branches using logical or operator
Combine if branches
(SIM114)
179-179: f-string without any placeholders
Remove extraneous f prefix
(F541)
225-227: Use a single if statement instead of nested if statements
(SIM102)
groundX-doc-pipeline/groundx_utils.py
2-2: tempfile imported but unused
Remove unused import: tempfile
(F401)
groundX-doc-pipeline/app.py
4-4: typing.Any imported but unused
Remove unused import
(F401)
4-4: typing.Dict imported but unused
Remove unused import
(F401)
236-236: Use a context manager for opening files
(SIM115)
414-415: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
groundX-doc-pipeline/evaluation_geval.py
12-12: opik.evaluation.evaluate imported but unused
Remove unused import: opik.evaluation.evaluate
(F401)
13-13: opik.evaluation.metrics.BaseMetric imported but unused
Remove unused import: opik.evaluation.metrics.BaseMetric
(F401)
179-179: f-string without any placeholders
Remove extraneous f prefix
(F541)
186-186: f-string without any placeholders
Remove extraneous f prefix
(F541)
215-215: f-string without any placeholders
Remove extraneous f prefix
(F541)
217-217: f-string without any placeholders
Remove extraneous f prefix
(F541)
245-245: Multiple statements on one line (semicolon)
(E702)
261-261: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
🪛 LanguageTool
groundX-doc-pipeline/README.md
[grammar] ~7-~7: There might be a mistake here.
Context: ...A document processing and X-Ray analysis - Streamlit for the UI - Ollama for servin...
(QB_NEW_EN)
[grammar] ~8-~8: There might be a mistake here.
Context: ...nd X-Ray analysis - Streamlit for the UI - Ollama for serving LLM locally --- ## ...
(QB_NEW_EN)
[grammar] ~61-~61: There might be a mistake here.
Context: ...the document to be processed by Ground X 3. Explore the X-Ray analysis results in di...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...-Ray analysis results in different tabs: - JSON Output: Raw analysis data - Narr...
(QB_NEW_EN)
[grammar] ~63-~63: There might be a mistake here.
Context: ...abs: - JSON Output: Raw analysis data - Narrative Summary: Extracted narratives ...
(QB_NEW_EN)
[grammar] ~64-~64: There might be a mistake here.
Context: ... Narrative Summary: Extracted narratives - File Summary: Document overview - Sug...
(QB_NEW_EN)
[grammar] ~65-~65: There might be a mistake here.
Context: ...ves - File Summary: Document overview - Suggested Text: AI-suggested content ...
(QB_NEW_EN)
[grammar] ~66-~66: There might be a mistake here.
Context: ... - Suggested Text: AI-suggested content - Extracted Text: Raw text extraction -...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ... - Extracted Text: Raw text extraction - Keywords: Document keywords 4. Use the c...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ...raction - Keywords: Document keywords 4. Use the chat interface to ask questions ...
(QB_NEW_EN)
[grammar] ~75-~75: There might be a mistake here.
Context: ...ucket creation and document organization - Document Ingestion: Support for PDF, W...
(QB_NEW_EN)
[grammar] ~76-~76: There might be a mistake here.
Context: ...ort for PDF, Word docs, images, and more - X-Ray Analysis: Rich structured data w...
(QB_NEW_EN)
[grammar] ~77-~77: There might be a mistake here.
Context: ...ies, page chunks, keywords, and metadata - Context Engineering: Intelligent conte...
(QB_NEW_EN)
[grammar] ~78-~78: There might be a mistake here.
Context: ...gent context preparation for LLM queries - AI Chat Interface: Interactive Q&A pow...
(QB_NEW_EN)
[grammar] ~83-~83: There might be a mistake here.
Context: ... ## 📬 Stay Updated with Our Newsletter! Get a FREE Data Science eBook 📖 with 1...
(QB_NEW_EN)
[grammar] ~85-~85: There might be a mistake here.
Context: ...📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! St...
(QB_NEW_EN)
[style] ~93-~93: Using many exclamation marks might seem excessive (in this case: 4 exclamation marks for a text that’s 2081 characters long)
Context: ... Contribution Contributions are welcome! Please fork the repository and submit a...
(EN_EXCESSIVE_EXCLAMATION)
🪛 markdownlint-cli2 (0.17.2)
groundX-doc-pipeline/README.md
25-25: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
45-45: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 dotenv-linter (3.3.0)
groundX-doc-pipeline/.env.example
[warning] 3-3: [EndingBlankLine] No blank line at the end of the file
(EndingBlankLine)
[warning] 3-3: [UnorderedKey] The COMET_API_KEY key should go before the GROUNDX_API_KEY key
(UnorderedKey)
🔇 Additional comments (5)
groundX-doc-pipeline/server.py (1)
4-12: Confirm Beam pip package name alignment with pyproject.Code imports from beam, while pyproject declares beam-client. This mismatch will cause ImportError at runtime.
Ensure the dependency name matches the import. See comment in pyproject.toml for the fix.
groundX-doc-pipeline/pyproject.toml (1)
11-19: Incorrect — beam vs beam-client and the OpenAI version are valid in this repo; do not apply the suggested diff. Short check summary:
- The repo declares and locks the packages the code uses — the original suggestions are not required.
Files/locations to note:
- groundX-doc-pipeline/pyproject.toml — contains "openai>=1.99.9" and "beam-client>=0.1.0".
- groundX-doc-pipeline/server.py — imports "from beam import Image, Pod" (line 1).
- groundX-doc-pipeline/uv.lock — shows beam_client distribution (sdist/wheel, e.g. beam_client-0.2.175) and openai-1.99.9 (sdist/wheel, upload-time entries present).
Conclusion:
- The dependency name mismatch flagged in the original comment is not actionable here (the lockfile contains the beam-client artifact and the code imports beam).
- The OpenAI version (1.99.9) appears in the lockfile (sdist/wheel entries), so it is not an invalid/unreleased reference in this repo.
Action: ignore the proposed diff. If you intended a stricter OpenAI constraint or a different Beam package, explain the target (PyPI package vs import name) and I’ll re-check for a safe change.
Likely an incorrect or invalid review comment.
groundX-doc-pipeline/groundx_utils.py (1)
36-51: Validate `file_type` derivation for non-PDFs (docx MIME). For DOCX, `file_to_process.type` is typically `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, so `file_type` becomes a long vendor string. Confirm Ground X expects that value vs. `docx`. If Ground X requires extensions, map common MIME types to expected short types before calling `Document(...)`. I can provide a mapper if needed.

groundX-doc-pipeline/app.py (1)
210-228: Default SAMPLE_FILE_NAME looks auto-generated and not documented. The default `tmpivkf8qf8_sample-file.pdf` is likely a placeholder. It's better to rely on an env var and document it in `.env.example` (see the suggested addition there). If you want a more discoverable default, consider `sample-file.pdf` and ship a sample in assets/.

groundX-doc-pipeline/README.md (1)
29-35: LGTM on Ollama setup steps. The steps to pull and serve the model are clear and accurate.
```python
# Initialize Streamlit session state
for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file"]:
    if key not in st.session_state:
        st.session_state[key] = None if key == "xray_data" else False
```
🛠️ Refactor suggestion
Initialize session state with correct default types per key.
Using False for string-typed keys (e.g., uploaded_file_name) is confusing. Initialize explicitly.
Apply this diff:
```diff
-for key in ["xray_data", "uploaded_file_path", "uploaded_file_name", "uploaded_file_type", "processing_complete", "used_existing_file", "auto_loaded_file"]:
-    if key not in st.session_state:
-        st.session_state[key] = None if key == "xray_data" else False
+st.session_state.setdefault("xray_data", None)
+st.session_state.setdefault("uploaded_file_path", None)
+st.session_state.setdefault("uploaded_file_name", None)
+st.session_state.setdefault("uploaded_file_type", None)
+st.session_state.setdefault("processing_complete", False)
+st.session_state.setdefault("used_existing_file", False)
+st.session_state.setdefault("auto_loaded_file", False)
+st.session_state.setdefault("chat_history", [])
```
```python
        tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=f"_{uploaded.name}")
        tmp_file.write(uploaded.getbuffer())
        tmp_file.close()
        st.session_state.uploaded_file_path = tmp_file.name
    elif st.session_state.uploaded_file_path and not st.session_state.processing_complete:
```
🛠️ Refactor suggestion
Use context manager when writing the uploaded file to disk.
Avoid leaving file handles open and ensure proper cleanup.
Apply this diff:
```diff
-        tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=f"_{uploaded.name}")
-        tmp_file.write(uploaded.getbuffer())
-        tmp_file.close()
-        st.session_state.uploaded_file_path = tmp_file.name
+        with tempfile.NamedTemporaryFile(delete=False, suffix=f"_{uploaded.name}") as tmp_file:
+            tmp_file.write(uploaded.getbuffer())
+        st.session_state.uploaded_file_path = tmp_file.name
```
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import BaseMetric, GEval
```
🛠️ Refactor suggestion
Remove unused imports (Ruff F401)
evaluate and BaseMetric are imported but unused.
```diff
-from opik.evaluation import evaluate
-from opik.evaluation.metrics import BaseMetric, GEval
+from opik.evaluation.metrics import GEval
```
```python
    # ---------------- ingestion helpers --------------
    def process_invoice(self, file_path: str, file_name: str, mime_type: str) -> Dict[str, Any]:
        ingest = self.gx.ingest(documents=[
            Document(
                bucket_id=int(self.bucket_id),
                file_name=file_name,
                file_path=file_path,
                file_type=mime_type.split("/")[-1],
            )
        ])
```
Use file extension for Document.file_type (fixes .docx and similar)
Using MIME subtype (e.g., vnd.openxmlformats-officedocument...) as file_type will likely break ingestion for docx. Use the filename extension instead.
```diff
-        ingest = self.gx.ingest(documents=[
-            Document(
-                bucket_id=int(self.bucket_id),
-                file_name=file_name,
-                file_path=file_path,
-                file_type=mime_type.split("/")[-1],
-            )
-        ])
+        file_ext = os.path.splitext(file_name)[1].lstrip(".").lower() or (mime_type.split("/")[-1] if mime_type else "")
+        ingest = self.gx.ingest(
+            documents=[
+                Document(
+                    bucket_id=int(self.bucket_id),
+                    file_name=file_name,
+                    file_path=file_path,
+                    file_type=file_ext,
+                )
+            ]
+        )
```

Committable suggestion skipped: line range outside the PR's diff.
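To make the extension-vs-MIME reasoning concrete, here is a minimal standalone sketch of a mapper; the mapping table is an assumption about values Ground X might accept, not a documented contract, and `derive_file_type` is a hypothetical helper name:

```python
import os

# Common MIME types mapped to short file-type strings (assumed Ground X-friendly values).
MIME_TO_FILE_TYPE = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/msword": "doc",
    "image/png": "png",
    "image/jpeg": "jpg",
    "text/plain": "txt",
}

def derive_file_type(file_name: str, mime_type: str = "") -> str:
    """Prefer the filename extension; fall back to a MIME lookup, then the raw subtype."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    if ext:
        return ext
    if mime_type in MIME_TO_FILE_TYPE:
        return MIME_TO_FILE_TYPE[mime_type]
    return mime_type.split("/")[-1] if mime_type else ""
```

With this, `derive_file_type("report.DOCX")` yields `docx` rather than the long vendor subtype.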
```python
    def _fetch_xray(self, expected_name: str):
        docs = self.gx.documents.lookup(id=self.bucket_id).documents
        doc = next((d for d in docs if getattr(d, "file_name", None) == expected_name), docs[0])
        if getattr(doc, "xray_url", None):
            r = requests.get(doc.xray_url)
            r.raise_for_status()
            return r.json()
        raise RuntimeError("X-Ray URL missing")
```
Avoid wrong-document fallback and add HTTP timeout
Defaulting to the first document if the exact name isn’t found can fetch the wrong X-Ray. Also, requests without a timeout can hang.
```diff
-        docs = self.gx.documents.lookup(id=self.bucket_id).documents
-        doc = next((d for d in docs if getattr(d, "file_name", None) == expected_name), docs[0])
-        if getattr(doc, "xray_url", None):
-            r = requests.get(doc.xray_url)
+        docs = self.gx.documents.lookup(id=self.bucket_id).documents
+        doc = next((d for d in docs if getattr(d, "file_name", None) == expected_name), None)
+        if not doc:
+            raise RuntimeError(f"Document '{expected_name}' not found in bucket {self.bucket_id}")
+        if getattr(doc, "xray_url", None):
+            r = requests.get(doc.xray_url, timeout=30)
             r.raise_for_status()
             return r.json()
-        raise RuntimeError("X-Ray URL missing")
+        raise RuntimeError(f"X-Ray URL missing for '{expected_name}'")
```
```toml
    {name = "Your Name", email = "your.email@example.com"}
]
readme = "README.md"
requires-python = ">=3.8.1"
```
💡 Verification agent
🧩 Analysis chain
Python version vs. type annotations mismatch.
Project requires Python >=3.8.1 but groundx_utils.py uses tuple[...] generics (Python 3.9+). Either:
- Bump requires-python to >=3.9, or
- Make the annotations 3.8-compatible (recommended for now).
If you want to keep 3.8 support, see the suggested fix in groundx_utils.py to switch to typing.Tuple[...].
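As context for that choice, the two 3.8-compatible options can be compared in a minimal, standard-library-only sketch (the `process_*` function names are made up for illustration):

```python
# Option 1: defer annotation evaluation (PEP 563), so PEP 585 generics like
# tuple[...] parse even on Python 3.8. Must be the first import in the file.
from __future__ import annotations

from typing import Any, Dict, Tuple


def process_new_style(data: Any) -> tuple[Dict[str, Any], bool]:
    # Legal on 3.8 only because of the __future__ import above.
    return {"data": data}, True


# Option 2: typing.Tuple works on every supported version with no extra import.
def process_old_style(data: Any) -> Tuple[Dict[str, Any], bool]:
    return {"data": data}, True


print(process_new_style("x") == process_old_style("x"))  # → True
```

Both options avoid bumping `requires-python`; `typing.Tuple` is the smaller diff, which is why the suggestion below prefers it.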
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== pyproject.toml (groundX-doc-pipeline) =="
if [ -f groundX-doc-pipeline/pyproject.toml ]; then
  sed -n '1,120p' groundX-doc-pipeline/pyproject.toml || true
else
  echo "groundX-doc-pipeline/pyproject.toml not found"
fi
echo
echo "== locate groundx_utils.py =="
file=$(find . -type f -name 'groundx_utils.py' -print -quit || true)
if [ -n "$file" ]; then
  echo "FOUND: $file"
  echo "---- start file ----"
  sed -n '1,300p' "$file" || true
  echo "---- end file ----"
else
  echo "groundx_utils.py not found via find. Searching repo for matches..."
  file=$(rg --files 2>/dev/null | rg 'groundx_utils.py$' || true)
  if [ -n "$file" ]; then
    echo "FOUND (rg): $file"
    sed -n '1,300p' "$file" || true
  else
    echo "groundx_utils.py not found in repo"
  fi
fi
echo
echo "== occurrences of 'tuple[' across repo =="
rg -n --hidden --line-number "tuple\[" -S || true
echo
echo "== occurrences of 'from __future__ import annotations' across repo =="
rg -n --hidden --line-number "from __future__ import annotations" -S || true
```

Length of output: 8751
Action required — Python version mismatch: pyproject specifies >=3.8.1 but code uses PEP 585 annotations (tuple[...])
Either bump requires-python to >=3.9 or make annotations 3.8-compatible. Recommended: change the annotation to typing.Tuple[...] for immediate 3.8 support.
- Files to update:
- groundX-doc-pipeline/pyproject.toml — current: requires-python = ">=3.8.1"
- groundX-doc-pipeline/groundx_utils.py — return annotation uses tuple[Dict[str, Any], bool]
Suggested minimal fix (change typing import + return annotation):
```diff
--- a/groundX-doc-pipeline/groundx_utils.py
+++ b/groundX-doc-pipeline/groundx_utils.py
@@
-from typing import Any, Dict, Optional
+from typing import Any, Dict, Optional, Tuple
@@
-def process_document(gx: GroundX, bucket_id: str, file_to_process: Any, file_path: str) -> tuple[Dict[str, Any], bool]:
+def process_document(gx: GroundX, bucket_id: str, file_to_process: Any, file_path: str) -> Tuple[Dict[str, Any], bool]:
```

Notes: I found tuple[...] only in groundX-doc-pipeline/groundx_utils.py; other files either already use typing.Tuple or use PEP 563 elsewhere. Applying the patch above resolves the mismatch without raising the project Python minimum.
📝 Committable suggestion
In `pyproject.toml` (unchanged):

```toml
requires-python = ">=3.8.1"
```

In `groundx_utils.py`:

```python
from typing import Any, Dict, Optional, Tuple

def process_document(gx: GroundX, bucket_id: str, file_to_process: Any, file_path: str) -> Tuple[Dict[str, Any], bool]:
```
Copy `.env.example` to `.env` and configure the following environment variables:

```
GROUNDX_API_KEY=your_groundx_api_key_here
```
🛠️ Refactor suggestion
Document all required environment variables and specify fence language
The README only lists GROUNDX_API_KEY, but the evaluation flow requires OPENAI_API_KEY and optionally COMET_API_KEY. Also, markdownlint flags MD040 for this fence (missing language).
Apply this diff to fix both:
````diff
-Copy `.env.example` to `.env` and configure the following environment variables:
-
-```
-GROUNDX_API_KEY=your_groundx_api_key_here
-```
+Copy `.env.example` to `.env` and configure the following environment variables:
+
+```dotenv
+GROUNDX_API_KEY=your_groundx_api_key_here
+# Required for GPT-4o comparisons in evaluation_geval.py
+OPENAI_API_KEY=your_openai_api_key_here
+# Optional: needed only if you want Opik/Comet dataset logging
+COMET_API_KEY=your_comet_opik_api_key_here
+```
````
+```🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
25-25: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In groundX-doc-pipeline/README.md around lines 23 to 27, the env instructions
only list GROUNDX_API_KEY and use an unlabelled code fence; update the section
to use a fenced block with language "dotenv" and document required/optional
variables by adding OPENAI_API_KEY (required for GPT-4o comparisons in
evaluation_geval.py) and COMET_API_KEY (optional, for Opik/Comet dataset
logging), and include brief comments for each variable so readers know which are
required vs optional.
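A startup check along these lines makes the required/optional split explicit. The variable names mirror the README, while the helper function itself is a hypothetical sketch:

```python
# Names mirror the README; which are required vs optional follows this review.
REQUIRED = ("GROUNDX_API_KEY", "OPENAI_API_KEY")
OPTIONAL = ("COMET_API_KEY",)


def missing_required(env):
    """Return the required variable names absent (or empty) in the mapping."""
    return [name for name in REQUIRED if not env.get(name)]


def present_optional(env):
    """Return the optional variable names that are actually set."""
    return [name for name in OPTIONAL if env.get(name)]


# Normally you would pass os.environ; a literal mapping keeps the sketch testable.
print(missing_required({"GROUNDX_API_KEY": "gx-123"}))  # → ['OPENAI_API_KEY']
```

Failing fast on the result of `missing_required(os.environ)` gives users a clearer error than a mid-run authentication failure.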
```python
import argparse
import mimetypes
from pathlib import Path
from typing import List, Dict, Any
```
🛠️ Refactor suggestion
Use Any in type hints or remove the import to satisfy Ruff F401
Currently, Any is imported but unused. Either remove it or use it to strengthen typings for invoice_data.
```diff
-from typing import List, Dict, Any
+from typing import List, Dict, Any
@@
-    invoice_data: List[Dict] = []
+    invoice_data: List[Dict[str, Any]] = []
```

This keeps Any and improves type specificity.
📝 Committable suggestion
```python
from typing import List, Dict, Any

invoice_data: List[Dict[str, Any]] = []
```
🧰 Tools
🪛 Ruff (0.12.2)
4-4: typing.Any imported but unused
Remove unused import: typing.Any
(F401)
🤖 Prompt for AI Agents
In groundX-doc-pipeline/run_evaluation_cli.py around line 4, the import "Any" is
currently unused which triggers Ruff F401; either remove "Any" from the import
list or apply it to the invoice_data typing. To fix, annotate invoice_data where
it's defined/used (e.g., use Dict[str, Any] or List[Dict[str, Any]] as
appropriate) so Any is used, or delete ", Any" from the typing import if you
prefer no changes to annotations.
```python
        except Exception as e:
            print(f"Failed to process {path.name}: {e}")
            return
```
💡 Verification agent
🧩 Analysis chain
Don’t abort the entire run on a single file failure
Failing one invoice currently returns early, aborting the run. Prefer continuing and reporting what was processed.
```diff
-        except Exception as e:
-            print(f"Failed to process {path.name}: {e}")
-            return
+        except Exception as e:
+            print(f"Failed to process {path.name}: {e}")
+            continue
+
+    if not invoice_data:
+        print("No invoices were processed successfully. Exiting.")
+        return
```

Please verify downstream expects at least one invoice; the averages computation already guards against empty lists.
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
FILE="groundX-doc-pipeline/run_evaluation_cli.py"
echo "=== Showing file content (if exists): $FILE ==="
if [ -f "$FILE" ]; then
  nl -ba "$FILE" | sed -n '1,400p'
else
  echo "File not found: $FILE"
fi
echo
echo "=== Searching for invoice-related symbols in repository under groundX-doc-pipeline ==="
rg -n --hidden --no-ignore-vcs "invoice_data|invoice|averag|average|mean" groundX-doc-pipeline || true
echo
echo "=== Showing 'return' and surrounding lines in the file ==="
if [ -f "$FILE" ]; then
  rg -n --context 3 "return" "$FILE" || true
fi
```

Length of output: 465
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
FILE="groundX-doc-pipeline/run_evaluation_cli.py"
echo "=== File exists? ==="
if [ -f "$FILE" ]; then
  echo "Yes"
else
  echo "No"
fi
echo
echo "=== Printing file with line numbers (up to 400 lines) ==="
if [ -f "$FILE" ]; then
  awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '1,400p' || true
else
  echo "File not found: $FILE"
fi
echo
echo "=== Showing lines 60-100 with numbers ==="
if [ -f "$FILE" ]; then
  awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '60,100p' || true
fi
echo
echo "=== Searching for invoice-related symbols under groundX-doc-pipeline ==="
if command -v rg >/dev/null 2>&1; then
  rg -n --hidden --no-ignore-vcs "invoice_data|invoice|averag|average|mean" groundX-doc-pipeline || true
else
  grep -RIn -E "invoice_data|invoice|averag|average|mean" groundX-doc-pipeline || true
fi
echo
echo "=== Showing 'return' occurrences with 3 lines of context in the file ==="
if [ -f "$FILE" ]; then
  grep -n -C3 "return" "$FILE" || true
fi
```

Length of output: 18699
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
FILE="groundX-doc-pipeline/evaluation_geval.py"
echo "=== File exists? ==="
if [ -f "$FILE" ]; then
  echo "Yes"
else
  echo "No"
  exit 0
fi
echo
echo "=== Printing file with line numbers (1-400) ==="
awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '1,400p'
echo
echo "=== Showing 'def run' with context (lines 200-340) ==="
awk '{printf "%4d: %s\n", NR, $0}' "$FILE" | sed -n '200,340p'
```

Length of output: 20242
Don't abort the whole run on a single file failure — continue and handle the zero-success case
Current code returns inside the per-file except handler, which stops processing remaining invoices. Change the except to continue and exit only if no invoices were processed successfully.
- File to change: groundX-doc-pipeline/run_evaluation_cli.py — except block around lines 79–81
Suggested patch:
```diff
-        except Exception as e:
-            print(f"Failed to process {path.name}: {e}")
-            return
+        except Exception as e:
+            print(f"Failed to process {path.name}: {e}")
+            continue
+
+    if not invoice_data:
+        print("No invoices were processed successfully. Exiting.")
+        return
```
+ returnNote: evaluation_geval.run and the average-score calculations already handle empty result lists safely, but it's clearer to bail out with a message if no invoices succeeded.
📝 Committable suggestion
```python
        except Exception as e:
            print(f"Failed to process {path.name}: {e}")
            continue

    if not invoice_data:
        print("No invoices were processed successfully. Exiting.")
        return
```
🤖 Prompt for AI Agents
In groundX-doc-pipeline/run_evaluation_cli.py around lines 79–81, the per-file
except handler currently returns on error which aborts the entire run; change
the except block to continue so a single file failure doesn't stop processing
remaining invoices, and after the processing loop check if the count/list of
successful invoice results is empty — if so, print an explanatory message and
exit with a non-zero status; otherwise proceed with the existing evaluation and
averaging logic.
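The loop shape the prompt describes can be sketched in isolation. Here `run` and the fake per-item work are stand-ins, not the real CLI code:

```python
# Skip-and-continue pattern: one bad item is logged and skipped, and the run
# bails out only when nothing succeeded.
def run(paths):
    results = []
    for path in paths:
        try:
            if "bad" in path:
                raise ValueError("unreadable file")  # simulated per-file failure
            results.append(path.upper())
        except Exception as e:
            print(f"Failed to process {path}: {e}")
            continue  # keep going instead of returning early
    if not results:
        print("No items were processed successfully. Exiting.")
        return None
    return results


print(run(["a.pdf", "bad.pdf", "c.pdf"]))  # → ['A.PDF', 'C.PDF']
```

The single failure no longer discards the work already done on the other items, which is the behavior the review asks for.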
```python
    ports=[8501],  # Default port for streamlit
    cpu=4,
    memory="2Gi",
    entrypoint=["streamlit", "run", "app.py"],
```
💡 Verification agent
🧩 Analysis chain
Entrypoint may not resolve app.py depending on container working directory.
If the container’s working directory isn’t the same folder as app.py, Streamlit won’t find the file. Either set a working directory on the Pod or reference the full/relative path explicitly.
Apply this diff to use an explicit path (since server.py is in groundX-doc-pipeline/, app.py is in the same directory):
```diff
-    entrypoint=["streamlit", "run", "app.py"],
+    entrypoint=["streamlit", "run", "groundX-doc-pipeline/app.py"],
```

Alternatively (if supported), set `Pod(working_dir="groundX-doc-pipeline")` and keep `"app.py"`.
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "=== rg: search for 'entrypoint' occurrences ==="
rg --hidden --glob '!.git' -n -S "entrypoint" || true
echo; echo "=== rg: search for 'streamlit' occurrences ==="
rg --hidden --glob '!.git' -n -S "streamlit" || true
echo; echo "=== rg: search for 'app.py' occurrences ==="
rg --hidden --glob '!.git' -n -S "app.py" || true
echo; echo "=== rg: search for 'server.py' occurrences ==="
rg --hidden --glob '!.git' -n -S "server.py" || true
# If any server.py files were found, print their contents
echo; echo "=== Showing candidate server.py files (up to 300 lines each) ==="
for f in $(rg --hidden --glob '!.git' -S --files-with-matches "server.py" || true); do
  echo "---- $f ----"
  sed -n '1,300p' "$f" || true
done
# Show any files that contain an entrypoint line (with surrounding 3 lines)
echo; echo "=== Files with 'entrypoint' (showing 3 lines of context) ==="
rg --hidden --glob '!.git' -n -S -C 3 "entrypoint" || true
# Show locations of app.py files (print up to 200 lines of each)
echo; echo "=== Showing candidate app.py files (up to 200 lines each) ==="
for f in $(rg --hidden --glob '!.git' -S --files-with-matches "app.py" || true); do
  echo "---- $f ----"
  sed -n '1,200p' "$f" || true
done
```

Length of output: 209939
Fix: Streamlit entrypoint may not resolve app.py in the Pod
streamlit run app.py is a relative path; if the Pod's working directory isn't the folder containing app.py, Streamlit will fail to find the file. Update groundX-doc-pipeline/server.py to either use an explicit path or set the Pod working directory.
- File to change:
- groundX-doc-pipeline/server.py — line with entrypoint (currently line 16)
Suggested diffs (choose one):
Option A — explicit path:

```diff
-    entrypoint=["streamlit", "run", "app.py"],
+    entrypoint=["streamlit", "run", "groundX-doc-pipeline/app.py"],
```

Option B — set Pod working directory (preferred if Pod supports it):

```diff
     memory="2Gi",
+    working_dir="groundX-doc-pipeline",
     entrypoint=["streamlit", "run", "app.py"],
```

Prefer Option B when supported; otherwise apply Option A.
🤖 Prompt for AI Agents
In groundX-doc-pipeline/server.py around line 16, the entrypoint uses a relative
path ("streamlit run app.py") which may fail if the Pod's working directory is
different; update the file to either (preferred) set
working_dir="groundX-doc-pipeline" immediately above the entrypoint so Streamlit
can resolve "app.py" or (if the Pod runtime doesn't support working_dir) replace
the entrypoint with an explicit path "streamlit run
groundX-doc-pipeline/app.py"; make only one of these changes and keep the rest
of the config unchanged.
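Beyond the two options above, a third, hedged alternative is to resolve `app.py` relative to `server.py` itself inside Python, which works regardless of the container's cwd (this assumes the Beam `entrypoint` accepts an absolute path; the variable names here are illustrative):

```python
# Build the entrypoint from the location of this file rather than the cwd,
# so "app.py" resolves no matter where the container process is started.
from pathlib import Path

app_path = Path(__file__).resolve().parent / "app.py"
entrypoint = ["streamlit", "run", str(app_path)]
print(entrypoint[-1].endswith("app.py"))  # → True
```

This keeps the launcher self-locating, at the cost of computing the path at import time instead of relying on deployment configuration.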