Skip to content

Add PoC validation for Multimodal AI Eval Framework (#1226)#1334

Open
Spark960 wants to merge 1 commit intofoss42:mainfrom
Spark960:feat/ai-eval-poc
Open

Add PoC validation for Multimodal AI Eval Framework (#1226)#1334
Spark960 wants to merge 1 commit intofoss42:mainfrom
Spark960:feat/ai-eval-poc

Conversation

@Spark960
Copy link
Contributor

PR Description

Hi again,

Following up on the idea discussion thread (#1226) and the feedback from my initial idea submission (#1136) where a PoC was requested, I went ahead and built a full Proof of Concept.

I just want to say, building this was incredibly fun and it genuinely showed me how cool and necessary this project is. Getting to see the live evaluation pipeline actually work end-to-end was so insane. This PR updates my original idea document with the PoC results and architecture findings.

Short Summary of everything

  • I created a decoupled FastAPI + React pipeline that runs lm-eval safely in background threads.
  • The coolest part: I built a proxy middleware layer that intercepts lm-eval payloads and cleanses/sanitizes them. This completely solves the vendor specific schema crashes we see with strict APIs like Gemini and Groq (e.g: Gemini instantly throwing a 400 Bad Request if it sees a seed parameter). This proves we can make this tool truly vendor neutral!
  • Piped real-time execution logs from Python directly to the frontend using Server-Sent Events (SSE).

PoC Repository: https://github.com/Spark960/ai-eval
Demo: A gif demonstration of the live evaluation pipeline streaming via SSE is available in the PoC readme.

I'd love to know your thoughts and critique too :)
@animator and @ashitaprasad

Related Issues

Checklist

  • I have gone through the contributing guide
  • I have updated my branch and synced it with project main branch before making this PR
  • I am using the latest Flutter stable branch (run flutter upgrade and verify)
  • I have run the tests (flutter test) and all tests are passing (just updating idea doc)

Added/updated tests?

We encourage you to add relevant test cases.

  • Yes
  • No, and this is why: This PR only updates a Markdown document with GSoC PoC results. No application code was modified.

OS on which you have developed and tested the feature?

  • Windows
  • macOS
  • Linux

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants