Add PoC validation for Multimodal AI Eval Framework (#1226) #1334
Open
Spark960 wants to merge 1 commit into foss42:main from
PR Description
Hi again,
Following up on the idea discussion thread (#1226) and the feedback from my initial idea submission (#1136) where a PoC was requested, I went ahead and built a full Proof of Concept.
I just want to say, building this was incredibly fun and it genuinely showed me how cool and necessary this project is. Getting to see the live evaluation pipeline actually work end-to-end was so insane. This PR updates my original idea document with the PoC results and architecture findings.
Short Summary
- Runs `lm-eval` safely in background threads.
- Intercepts `lm-eval` payloads and cleanses/sanitizes them. This completely solves the vendor-specific schema crashes we see with strict APIs like Gemini and Groq (e.g. Gemini instantly throws a 400 Bad Request if it sees a `seed` parameter). This proves we can make this tool truly vendor neutral!

PoC Repository: https://github.com/Spark960/ai-eval
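To make the sanitization idea concrete, here is a minimal sketch of the kind of payload-cleansing middleware described above. The names (`VENDOR_UNSUPPORTED`, `sanitize_payload`) and the Groq entry are illustrative assumptions, not the PoC's actual code; the `seed`-vs-Gemini case is the one mentioned in this PR.

```python
# Hypothetical sketch of the vendor-payload sanitizer described above.
# Only the Gemini/`seed` example comes from the PR; other entries are illustrative.

VENDOR_UNSUPPORTED = {
    "gemini": {"seed"},        # Gemini returns 400 Bad Request on unknown params
    "groq": {"logit_bias"},    # illustrative assumption, not confirmed
}

def sanitize_payload(payload: dict, vendor: str) -> dict:
    """Drop request parameters the target vendor's API is known to reject."""
    blocked = VENDOR_UNSUPPORTED.get(vendor, set())
    return {k: v for k, v in payload.items() if k not in blocked}

payload = {"model": "gemini-pro", "prompt": "2+2=", "seed": 42}
print(sanitize_payload(payload, "gemini"))
# → {'model': 'gemini-pro', 'prompt': '2+2='}
```

The design point is that the allow/deny logic lives in one table per vendor, so adding a new strict API means adding one entry rather than special-casing call sites.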
Demo: A gif demonstration of the live evaluation pipeline streaming via SSE is available in the PoC readme.
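For readers unfamiliar with SSE, the wire format the demo streams over can be sketched like this; the event shape and field names here are my own illustration, not the PoC's actual schema.

```python
# Minimal sketch of server-sent-events (SSE) framing for evaluation progress.
# Field names ("step", "total") are hypothetical, not from the PoC.
import json

def evaluation_events(total=3):
    """Yield SSE-formatted frames, one per completed evaluation step."""
    for step in range(1, total + 1):
        payload = json.dumps({"step": step, "total": total})
        yield f"data: {payload}\n\n"  # SSE frame: a 'data:' line plus a blank line

for frame in evaluation_events():
    print(frame, end="")
```

Each frame is plain text (`data: ...` followed by a blank line), which is what lets a browser or Flutter client consume live progress without polling.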
I'd love to know your thoughts and critique too :)
@animator and @ashitaprasad
Related Issues
Checklist
- Synced the `main` branch before making this PR
- Flutter upgraded (`flutter upgrade`) and verified
- Tests run (`flutter test`) and all tests are passing (just updating idea doc)

Added/updated tests?
We encourage you to add relevant test cases.
OS on which you have developed and tested the feature?