- Write a solver function that solves the problem (see the sketch after this list)
  - A solver is all the scaffolding needed to evaluate, i.e. orchestration + prompts + results, etc.
  - This will also be useful when writing the full agent.
  - Fix the input and output format of the solver.
- Use the OpenAI evals framework to evaluate the solver.
  - Metrics need to be defined; initially the focus is just accuracy.
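As a concrete anchor for the solver's fixed input/output contract, here is a minimal Python sketch. `ScanInput`, `ScanResult`, and their fields are illustrative names, not an established schema, and the LLM call itself is stubbed out:

```python
from dataclasses import dataclass, field

@dataclass
class ScanInput:
    """Fixed input format for the solver (field names are illustrative)."""
    item_id: str
    code: str       # source snippet to scan
    language: str   # e.g. "python"

@dataclass
class ScanResult:
    """Fixed output format: one finding dict per suspected issue."""
    item_id: str
    findings: list = field(default_factory=list)  # e.g. [{"cwe": "CWE-89", "score": 0.9}]

def solve(item: ScanInput) -> ScanResult:
    """Solver = all the evaluation scaffolding: build the prompt,
    call the model, parse the response into findings.
    The model call is a placeholder in this sketch."""
    prompt = f"Scan this {item.language} code for vulnerabilities:\n{item.code}"
    _ = prompt  # placeholder for the LLM call + response parsing
    return ScanResult(item_id=item.item_id, findings=[])
```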
- Write an eval processor that can take the view JSON as input (sketch below)
  - Convert each eval item into a scan-specific item
  - Run it over multiple eval items
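One possible shape for the eval processor, building on the `ScanInput`/`solve` sketch above. It assumes the view JSON is a file holding a list of items with `id`, `code`, and `language` keys, which is a guess at the schema:

```python
import json

def load_view(path: str) -> list[dict]:
    """Read the view JSON; assumed to contain a list of eval items."""
    with open(path) as f:
        return json.load(f)

def to_scan_item(eval_item: dict) -> ScanInput:
    """Convert one eval item into a scan-specific item (key names assumed)."""
    return ScanInput(
        item_id=eval_item["id"],
        code=eval_item["code"],
        language=eval_item.get("language", "unknown"),
    )

def run_all(path: str) -> list[ScanResult]:
    """Run the solver over every eval item in the view."""
    return [solve(to_scan_item(e)) for e in load_view(path)]
```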
- Write a code scanner that can take a scan item as input and perform an LLM scan with it (sketch below)
  - Add OpenAI and Anthropic capability
  - Add batching
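A sketch of the dual-provider scan calls plus simple client-side batching. Both calls use the standard `openai` and `anthropic` Python SDK entry points (`chat.completions.create` and `messages.create`); the model names are examples, not a recommendation:

```python
from openai import OpenAI
import anthropic

def scan_openai(prompt: str, model: str = "gpt-4o") -> str:
    """LLM scan via the OpenAI chat completions API."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def scan_anthropic(prompt: str, model: str = "claude-3-5-sonnet-latest") -> str:
    """LLM scan via the Anthropic messages API."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def batched(items: list, size: int):
    """Yield items in fixed-size chunks (simple client-side batching)."""
    for i in range(0, len(items), size):
        yield items[i : i + size]
```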
- Add metric collection capability from eval runs (sketch below)
  - Token metrics
  - Accuracy metrics
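One way the metric collector could look. The token fields mirror the OpenAI usage object, and exact-match of the predicted vs. labeled CWE set is just one possible accuracy definition:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    correct: int = 0
    total: int = 0

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0

def record(metrics: RunMetrics, usage, predicted: set, expected: set) -> None:
    """Accumulate token usage and an exact-match accuracy signal.

    `usage` is assumed to expose prompt/completion token counts the way
    the OpenAI usage object does; Anthropic's usage object would need a
    small adapter (input_tokens/output_tokens).
    """
    metrics.prompt_tokens += usage.prompt_tokens
    metrics.completion_tokens += usage.completion_tokens
    metrics.total += 1
    metrics.correct += int(predicted == expected)
```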
- Script to generate a report of an eval run (sketch below)
  - Get the CWE categories that are unique and add labels accordingly
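A minimal report helper, assuming each result row carries a `cwe` key like `"CWE-79"`; a real report would also attach human-readable labels per category:

```python
from collections import Counter

def cwe_report(results: list[dict]) -> str:
    """Summarize an eval run by unique CWE category (schema assumed)."""
    counts = Counter(r["cwe"] for r in results)
    return "\n".join(f"{cwe}: {n} finding(s)" for cwe, n in counts.most_common())

# Example:
# print(cwe_report([{"cwe": "CWE-89"}, {"cwe": "CWE-89"}, {"cwe": "CWE-79"}]))
```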
- Create a processing pipeline for each eval type (see the sketch after this list)
  - Simple query
  - Batching and code-item segregation
  - Tagged + query
  - Categorization
    - Create broad functional areas and associated CWE issues to use in categorization
    - AI call: categorize code into these "possible" vulnerability buckets using an LLM. This is a pure categorization task.
  - The prompt should work in a chain-of-thought manner to detect, verify, and score issues.
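A sketch of the categorization step: a hand-maintained map from broad functional areas to CWE buckets, and a prompt that asks the model only to pick areas. Detection, verification, and scoring would live in a separate chain-of-thought prompt. The areas and CWE groupings shown are illustrative, not a vetted taxonomy:

```python
# Illustrative functional-area -> CWE buckets; real groupings would come
# from the project's own taxonomy.
BUCKETS = {
    "input handling": ["CWE-20", "CWE-79", "CWE-89"],
    "auth & sessions": ["CWE-287", "CWE-384"],
    "memory safety": ["CWE-119", "CWE-416", "CWE-787"],
}

CATEGORIZE_PROMPT = """You are a code triage assistant.
Given the code below, list which of these functional areas could plausibly
contain vulnerabilities: {areas}.
Respond with area names only, one per line. Do not analyze further;
this is a pure categorization step.

Code:
{code}
"""

def build_categorize_prompt(code: str) -> str:
    """Fill the categorization prompt with the bucket names and the code."""
    return CATEGORIZE_PROMPT.format(areas=", ".join(BUCKETS), code=code)
```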
- Run the eval for multiple eval sets (see the driver sketch below)
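A tiny hypothetical driver tying the pieces together; the eval-set paths are placeholders, and `run_all` comes from the eval-processor sketch above:

```python
# Placeholder eval-set paths; the real list would come from config.
EVAL_SETS = ["evals/set_a.json", "evals/set_b.json"]

for path in EVAL_SETS:
    results = run_all(path)  # scan every item in this eval set
    print(f"{path}: {len(results)} items scanned")
```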