Gemini 1.5 PRO latest + CEDARScript-G edit format by elifarley · Pull Request #1897 · Aider-AI/aider

elifarley · 2024-10-03T03:37:17Z

The new CEDARScript edit format looks promising: on the refactoring benchmark, it allowed Gemini 1.5 Flash to surpass Sonnet 3.5.

Here we're not using architect mode, but you can kinda say that Gemini is acting as an architect, and the edit format itself (CEDARScript) is acting as the editor.

Quick comparisons

Sonnet 3.5 + diff

- dirname: refac-claude-3.5-sonnet-diff-not-lazy
  model: claude-3.5-sonnet (diff)
  edit_format: diff
  pass_rate_1: 64.0
  percent_cases_well_formed: 76.4

Gemini 1.5 PRO + diff-fenced (leaderboard site)

- dirname: refac-gemini
  model: gemini/gemini-1.5-pro-latest
  edit_format: diff-fenced
  pass_rate_1: 49.4
  percent_cases_well_formed: 7.9

Gemini 1.5 PRO + diff-fenced (my own tests)

- dirname: 2024-10-05-00-43-21--diff-fenced-Gemini-Refactoring
  test_cases: 89
  model: gemini/gemini-1.5-pro-latest
  edit_format: diff-fenced
  commit_hash: 772710b-dirty
  pass_rate_1: 18.0
  pass_rate_2: 21.3
  pass_rate_3: 24.7
  percent_cases_well_formed: 34.8
  error_outputs: 180
  num_malformed_responses: 180
  num_with_malformed_responses: 58
  user_asks: 128
  lazy_comments: 2
  syntax_errors: 21
  indentation_errors: 93
  exhausted_context_windows: 0
  test_timeouts: 0
  command: aider --model gemini/gemini-1.5-pro-latest
  date: 2024-10-05
  versions: 0.57.2.dev
  seconds_per_case: 110.1
  total_cost: 28.2515

Gemini 1.5 PRO + CEDARScript

- dirname: 2024-10-19-22-48-07--cedarscript-0.3.1-refactoring-gemini1.5pro
  test_cases: 89
  model: gemini/gemini-1.5-pro-latest
  edit_format: cedarscript-g
  commit_hash: 4da1e9b-dirty
  pass_rate_1: 77.5
  percent_cases_well_formed: 86.5
  error_outputs: 337
  num_malformed_responses: 19
  num_with_malformed_responses: 12
  user_asks: 12
  lazy_comments: 0
  syntax_errors: 4
  indentation_errors: 3
  exhausted_context_windows: 0
  test_timeouts: 0
  command: aider --model gemini/gemini-1.5-pro-latest
  date: 2024-10-19
  versions: 0.59.2.dev
  seconds_per_case: 29.0
  total_cost: 26.2374

Gemini 1.5 Flash + CEDARScript

- dirname: 2024-10-20-00-33-27--cedarscript-0.3.1-refactoring-gemini1.5flash
  test_cases: 89
  model: gemini/gemini-1.5-flash-latest
  edit_format: cedarscript-g
  commit_hash: 4da1e9b-dirty
  pass_rate_1: 76.4
  percent_cases_well_formed: 94.4
  error_outputs: 403
  num_malformed_responses: 13
  num_with_malformed_responses: 5
  user_asks: 21
  lazy_comments: 0
  syntax_errors: 3
  indentation_errors: 5
  exhausted_context_windows: 0
  test_timeouts: 0
  command: aider --model gemini/gemini-1.5-flash-latest
  date: 2024-10-20
  versions: 0.59.2.dev
  seconds_per_case: 14.7
  total_cost: 0.6757

functional_Functional__conform_to_reference_input

diff-fenced

    "cost": 0.33188854999999995,
    "duration": 27.793912172317505,
    "test_timeouts": 0,
    "commit_hash": "772710b-dirty",
    "num_error_outputs": 2,
    "num_user_asks": 3,
    "num_exhausted_context_windows": 0,
    "num_malformed_responses": 2,
    "syntax_errors": 0,
    "indentation_errors": 3,
    "lazy_comments": 0,

cedarscript-g

    "cost": 0.18178265,
    "duration": 11.176445960998535,
    "test_timeouts": 0,
    "commit_hash": "772710b-dirty",
    "num_error_outputs": 0,
    "num_user_asks": 1,
    "num_exhausted_context_windows": 0,
    "num_malformed_responses": 0,
    "syntax_errors": 0,
    "indentation_errors": 0,
    "lazy_comments": 0,

See line count comparisons for some refactoring benchmark tasks.

Analysis: CEDARScript vs. Common Edit Formats in AI-Assisted Code Refactoring

The introduction of CEDARScript as an edit format for AI-assisted code refactoring has demonstrated an important leap in performance, particularly when used with Gemini 1.5 PRO and Gemini 1.5 Flash. This analysis compares CEDARScript against traditional diff-based edit formats, revealing striking improvements across multiple metrics.

Overall Performance:

CEDARScript has dramatically enhanced the performance of Gemini models in code refactoring tasks. When paired with Gemini 1.5 PRO, it achieved a 77.5% pass rate and 86.5% well-formed cases, outperforming both the leaderboard's diff-fenced results for the same model (49.4% pass rate, 7.9% well-formed cases) and the highly regarded Claude 3.5 Sonnet with diff (64.0% pass rate, 76.4% well-formed cases).

Most remarkably, the cost-effective Gemini 1.5 Flash model, when using CEDARScript, not only matched but surpassed the performance of Claude 3.5 Sonnet. With a 76.4% pass rate and an outstanding 94.4% well-formed cases, Gemini 1.5 Flash demonstrates that even a more affordable model can outperform top-tier competitors when equipped with the right tools. This breakthrough suggests that CEDARScript can level the playing field, enabling more accessible AI models to compete with and even exceed the capabilities of more expensive options in complex coding tasks.
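To make the comparison above easier to check, here is a minimal sketch that ranks the four configurations by pass rate and prints each one's gap to Sonnet 3.5 in percentage points. The numbers are copied from the "Quick comparisons" entries; the dictionary keys and the `gap_vs_baseline` helper are just illustrative names, not part of the benchmark tooling.

```python
# (pass_rate_1, percent_cases_well_formed) copied from "Quick comparisons".
results = {
    "claude-3.5-sonnet + diff": (64.0, 76.4),
    "gemini-1.5-pro + diff-fenced (leaderboard)": (49.4, 7.9),
    "gemini-1.5-pro + cedarscript-g": (77.5, 86.5),
    "gemini-1.5-flash + cedarscript-g": (76.4, 94.4),
}

baseline = "claude-3.5-sonnet + diff"


def gap_vs_baseline(name):
    """Return (pass-rate gap, well-formed gap) in percentage points."""
    pass_rate, well_formed = results[name]
    base_pass, base_well_formed = results[baseline]
    return round(pass_rate - base_pass, 1), round(well_formed - base_well_formed, 1)


# Sort configurations from highest to lowest pass rate.
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)

for name in ranking:
    print(name, results[name], gap_vs_baseline(name))
```

Gaps are reported in percentage points rather than relative percentages, since both metrics are already percentages.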

Code Quality and Accuracy:

Syntax Errors: CEDARScript reduced syntax errors from 21 to just 4 with Gemini 1.5 PRO, and to 3 with Gemini 1.5 Flash.
Indentation Errors: A dramatic decrease from 93 to 3 errors with Gemini 1.5 PRO, and 5 with Gemini 1.5 Flash.
Lazy Comments: Eliminated entirely across all CEDARScript tests.

These improvements suggest that CEDARScript enables AI models to produce more accurate, syntactically correct, and well-structured code modifications.
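Since all three of my runs used the same 89 test cases, the raw error counts can also be read as errors per test case. A rough sketch (the run labels and the `per_case` helper are names I made up for illustration):

```python
TEST_CASES = 89  # test_cases value reported by each of the three runs

# (syntax_errors, indentation_errors, lazy_comments) from the run summaries.
runs = {
    "pro + diff-fenced": (21, 93, 2),
    "pro + cedarscript-g": (4, 3, 0),
    "flash + cedarscript-g": (3, 5, 0),
}


def per_case(count, cases=TEST_CASES):
    """Average number of errors of one kind per test case."""
    return round(count / cases, 3)


rates = {
    name: tuple(per_case(count) for count in counts)
    for name, counts in runs.items()
}

for name, (syntax, indentation, lazy) in rates.items():
    print(f"{name}: syntax={syntax} indentation={indentation} lazy={lazy}")
```

With diff-fenced, the indentation figure comes out above one error per test case, which is why that metric stands out the most.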

Efficiency and Resource Utilization:

Examining the "functional_Functional__conform_to_reference_input" test case:

Cost: CEDARScript reduced costs by 45% (from $0.33 to $0.18).
Duration: Processing time decreased by 60% (from 27.8s to 11.2s).
User Interactions: Required user asks dropped from 3 to 1.

On a larger scale, CEDARScript with Gemini 1.5 PRO reduced the average time per case from 110.1 seconds to 29.0 seconds, a 73.7% improvement. Gemini 1.5 Flash further reduced this to 14.7 seconds, an 86.6% improvement over the original diff-fenced format.
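The percentages quoted in this section can be reproduced with a one-line formula. The sketch below uses the raw values from the JSON fragments and run summaries above; `percent_reduction` and the labels in `checks` are illustrative names only.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# (diff-fenced value, CEDARScript value) pairs taken from the data above.
checks = {
    "cost, single test case": (0.33188854999999995, 0.18178265),
    "duration, single test case": (27.793912172317505, 11.176445960998535),
    "seconds_per_case, PRO": (110.1, 29.0),
    "seconds_per_case, Flash": (110.1, 14.7),
}

for label, (before, after) in checks.items():
    print(f"{label}: {percent_reduction(before, after):.1f}%")
```

The single-test-case figures are rounded to whole percentages in the text, so expect small differences in the first decimal place.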

Robustness and Reliability:

While the number of error outputs increased with CEDARScript, the number of malformed responses decreased significantly:

Gemini 1.5 PRO: from 180 to 19 malformed responses
Gemini 1.5 Flash: further reduced to 13 malformed responses

This suggests that while CEDARScript runs produce more error outputs (337 with PRO and 403 with Flash, versus 180 with diff-fenced), far fewer of them are malformed responses, potentially indicating more precise error handling and feedback.
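One way to look at this trade-off is the number of malformed responses per 100 error outputs. This is a derived figure, not something the benchmark reports, and it assumes the two counters are comparable within a run (I have not verified how `error_outputs` is tallied). A minimal sketch with made-up helper names:

```python
# (error_outputs, num_malformed_responses) from the run summaries.
runs = {
    "pro + diff-fenced": (180, 180),
    "pro + cedarscript-g": (337, 19),
    "flash + cedarscript-g": (403, 13),
}


def malformed_per_100_errors(error_outputs, malformed):
    """Malformed responses per 100 error outputs (derived, unofficial)."""
    return round(malformed / error_outputs * 100, 1)


ratios = {name: malformed_per_100_errors(*counts) for name, counts in runs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio} malformed responses per 100 error outputs")
```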

Scalability and Cost-Effectiveness:

CEDARScript demonstrated impressive cost savings:

Gemini 1.5 PRO: Total cost reduced from $28.25 to $26.24 (7.1% savings)
Gemini 1.5 Flash: Dramatically reduced cost to $0.68 (97.6% savings compared to diff-fenced)

This cost reduction, combined with faster processing times, suggests the approach could scale well to larger, more complex refactoring tasks.
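Total cost alone hides how many tasks were actually solved, so here is a rough sketch of cost per test case and cost per passing case. Two assumptions to flag: the pass count is recovered by rounding `pass_rate * 89`, and I use the last pass rate each entry reports (`pass_rate_3` for diff-fenced, `pass_rate_1` for the CEDARScript runs), so the comparison is not strictly like-for-like. `cost_breakdown` is an illustrative helper, not part of the benchmark.

```python
TEST_CASES = 89

# (total_cost in USD, last reported pass rate in %) from the run summaries.
# Assumption: pass_rate_3 for diff-fenced, pass_rate_1 for CEDARScript runs.
runs = {
    "pro + diff-fenced": (28.2515, 24.7),
    "pro + cedarscript-g": (26.2374, 77.5),
    "flash + cedarscript-g": (0.6757, 76.4),
}


def cost_breakdown(total_cost, pass_rate, cases=TEST_CASES):
    """Derive pass count, cost per test case, and cost per passing case."""
    passed = round(pass_rate / 100 * cases)  # recover the integer pass count
    return {
        "passed": passed,
        "cost_per_case": total_cost / cases,
        "cost_per_pass": total_cost / passed,
    }


summary = {name: cost_breakdown(*values) for name, values in runs.items()}

for name, row in summary.items():
    print(
        f"{name}: {row['passed']} passed, "
        f"${row['cost_per_case']:.4f}/case, ${row['cost_per_pass']:.4f}/pass"
    )
```

Cost per passing case widens the gap for PRO: total cost drops only about 7%, but roughly three times as many cases pass.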

Model Comparison:

Gemini 1.5 Flash with CEDARScript showed a slightly lower pass rate (76.4% vs 77.5%) but a higher percentage of well-formed cases (94.4% vs 86.5%) compared to Gemini 1.5 PRO. The Flash model was also much cheaper and faster, making it an attractive option for many use cases.

Conclusion:

CEDARScript has shown significant improvements for AI-assisted code refactoring.

By improving cost, accuracy, efficiency, and reliability across different models, it addresses many of the challenges associated with traditional diff-based formats.
The consistent performance boost across various metrics indicates that CEDARScript could be an important enabler for AI models to handle complex code transformations more effectively.

These results could have positive implications for developer productivity, code quality, and the future of AI-assisted software development.

fry69 · 2024-10-05T07:51:56Z

What is the point of this PR? The coder does not exist in aider currently.

These numbers are at best for private preview interest, not for public disclosure on the aider website (IMHO).

elifarley · 2024-10-05T10:46:54Z

Ok, I'll make it a draft PR. Once a PR in Aider is created and merged, I can then make this PR ready for review once more.

fry69 · 2024-10-05T12:28:08Z

> Once a PR in Aider is created and merged, I can then make this PR ready for review once more.

I'll close this PR until that has happened.

Update edit_leaderboard.yml

23b1e65

elifarley marked this pull request as draft October 3, 2024 03:37

elifarley added 2 commits October 4, 2024 00:12

Improved CEDARScript grammar for strings

4bf42b8

Gemini with CEDARScript

82db714

elifarley changed the title from "Gemini 1.5 PRO latest with CEDARScript-G edit format" to "Gemini 1.5 PRO latest + CEDARScript-G edit format" Oct 4, 2024

elifarley added 3 commits October 4, 2024 22:41

Update refactor_leaderboard.yml

6fc2fc0

Update edit_leaderboard.yml

c9ef8e9

Merge branch 'main' into patch-1

581fa16

elifarley marked this pull request as ready for review October 5, 2024 00:03

elifarley marked this pull request as draft October 5, 2024 10:47

fry69 closed this Oct 5, 2024

elifarley wants to merge 6 commits into Aider-AI:main from elifarley:patch-1
