Skip to content

Refactoring human evals#27

Merged
m2kulkarni merged 3 commits intosync-upstreamfrom
refactoring_human_evals
Feb 20, 2026
Merged

Refactoring human evals#27
m2kulkarni merged 3 commits intosync-upstreamfrom
refactoring_human_evals

Conversation

@charliemolony
Copy link
Collaborator

Refactored human evaluation scripts so uses same evaluator as eval method

@greptile-apps
Copy link

greptile-apps bot commented Feb 19, 2026

Greptile Summary

Refactored human evaluation scripts to use the shared HumanReplayEvaluator class, eliminating code duplication. Replaced boolean use_rnn/use_transformer config flags with a unified policy_architecture string throughout the codebase. Added support for Transformer architecture in evaluation.

Key changes:

  • evaluate_human_logs.py: Simplified from ~340 lines of custom rollout logic to ~200 lines using HumanReplayEvaluator
  • pufferlib/pufferl.py: Migrated from config["use_rnn"] and config["use_transformer"] to config.get("policy_architecture", "Recurrent")
  • pufferlib/ocean/benchmark/evaluator.py: Extended HumanReplayEvaluator.rollout() to support both Recurrent and Transformer architectures with proper state initialization
  • pufferlib/utils.py: Updated subprocess call to pass --policy-architecture parameter
  • Config and error message improvements

Issues found:

  • Duplicate if __name__ == "__main__" block in evaluate_human_logs.py
  • Duplicate dynamics_model key in args dict
  • Critical loop nesting bug in evaluator.py that multiplies total simulation steps by k_scenarios

Confidence Score: 2/5

  • Critical logic bug will cause incorrect evaluation results
  • The nested loop in evaluator.py lines 677-678 multiplies simulation steps by k_scenarios, causing the evaluation to run k_scenarios² steps instead of k_scenarios steps. Additionally, there are syntax errors (duplicate dict key, duplicate main block) that will cause runtime issues.
  • Pay close attention to pufferlib/ocean/benchmark/evaluator.py (critical loop bug) and evaluate_human_logs.py (syntax errors)

Important Files Changed

Filename Overview
evaluate_human_logs.py Refactored to use HumanReplayEvaluator, added Transformer support; has duplicate dict key and duplicate main block
pufferlib/ocean/benchmark/evaluator.py Added Transformer support and multi-scenario handling to HumanReplayEvaluator.rollout; loop nesting error multiplies steps by k_scenarios
pufferlib/pufferl.py Replaced use_rnn/use_transformer boolean flags with policy_architecture string config throughout
pufferlib/utils.py Added --policy-architecture Transformer argument to human replay evaluation subprocess call
pufferlib/config/ocean/adaptive.ini Changed num_agents to 756, disabled co-players, added policy_architecture = Transformer
pufferlib/ocean/drive/drive.py Enhanced error message to include actual values for agent count mismatch

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[evaluate_human_logs.py main] --> B[Parse args with policy_architecture]
    B --> C{Architecture Type?}
    C -->|Recurrent| D[Create Recurrent Policy]
    C -->|Transformer| E[Create Transformer Policy]
    D --> F[Load state_dict]
    E --> F
    F --> G[Create HumanReplayEvaluator]
    G --> H[Loop through num_rollouts]
    H --> I[Create fresh env]
    I --> J[evaluator.rollout]
    J --> K{Architecture?}
    K -->|Recurrent| L[Init LSTM state]
    K -->|Transformer| M[Init Transformer context]
    L --> N[Run simulation steps]
    M --> N
    N --> O[Collect metrics per scenario]
    O --> P[Return aggregated results]
    P --> H
    H --> Q[Aggregate all rollout results]
    Q --> R[Compute first/last scenario metrics]
    R --> S[Save JSON output]
Loading

Last reviewed commit: 0661e02

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +296 to 300
if __name__ == "__main__":
main()

if __name__ == "__main__":
main()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate if __name__ == "__main__" block

Suggested change
if __name__ == "__main__":
main()
if __name__ == "__main__":
main()

"control_mode": "control_vehicles",
"episode_length": scenario_length,
"report_all_scenarios": args_parsed.adaptive_driving_agent,
"dynamics_model" : "classic" ,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate key dynamics_model - already defined at line 150

Suggested change
"dynamics_model" : "classic" ,
"dynamics_model": args_parsed.dynamics_model,

Comment on lines +677 to +678
for scenario in range(k_scenarios):
for time_idx in range(self.sim_steps):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorrect loop nesting: self.sim_steps is already k_scenarios * scenario_length - init_steps (line 630), so looping through scenarios again multiplies total steps by k_scenarios

Suggested change
for scenario in range(k_scenarios):
for time_idx in range(self.sim_steps):
# Run through all steps (already accounts for all scenarios)
for time_idx in range(self.sim_steps):

Charlie Peter Molony and others added 2 commits February 19, 2026 10:58
@m2kulkarni m2kulkarni merged commit a947e1e into sync-upstream Feb 20, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants