Refactoring human evals by charliemolony · Pull Request #27 · Emerge-Lab/Adaptive_Driving_Agent

charliemolony · 2026-02-19T14:39:28Z

Refactored the human evaluation scripts so that they use the same evaluator as the eval method.

greptile-apps · 2026-02-19T14:42:31Z

Greptile Summary

Refactored human evaluation scripts to use the shared HumanReplayEvaluator class, eliminating code duplication. Replaced boolean use_rnn/use_transformer config flags with a unified policy_architecture string throughout the codebase. Added support for Transformer architecture in evaluation.

Key changes:

evaluate_human_logs.py: Simplified from ~340 lines of custom rollout logic to ~200 lines using HumanReplayEvaluator
pufferlib/pufferl.py: Migrated from config["use_rnn"] and config["use_transformer"] to config.get("policy_architecture", "Recurrent")
pufferlib/ocean/benchmark/evaluator.py: Extended HumanReplayEvaluator.rollout() to support both Recurrent and Transformer architectures with proper state initialization
pufferlib/utils.py: Updated subprocess call to pass --policy-architecture parameter
Config and error message improvements
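As a rough illustration of the flag migration listed above, the sketch below resolves a single architecture string from a config dict. Only the `config.get("policy_architecture", "Recurrent")` lookup and the two architecture names come from the summary; the helper name `resolve_policy_architecture` and the validation step are assumptions added for illustration and are not taken from the PR.

```python
# Architecture names mentioned in the PR summary.
VALID_ARCHITECTURES = ("Recurrent", "Transformer")


def resolve_policy_architecture(config: dict) -> str:
    """Hypothetical helper: read the unified architecture string from a config."""
    # Same lookup and default as quoted in the summary for pufferlib/pufferl.py.
    arch = config.get("policy_architecture", "Recurrent")
    # Validation is an assumption of this sketch, not something the PR is known to do.
    if arch not in VALID_ARCHITECTURES:
        raise ValueError(
            f"Unknown policy_architecture {arch!r}; expected one of {VALID_ARCHITECTURES}"
        )
    return arch


default_arch = resolve_policy_architecture({})
transformer_arch = resolve_policy_architecture({"policy_architecture": "Transformer"})
```

A single string keeps the two former booleans from ever disagreeing, since a config can no longer set both `use_rnn` and `use_transformer` at once.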

Issues found:

Duplicate if __name__ == "__main__" block in evaluate_human_logs.py
Duplicate dynamics_model key in args dict
Critical loop nesting bug in evaluator.py that multiplies total simulation steps by k_scenarios

Confidence Score: 2/5

Critical logic bug will cause incorrect evaluation results
The nested loop in evaluator.py lines 677-678 multiplies the simulation steps by k_scenarios: self.sim_steps already spans all scenarios, so the rollout runs k_scenarios × self.sim_steps steps instead of self.sim_steps. Additionally, evaluate_human_logs.py contains a duplicate dict key and a duplicate main block. Both are valid Python, but the later dict entry silently overrides the earlier one and main() runs twice.
Pay close attention to pufferlib/ocean/benchmark/evaluator.py (critical loop bug) and evaluate_human_logs.py (duplicate dict key and duplicate main block)

Important Files Changed

Filename	Overview
evaluate_human_logs.py	Refactored to use `HumanReplayEvaluator`, added Transformer support; has duplicate dict key and duplicate main block
pufferlib/ocean/benchmark/evaluator.py	Added Transformer support and multi-scenario handling to `HumanReplayEvaluator.rollout`; loop nesting error multiplies steps by k_scenarios
pufferlib/pufferl.py	Replaced `use_rnn`/`use_transformer` boolean flags with `policy_architecture` string config throughout
pufferlib/utils.py	Added `--policy-architecture Transformer` argument to human replay evaluation subprocess call
pufferlib/config/ocean/adaptive.ini	Changed `num_agents` to 756, disabled co-players, added `policy_architecture = Transformer`
pufferlib/ocean/drive/drive.py	Enhanced error message to include actual values for agent count mismatch

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[evaluate_human_logs.py main] --> B[Parse args with policy_architecture]
    B --> C{Architecture Type?}
    C -->|Recurrent| D[Create Recurrent Policy]
    C -->|Transformer| E[Create Transformer Policy]
    D --> F[Load state_dict]
    E --> F
    F --> G[Create HumanReplayEvaluator]
    G --> H[Loop through num_rollouts]
    H --> I[Create fresh env]
    I --> J[evaluator.rollout]
    J --> K{Architecture?}
    K -->|Recurrent| L[Init LSTM state]
    K -->|Transformer| M[Init Transformer context]
    L --> N[Run simulation steps]
    M --> N
    N --> O[Collect metrics per scenario]
    O --> P[Return aggregated results]
    P --> H
    H --> Q[Aggregate all rollout results]
    Q --> R[Compute first/last scenario metrics]
    R --> S[Save JSON output]

Last reviewed commit: 0661e02

greptile-apps

6 files reviewed, 3 comments


greptile-apps · 2026-02-19T14:42:35Z

evaluate_human_logs.py

+if __name__ == "__main__":
+    main()

 if __name__ == "__main__":
    main()


duplicate if __name__ == "__main__" block

Suggested change

if __name__ == "__main__":
    main()

if __name__ == "__main__":
    main()
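To see why the duplicate guard matters, here is a minimal sketch that executes a stand-in script twice-guarded in the same way. The script text and the `calls` list are hypothetical and only mimic the shape of the flagged code. A repeated guard is legal Python, so nothing fails at import or parse time; the effect is that `main()` runs once per guard.

```python
# Stand-in for a script whose entry-point guard appears twice (hypothetical content).
SCRIPT = '''
def main():
    calls.append("evaluated")

if __name__ == "__main__":
    main()

if __name__ == "__main__":
    main()
'''

calls = []
# Execute the script body as if it were run directly (__name__ == "__main__").
exec(SCRIPT, {"__name__": "__main__", "calls": calls})

print(len(calls))  # 2: the whole evaluation would run twice
```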

greptile-apps · 2026-02-19T14:42:35Z

evaluate_human_logs.py

+            "control_mode": "control_vehicles",
+            "episode_length": scenario_length,
+            "report_all_scenarios": args_parsed.adaptive_driving_agent,
+            "dynamics_model" : "classic" ,


duplicate key dynamics_model - already defined at line 150

Suggested change

-            "dynamics_model" : "classic" ,
+            "dynamics_model": args_parsed.dynamics_model,
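The practical effect of a duplicated key can be shown with a small sketch. Python accepts a dict literal that repeats a key and keeps only the last value, so no error is raised. The function name `build_env_args`, the placeholder value `"some_other_model"`, and the assumption that the hard-coded entry comes after the argument-driven one are all illustrative and not taken from the PR.

```python
def build_env_args(dynamics_model: str) -> dict:
    """Hypothetical stand-in for the args dict built in evaluate_human_logs.py."""
    return {
        "dynamics_model": dynamics_model,   # value the caller asked for
        "control_mode": "control_vehicles",
        "dynamics_model": "classic",        # duplicate key: silently replaces the entry above
    }


env_args = build_env_args("some_other_model")
print(env_args["dynamics_model"])  # classic
print(len(env_args))               # 2
```

Because the override is silent, the command-line choice would be ignored without any warning at runtime; linters typically flag repeated keys in dict literals.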

greptile-apps · 2026-02-19T14:42:36Z

pufferlib/ocean/benchmark/evaluator.py

+        for scenario in range(k_scenarios):
+            for time_idx in range(self.sim_steps):


incorrect loop nesting: self.sim_steps is already k_scenarios * scenario_length - init_steps (line 630), so looping through scenarios again multiplies total steps by k_scenarios

Suggested change

-        for scenario in range(k_scenarios):
-            for time_idx in range(self.sim_steps):
+        # Run through all steps (already accounts for all scenarios)
+        for time_idx in range(self.sim_steps):
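The step-count arithmetic behind this comment can be checked with a short sketch. The formula `sim_steps = k_scenarios * scenario_length - init_steps` is the one quoted in the review comment; the numeric values and the two counting functions are hypothetical and chosen only to make the difference visible.

```python
def count_steps_nested(k_scenarios: int, sim_steps: int) -> int:
    """Count iterations of the flagged pattern: an outer scenario loop around sim_steps."""
    steps = 0
    for _scenario in range(k_scenarios):
        for _time_idx in range(sim_steps):
            steps += 1
    return steps


def count_steps_flat(sim_steps: int) -> int:
    """Count iterations of the suggested pattern: a single loop over sim_steps."""
    steps = 0
    for _time_idx in range(sim_steps):
        steps += 1
    return steps


# Illustrative values only; the real ones come from the evaluator's config.
k_scenarios, scenario_length, init_steps = 3, 91, 10
sim_steps = k_scenarios * scenario_length - init_steps  # 263, already spans all scenarios

nested = count_steps_nested(k_scenarios, sim_steps)  # 789
flat = count_steps_flat(sim_steps)                    # 263
```

With these values the nested version steps the environment three times as often as intended, which is the k_scenarios multiplier the comment describes.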

Refactoring human evals

0661e02

greptile-apps bot reviewed Feb 19, 2026

Charlie Peter Molony and others added 2 commits February 19, 2026 10:58

fixing re commit issues

aa9aeb5

small fixes

f374a3f

m2kulkarni merged commit a947e1e into sync-upstream Feb 20, 2026
10 checks passed

m2kulkarni merged 3 commits into sync-upstream from refactoring_human_evals
