
Using AgentLab with a custom BG benchmark #99

Open
imenelydiaker opened this issue Oct 31, 2024 · 2 comments

Comments

@imenelydiaker
Collaborator
When using AgentLab with a custom benchmark, I had to modify some code files. I would rather not have to modify them at all, and instead just pass a few extra arguments to the functions I use. Here is what I did, along with my suggestions:

The script (main.py) calls run_agents_on_benchmark(), which in turn calls get_benchmark_env_args() from the task_collection.py module. That function fetches the tasks for a given benchmark name, so I had to edit the file manually to register the task list of my custom benchmark:

...
elif benchmark_name == "my_benchmark":
    from my_benchmark import ALL_MY_BENCHMARK_TASK_IDS
    env_args_list = _make_env_args(ALL_MY_BENCHMARK_TASK_IDS, max_steps, n_repeat, rng)
else:
    raise ValueError(f"Unknown benchmark name: {benchmark_name}")

My suggestion is to add an additional argument tasks_list: list[AbstractBrowserTask] to the run_agents_on_benchmark() and get_benchmark_env_args() functions. When set, it would bypass the if/else conditions that fetch the tasks for a given benchmark name. This would also be valuable for running only a specific list of tasks, e.g. for testing or fast development:

def get_benchmark_env_args(
    benchmark_name: str = None,
    tasks_list: list[AbstractBrowserTask] = None,
    meta_seed=42,
    max_steps=None,
    n_repeat=None,
) -> list[EnvArgs]:
    # ... (rng is built from meta_seed, as in the existing code)
    # if an explicit task list is given, use it and bypass the benchmark-name lookup
    if tasks_list:
        return _make_env_args(tasks_list, max_steps, n_repeat, rng)
    elif benchmark_name is not None:
        # here the existing code that fetches the task list from the benchmark name
        ...
    else:
        raise ValueError(f"Unknown benchmark name: {benchmark_name}")
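
For example, one could then run only a hand-picked subset of tasks without touching task_collection.py (a sketch; ALL_MY_BENCHMARK_TASK_IDS is the list from the snippet above, and the argument values are just illustrative):

from my_benchmark import ALL_MY_BENCHMARK_TASK_IDS

# run a small subset of custom tasks, bypassing the benchmark-name lookup
env_args_list = get_benchmark_env_args(
    tasks_list=ALL_MY_BENCHMARK_TASK_IDS[:5],
    max_steps=10,
    n_repeat=3,
)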

Another function that needed to be updated is _get_benchmark_version() from reproducibility_utils.py, but I don't have a suggestion for that one.

@imenelydiaker
Collaborator Author

imenelydiaker commented Nov 1, 2024

A better solution would be to have a Benchmark class and let get_benchmark_env_args() accept it instead of benchmark_name. This would make it possible to build custom benchmarks as well as use the existing ones.

from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    tasks: list[AbstractBrowserTask]
    max_steps: int

The get_benchmark_env_args function would then be much lighter:

def get_benchmark_env_args(
    benchmark: Benchmark, meta_seed=42, n_repeat=None
) -> list[EnvArgs]:
    # rng is built from meta_seed, as in the existing code
    return _make_env_args(benchmark.tasks, benchmark.max_steps, n_repeat, rng)
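
Usage with a custom benchmark could then look like this (a minimal sketch; MyTask1 and MyTask2 are hypothetical AbstractBrowserTask subclasses):

my_benchmark = Benchmark(
    name="my_benchmark",
    tasks=[MyTask1, MyTask2],  # hypothetical custom tasks
    max_steps=15,
)
env_args_list = get_benchmark_env_args(my_benchmark, n_repeat=5)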

We can also imagine having a benchmark registry for all benchmarks provided by browsergym (just a list or dict we store somewhere with Benchmark objects).
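
Such a registry could simply be a dict keyed by benchmark name (a sketch; the task lists and max_steps values below are placeholders, not the actual browsergym settings):

BENCHMARK_REGISTRY: dict[str, Benchmark] = {
    "miniwob": Benchmark(name="miniwob", tasks=ALL_MINIWOB_TASKS, max_steps=10),
    "my_benchmark": Benchmark(name="my_benchmark", tasks=ALL_MY_BENCHMARK_TASKS, max_steps=15),
}

def get_benchmark(name: str) -> Benchmark:
    # look up a registered Benchmark object by name
    if name not in BENCHMARK_REGISTRY:
        raise ValueError(f"Unknown benchmark name: {name}")
    return BENCHMARK_REGISTRY[name]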

@gasse
Collaborator

gasse commented Nov 1, 2024

Hi @imenelydiaker , those are very good points!

Things have been moving fast the last few weeks on that side. We now have a Benchmark class in browsergym which seems to address all of the points you mention here. It's been integrated into AgentLab, but maybe just in the dev branch? You can have a look here:

https://github.com/ServiceNow/BrowserGym/blob/908d0ac319d51c5d4d8266187f00a5a3a5c79991/browsergym/experiments/src/browsergym/experiments/benchmark/configs.py#L93-L107

if isinstance(self.benchmark, str):
    self.benchmark = bgym.DEFAULT_BENCHMARKS[self.benchmark]()
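
So a caller can pass either a benchmark name or a Benchmark object; with a name, the corresponding factory is looked up in bgym.DEFAULT_BENCHMARKS and called, e.g. (a sketch; the "miniwob" key is an assumption about the registry's contents, and bgym refers to the same module as in the snippet above):

benchmark = bgym.DEFAULT_BENCHMARKS["miniwob"]()  # assumed key; returns a Benchmark instance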
