
Using AgentLab with a custom BG benchmark #99

Open
imenelydiaker opened this issue Oct 31, 2024 · 2 comments

Comments

@imenelydiaker
Collaborator
When using AgentLab with a custom benchmark, I had to modify some code files. I would rather not have to modify them at all, and instead just pass a few extra arguments to the functions I use. Here is what I did, along with my suggestions:

The script (main.py) calls run_agents_on_benchmark(), which in turn calls get_benchmark_env_args() from the task_collection.py module. That function fetches the tasks for a given benchmark name, so I had to edit the file manually to register the task list of my custom benchmark:

...
elif benchmark_name == "my_benchmark":
    from my_benchmark import ALL_MY_BENCHMARK_TASK_IDS
    env_args_list = _make_env_args(ALL_MY_BENCHMARK_TASK_IDS, max_steps, n_repeat, rng)
else:
    raise ValueError(f"Unknown benchmark name: {benchmark_name}")

My suggestion is to add an additional argument tasks_list: list[AbstractBrowserTask] to the run_agents_on_benchmark() and get_benchmark_env_args() functions. When set, it would bypass the if/else conditions that fetch the tasks for a given benchmark name. This would also be valuable for running only a specific list of tasks, e.g. for testing or fast development:

def get_benchmark_env_args(
    benchmark_name: str = None,
    tasks_list: list[AbstractBrowserTask] = None,
    meta_seed=42,
    max_steps=None,
    n_repeat=None,
) -> list[EnvArgs]:
    # ... (rng is built from meta_seed, as in the existing code)
    # if an explicit task list is given, use it and bypass the benchmark-name lookup
    if tasks_list:
        return _make_env_args(tasks_list, max_steps, n_repeat, rng)
    elif benchmark_name is not None:
        # here the existing code that fetches the task list from the benchmark name
        ...
    else:
        raise ValueError(f"Unknown benchmark name: {benchmark_name}")
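
For example, one could then run only a hand-picked subset of tasks without touching task_collection.py (a sketch; ALL_MY_BENCHMARK_TASK_IDS is the list from the snippet above, and the argument values are just illustrative):

from my_benchmark import ALL_MY_BENCHMARK_TASK_IDS

# run a small subset of custom tasks, bypassing the benchmark-name lookup
env_args_list = get_benchmark_env_args(
    tasks_list=ALL_MY_BENCHMARK_TASK_IDS[:5],
    max_steps=10,
    n_repeat=3,
)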

Another function that needed to be updated is _get_benchmark_version() from reproducibility_utils.py, but I don't have a suggestion for that one.

@imenelydiaker
Collaborator Author

imenelydiaker commented Nov 1, 2024

A better solution would be to have a Benchmark class and let get_benchmark_env_args() accept it instead of benchmark_name. This would make it possible to build custom benchmarks as well as use the existing ones.

from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    tasks: list[AbstractBrowserTask]
    max_steps: int

The get_benchmark_env_args function would then be much lighter:

def get_benchmark_env_args(
    benchmark: Benchmark, meta_seed=42, n_repeat=None
) -> list[EnvArgs]:
    # rng is built from meta_seed, as in the existing code
    return _make_env_args(benchmark.tasks, benchmark.max_steps, n_repeat, rng)
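
Usage with a custom benchmark could then look like this (a minimal sketch; MyTask1 and MyTask2 are hypothetical AbstractBrowserTask subclasses):

my_benchmark = Benchmark(
    name="my_benchmark",
    tasks=[MyTask1, MyTask2],  # hypothetical custom tasks
    max_steps=15,
)
env_args_list = get_benchmark_env_args(my_benchmark, n_repeat=5)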

We can also imagine having a benchmark registry for all benchmarks provided by browsergym (just a list or dict we store somewhere with Benchmark objects).
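
Such a registry could simply be a dict keyed by benchmark name (a sketch; the task lists and max_steps values below are placeholders, not the actual browsergym settings):

BENCHMARK_REGISTRY: dict[str, Benchmark] = {
    "miniwob": Benchmark(name="miniwob", tasks=ALL_MINIWOB_TASKS, max_steps=10),
    "my_benchmark": Benchmark(name="my_benchmark", tasks=ALL_MY_BENCHMARK_TASKS, max_steps=15),
}

def get_benchmark(name: str) -> Benchmark:
    # look up a registered Benchmark object by name
    if name not in BENCHMARK_REGISTRY:
        raise ValueError(f"Unknown benchmark name: {name}")
    return BENCHMARK_REGISTRY[name]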

@gasse
Collaborator

gasse commented Nov 1, 2024

Hi @imenelydiaker , those are very good points!

Things have been moving fast the last few weeks on that side. We now have a Benchmark class in browsergym which seems to address all of the points you mention here. It's been integrated into AgentLab, but maybe just in the dev branch? You can have a look here:

https://github.com/ServiceNow/BrowserGym/blob/908d0ac319d51c5d4d8266187f00a5a3a5c79991/browsergym/experiments/src/browsergym/experiments/benchmark/configs.py#L93-L107

if isinstance(self.benchmark, str):
    self.benchmark = bgym.DEFAULT_BENCHMARKS[self.benchmark]()
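
So a caller can pass either a benchmark name or a Benchmark object; with a name, the corresponding factory is looked up in bgym.DEFAULT_BENCHMARKS and called, e.g. (a sketch; the "miniwob" key is an assumption about the registry's contents, and bgym refers to the same module as in the snippet above):

benchmark = bgym.DEFAULT_BENCHMARKS["miniwob"]()  # assumed key; returns a Benchmark instance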
