A zero-shot audio and text bioacoustic benchmark dataset. This dataset is described in our companion paper: Robinson et al. 2025.
The dataset is available on Huggingface, where we provide a detailed description of the dataset and its components.
The captioning task needs a Java 8 runtime for the evaluation code to run. If you are on macOS, visit https://www.java.com/en/download/ and install the appropriate version (Java SE 8). If you are on a UNIX machine, visit https://openjdk.org/install/ and find your distribution.
You can also do:

```
sudo apt update
sudo apt install openjdk-8-jre
```
If you already have a more recent version of Java, you can run

```
sudo update-alternatives --config java
```

and then select the Java 8 version.
If you are on Windows, visit: https://www.java.com/en/download/manual.jsp and download the appropriate installer.
If Java is not installed, the captioning task will be skipped during evaluation.
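To check programmatically whether a Java runtime is on your PATH, a minimal sketch:

```python
import shutil

# The captioning evaluation needs a Java runtime; if none is found on PATH,
# the captioning task will be skipped.
if shutil.which("java") is None:
    print("Java not found: captioning evaluation will be skipped.")
```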
We recommend installing a virtual environment to avoid package conflicts. Then:

- Clone this repo.
- `cd beans-zero`
- `pip install -e .`
There are three types of tasks in the BEANS-Zero dataset:

- `classification`: in a classification task, the ground truth label is a single class label, e.g. a species name or an environmental sound (see the next section to get info on the labels). Your model's prediction should be a single label.
- `detection`: in a detection task, the label can be a comma-separated list of class labels or a single label.
  - For example, the label can be "species1, species2, species3" or "species1", depending on how many species are detected in the audio.
  - Your model should output a comma-separated list (or a single item) of class labels as a string. NOTE: whitespace will be ignored, but the comparison is case-sensitive.
- `captioning`: in a captioning task, the label is a string that describes the content of the audio. Your model's prediction should be a string that is similar to the label. We use SPICE and CIDEr scores to evaluate captioning performance.
NOTE: You can check what labels are expected for each dataset using the `beans-info <dataset_name>` command (see below).
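For instance, a detection prediction for multiple detected classes can be assembled like this (a minimal sketch; the class names are hypothetical):

```python
# Hypothetical classes detected in one audio example
detected = ["species1", "species2", "species3"]

# Join into the comma-separated string format expected for detection tasks.
# Whitespace is ignored during evaluation, but case must match the labels.
prediction = ", ".join(detected)
print(prediction)  # species1, species2, species3
```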
- Each example in the dataset contains an `output` field, which is the ground truth label(s) for that example.
- Each example contains the `audio` as a list of floats.
- The `instruction_text` field contains the instruction text for that example.

All fields are described on Huggingface.
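Schematically, a single example looks like this (a sketch with illustrative values, not real data):

```python
# Illustrative BEANS-Zero example (all values are made up)
example = {
    "audio": [0.01, -0.02, 0.03],                        # waveform as a list of floats
    "instruction_text": "Which species is vocalizing?",  # hypothetical instruction
    "output": "species1",                                # ground-truth label(s)
}
```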
This repo offers several tools:

- A CLI tool to:
  - fetch the dataset from Huggingface: `beans-fetch`
  - list all component datasets: `beans-info`, or get info on a particular dataset: `beans-info <dataset_name>`
  - compute evaluation metrics if you already have a predictions file: `beans-evaluate`
- A `run_benchmark` function to run your model on the dataset and get evaluation metrics directly.
Run `python cli.py --help` to see the available commands and a description of each.
Make sure you have installed beans-zero and have activated the virtual environment. The following command will download the dataset to your local machine. See the next section for more details on how to use the dataset.

```
beans-fetch
```
If you would first like to explore the BEANS-Zero dataset, you can use this code snippet.
```python
import numpy as np
from datasets import load_dataset

# streaming=True will not download the dataset; instead, every example is fetched
# on the fly. This can be slower but saves disk space.
ds = load_dataset("EarthSpeciesProject/BEANS-Zero", split="test", streaming=True)
# Or without streaming (full download, about 100 GB), much faster, same as running `beans-fetch`:
# ds = load_dataset("EarthSpeciesProject/BEANS-Zero", split="test")

# get information on the columns in the dataset
print(ds)

# iterate over the samples
for sample in ds:
    break

# each sample is a dict
print(sample.keys())

# the component datasets are (requires the non-streaming dataset)
dataset_names, dataset_sample_counts = np.unique(ds["dataset_name"], return_counts=True)

# if you want to select a subset of the data (also non-streaming only)
idx = np.where(np.array(ds["dataset_name"]) == "esc50")[0]
esc50 = ds.select(idx)
print(esc50)

# the audio can be accessed as a list of floats, converted to a numpy array
audio = np.array(sample["audio"])
print(audio.shape)

# the instruction text is a string
instruction_text = sample["instruction_text"]
print(instruction_text)
```
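Note that column access (`ds["dataset_name"]`) and `select` only work on the fully downloaded (non-streaming) dataset. In streaming mode you can filter on the fly instead, for example:

```python
# In streaming mode, pick out a component dataset by filtering examples lazily
esc50_stream = ds.filter(lambda ex: ex["dataset_name"] == "esc50")
```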
Your predictions file should be a csv or a jsonl (JSON lines, oriented as 'records') file with the following fields:

- `prediction`: the predicted string output of your model for each example
- `label`: the expected output (just copy the `output` field from the dataset for that example)
- `dataset_name`: the name of the dataset (e.g. 'esc50' or 'unseen-family-sci'; again, just copy the `dataset_name` field from the dataset)
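For example, you could write a jsonl predictions file like this (a minimal sketch; the records are illustrative):

```python
import json

# Illustrative prediction records; in practice, build one per dataset example
rows = [
    {"prediction": "dog", "label": "dog", "dataset_name": "esc50"},
    {"prediction": "species1, species2", "label": "species1", "dataset_name": "unseen-family-sci"},
]

# Write one JSON object per line (the 'records'-oriented jsonl format)
with open("predictions.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```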
Make sure you have installed beans-zero and have activated the virtual environment, then run:

```
beans-evaluate /path/to/your/predictions_file.jsonl /path/to/save/metrics.json
```
The output metrics per dataset component will be saved in the `metrics.json` file.
Currently, the supported output file format is json.
We provide a `run_benchmark` function that you can use to run your model on the dataset and get evaluation metrics.
- Import the `run_benchmark` function into your model file.
- Make sure your model class / prediction function is a Callable. It will be called with a dictionary containing the following useful fields (amongst others):
  - `audio`: the audio input as a list of floats.
  - `instruction_text`: the instruction text as a string.

  NOTE: If the `batched` argument in `run_benchmark` is set to True, the `audio` field will be a list of lists of floats, and the `instruction_text` field will be a list of strings with `len(instruction_text) == batch_size`.
- Call the `run_benchmark` function with your model class / prediction function as the value of the `model` argument.
Here is an example of how to use the `run_benchmark` function:

```python
import torch

from beans_zero.benchmark import run_benchmark


class MyModel:
    """An example model class."""

    def __init__(self):
        # Initialize your model and text tokenizer here, e.g.
        # self.model = ... and self.tokenizer = ...
        pass

    def predict(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        """A simple forward pass."""
        # Run your model on the input tensors, e.g.
        # return self.model(audio, text)
        raise NotImplementedError

    def __call__(self, example: dict) -> str:
        """A simple call method."""
        # Extract the audio and text from the example
        audio = torch.Tensor(example["audio"])
        instruction = self.tokenizer(example["instruction_text"])
        # Perform inference
        prediction = self.predict(audio, instruction)
        # Convert the prediction to a string
        prediction = self.tokenizer.decode(prediction)
        return prediction


# Create an instance of your model
my_model = MyModel()

# path_to_dataset can be a local folder if you've downloaded the dataset
# somewhere else on your machine
path_to_dataset = "EarthSpeciesProject/BEANS-Zero"
```
Now, run the benchmark.
```python
run_benchmark(
    model=my_model,
    path_to_dataset=path_to_dataset,
    streaming=False,
    batched=False,
    batch_size=0,
    output_path="metrics.json",
)
```
NOTE: `streaming=True` will not download the dataset; instead, every example is fetched on the fly, which is slower but saves disk space. Set `batched=True` if your model can handle batches of examples of size `batch_size`.
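If you enable batching, your callable receives lists rather than single values. A minimal sketch, assuming it should return one prediction string per example (`my_single_example_predict` is a hypothetical per-example inference helper, and the list return type is our assumption rather than a documented contract):

```python
def batched_predict(example: dict) -> list[str]:
    # With batched=True, example["audio"] is a list of lists of floats and
    # example["instruction_text"] is a list of strings of length batch_size.
    predictions = []
    for audio, text in zip(example["audio"], example["instruction_text"]):
        # my_single_example_predict is a hypothetical helper for one example
        predictions.append(my_single_example_predict(audio, text))
    return predictions
```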
The output json file will contain metrics organized by dataset component. For example, the output file will look like this:
```json
{
    "esc50": {
        "Accuracy": 0.8075,
        "Precision": 0.8695815295815295,
        "Recall": 0.8075,
        "F1 Score": 0.8018654450498574,
        "Top-1 Accuracy": 0.8075
    },
    "watkins": {
        "Accuracy": 0.8023598820058997,
        "Precision": 0.781733382201888,
        "Recall": 0.8023598820058997,
        "F1 Score": 0.7791986262473836,
        "Top-1 Accuracy": 0.7315634218289085
    },
    ... (other datasets)
}
```
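To inspect the saved metrics programmatically, a minimal sketch (assuming the structure shown above):

```python
import json

# Load the metrics produced by beans-evaluate / run_benchmark
with open("metrics.json") as f:
    metrics = json.load(f)

# Print a per-dataset summary using the keys from the example above
for dataset_name, scores in metrics.items():
    print(f"{dataset_name}: accuracy={scores['Accuracy']:.4f}")
```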
If you use this dataset in your research, please cite the following paper:
@inproceedings{robinson2025naturelm,
title = {NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics},
author = {David Robinson and Marius Miron and Masato Hagiwara and Olivier Pietquin},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
year = {2025},
url = {https://openreview.net/forum?id=hJVdwBpWjt}
}