A pipeline to map whale encounters to hydrophone audio.
Derived from PacificSoundDetectHumpbackSong, though not directly affiliated with MBARI, NOAA, or HappyWhale.

Stages:
- Input: When (and where*) to look for whale encounters on HappyWhale.
- Geometry Search: Query open-oceans/happywhale to find potential whale encounters.
  → Expected outputs: encounter ids, start and end times, and longitude and latitude.
- Retrieve Audio: Download audio from MBARI's Pacific Ocean Sound Recordings around the time of each encounter.
  → Expected outputs: audio array, start and end times, and encounter ids.
- Parse Audio: Break audio into non-overlapping segments with flagged frequency detections.
  → Expected outputs: cut audio array, detection intervals, and encounter ids.
- Classify Audio: Use NOAA and Google's humpback_whale model to classify the flagged segments.
  → Expected outputs: resampled audio, classification score array, and encounter ids.
- Postprocess Labels: Build clip intervals for each encounter for playback snippets.
  → Expected outputs: encounter ids, cut/resampled audio array, and aggregated classification score.
- Output: Map the whale encounter ids to the playback snippets.
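As a rough end-to-end view of how these stages hand data to each other, here is a minimal sketch with placeholder functions and dummy audio. None of the function names, signatures, or values below are the pipeline's real API; they only illustrate the stage boundaries listed above.

```python
# Minimal, self-contained sketch of the stage flow above. All function
# names, signatures, and dummy data are illustrative assumptions,
# not the pipeline's actual API.
import numpy as np

def geometry_search(start, end):
    # Placeholder: would query open-oceans/happywhale for encounters.
    return [{"encounter_id": "demo", "start": start, "end": end}]

def retrieve_audio(encounter):
    # Placeholder: would download MBARI's Pacific Ocean Sound Recordings.
    return np.zeros(16_000 * 60), 16_000  # one minute of silence at 16 kHz

def parse_audio(audio, sample_rate, segment_seconds=60):
    # Non-overlapping, fixed-length segments (frequency flagging omitted).
    step = sample_rate * segment_seconds
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def classify_audio(segment, sample_rate):
    # Placeholder: would call the humpback_whale model server.
    return 0.0

for encounter in geometry_search("2024-07-11", "2024-07-11"):
    audio, sample_rate = retrieve_audio(encounter)
    segments = parse_audio(audio, sample_rate)
    scores = [classify_audio(seg, sample_rate) for seg in segments]
    print(encounter["encounter_id"], max(scores))
```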
Create a virtual environment and install the required packages. We'll use conda for this, but you can use any package manager you prefer.
Since we're developing on an M1 machine, we'll need to set CONDA_SUBDIR to osx-arm64.
This step should be adapted based on the virtual environment you're using.
CONDA_SUBDIR=osx-arm64 conda create -n whale-speech python=3.11
conda activate whale-speech
pip install -r requirements.txt
On machines that don't need the CONDA_SUBDIR override:

conda create -n whale-speech python=3.11
conda activate whale-speech
pip install -r requirements.txt
To run the pipeline on Google Cloud Dataflow, you'll need to install the Google Cloud SDK. You can find the installation instructions here.
Make sure you authenticate with the account you're using and initialize the project you're working in.
gcloud auth login
gcloud init
For newly created projects, each of the services used will need to be enabled. This can be easily done in the console, or via the command line. For example:
gcloud services enable bigquery.googleapis.com
gcloud services enable dataflow.googleapis.com
gcloud services enable storage-api.googleapis.com
gcloud services enable run.googleapis.com
To run the pipeline and model server locally, you can use the make target local-run.
make local-run
This target starts by killing any model servers left over from previous runs (needed when a pipeline fails without tearing down the server, leaving the earlier call hanging). Then it starts the model server in the background and runs the pipeline.
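The pipeline's classification stage talks to the model server over HTTP via the inference URL. Below is a minimal sketch of such a call, assuming a hypothetical /predict endpoint on port 5000 and a JSON payload; the actual route, port, and payload shape may differ.

```python
# Hypothetical sketch of a call to a locally running model server.
# The endpoint path, port, and payload shape are assumptions for illustration.
import requests

def classify_segment(audio_samples, sample_rate,
                     inference_url="http://127.0.0.1:5000/predict"):
    payload = {"audio": list(audio_samples), "sample_rate": sample_rate}
    response = requests.post(inference_url, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()  # shape depends on the server's response format

if __name__ == "__main__":
    # One second of silence at 10 kHz, purely to exercise the request path.
    print(classify_segment([0.0] * 10_000, 10_000))
```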
To build and push the model server to your model registry (stored as an environment variable), you can use the following make target.
make build-push-model-server
This target builds the model server image and pushes it to the registry specified in the env.sh file.
The tag is a combination of the version set in the makefile and the last git commit hash.
This helps keep track of what is included in the image, and allows for easy rollback if needed.
The target fails if there are any uncommitted changes in the git repository.
The latest tag is only added to images deployed via GitHub Actions (GHA).
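The tagging scheme ("version plus last commit hash", refusing to build on a dirty tree) can be summarized as below. The version string is a placeholder standing in for the value set in the makefile; only the scheme mirrors the description above.

```python
# Sketch of the "version + last commit hash" tag described above.
# VERSION is a placeholder, not the project's real value.
import subprocess

VERSION = "0.1.0"  # assumed to mirror the version set in the makefile

def build_image_tag():
    # Refuse to tag if the working tree has uncommitted changes,
    # mirroring the make target's failure condition.
    dirty = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True, check=True
    ).stdout.strip()
    if dirty:
        raise RuntimeError("Uncommitted changes; commit or stash before building.")

    commit = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    return f"{VERSION}-{commit}"

print(build_image_tag())  # e.g. 0.1.0-a1b2c3d
```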
To run the pipeline on Google Cloud Dataflow, you can use the following make target.
make run-dataflow
Logging in the terminal will tell you the status of the pipeline, and you can follow the progress in the Dataflow console.
In addition to providing the inference URL and the filesystem to store outputs on, the definition of the above target also shows how to pass additional arguments to the pipeline run and how to request different resources for it.
Pipeline-specific parameters

You can configure all the parameters set in the config files directly when running the pipeline. The most important here are probably the start and end times for the initial search.
--start "2024-07-11" \
--end "2024-07-11" \
--offset 0 \
--margin 1800 \
--batch_duration 60
Note that when parameters share a name across different config sections, only the occurrence in the last section in the list is updated. Also, since these argparse parameters are added automatically, the behavior of boolean flags might be unexpected (passing the flag always adds a value of true).
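To make that caveat concrete, here is a minimal sketch of auto-generating argparse flags from config sections. The config contents and the plot flag are invented for illustration and are not the project's real config; the point is that a store_true flag can only ever set a boolean to true.

```python
# Sketch of auto-generating argparse flags from config sections.
# The config dict and the "plot" flag are illustrative, not the project's real config.
import argparse

config = {
    "search": {"start": "2024-07-11", "end": "2024-07-11"},
    "audio": {"offset": 0, "margin": 1800},
    "classify": {"batch_duration": 60, "plot": False},
}

parser = argparse.ArgumentParser()
for section, params in config.items():
    for name, default in params.items():
        if isinstance(default, bool):
            # Auto-added boolean flags use store_true, so passing the flag
            # always yields True; it cannot be switched back to False this way.
            parser.add_argument(f"--{name}", action="store_true", default=default)
        else:
            parser.add_argument(f"--{name}", type=type(default), default=default)
# Duplicate names across sections are not handled here; per the note above,
# the real pipeline only updates the occurrence in the last section.

args = parser.parse_args(["--start", "2024-07-11", "--margin", "900"])
print(args.start, args.margin, args.plot)
```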
Compute resources

The default compute resources are quite small and slow. To speed things up, you can request more workers and a larger machine type. For more on Dataflow resources, check out the docs.
--worker_machine_type=n1-highmem-8 \
--disk_size_gb=100 \
--num_workers=8 \
--max_num_workers=8 \
Note, you may need to configure IAM permissions to allow Dataflow Runners to access images in your Artifact Registry. Read more about that here.
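These flags correspond to standard Apache Beam / Dataflow pipeline options. Assuming the pipeline is an Apache Beam pipeline (as Python Dataflow jobs typically are), the same resources could be requested programmatically as sketched below; the project, region, and bucket names are placeholders.

```python
# Sketch of requesting the same Dataflow resources programmatically.
# Project, region, and bucket names are placeholders, not real values.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=your-gcp-project",            # placeholder
    "--region=us-central1",                  # placeholder
    "--temp_location=gs://your-bucket/tmp",  # placeholder
    "--worker_machine_type=n1-highmem-8",
    "--disk_size_gb=100",
    "--num_workers=8",
    "--max_num_workers=8",
])
# The options object would then be passed to beam.Pipeline(options=options).
```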
- HappyWhale
- open-oceans/happywhale
- NOAA and Google's humpback_whale model
- Google Cloud Console
- Monterey Bay Hydrophone MARS
- MBARI's Pacific Ocean Sound Recordings
- J. Ryan et al., "New Passive Acoustic Monitoring in Monterey Bay National Marine Sanctuary," OCEANS 2016 MTS/IEEE Monterey, Monterey, CA, USA, 2016, pp. 1-8, doi: 10.1109/OCEANS.2016.7761363.