This repository contains the pipeline to obtain and process ONC data from hydrophones. The dataset generation pipeline is divided into steps. Each step can be performed separately, but some steps depend on the output of previous ones.
Attention: It is HIGHLY recommended to have a large amount of storage available (at least 2 TB) to download the ONC WAV files.
A Dockerfile is available in this repository to simplify the environment setup. Since the requirements are only Python dependencies, a virtual environment can also be used.
To install the dependencies, run the following commands:
pip install -r requirements.txt
pip install -r requirements-dev.txt
A brief pipeline description can be found below, separating the development into 13 steps:
- Query the ONC server for the deployments of the chosen hydrophones;
- Read the following information: recording begin, recording end, latitude, longitude, depth, and location;
- Save the information into a `.csv` file.
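Writing the deployment metadata out can be sketched with the standard-library `csv` module. The field names below are illustrative, not the exact keys returned by the ONC API:

```python
import csv

# Illustrative column names -- the real ONC deployment records may use
# different keys for recording begin/end, coordinates, depth, and location.
FIELDS = ["begin", "end", "lat", "lon", "depth", "locationCode"]

def deployments_to_csv(deployments, path):
    """Write a list of deployment dicts to a .csv file, keeping only FIELDS."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for dep in deployments:
            writer.writerow({k: dep.get(k, "") for k in FIELDS})
```

`extrasaction="ignore"` silently drops any extra keys the API response carries, so only the six columns of interest reach the file.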
- Search for AIS data from the chosen date;
- Download the `.txt` files from ONC.
- Search for WAV data from the chosen date;
- Download the `.wav` files from ONC. WARNING: This step requires a lot of available disk space. The smallest deployment has more than 1 TB of audio data.
This step parses the AIS messages downloaded from ONC into JSON files, filtering by message type and discarding messages that lack the required values.
- Find the downloaded `.txt` AIS files;
- Keep only the relevant messages, namely: Position report, Static and voyage related data, Standard Class B equipment position report, Extended Class B equipment position report, and Static data report;
- From those messages, keep only a few fields: position (x and y), SOG, COG, true heading, and type and cargo codes;
- Save the corresponding information into a `.json` file.
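In standard AIS numbering the kept messages correspond to types 1-3 (position reports), 5 (static and voyage data), 18/19 (Class B reports), and 24 (static data report). A minimal filtering sketch, assuming each parsed message is a dict whose key names (`id`, `x`, `y`, `type_and_cargo`) are illustrative rather than the repository's actual field names:

```python
# Position-type messages must carry coordinates; static-type messages
# must carry the type-and-cargo code. Anything else is discarded.
POSITION_TYPES = {1, 2, 3, 18, 19}
STATIC_TYPES = {5, 24}

def filter_ais(messages):
    """Keep only the relevant AIS message types that have the needed values."""
    out = []
    for msg in messages:
        t = msg.get("id")
        if t in POSITION_TYPES:
            if msg.get("x") is not None and msg.get("y") is not None:
                out.append(msg)
        elif t in STATIC_TYPES:
            if msg.get("type_and_cargo") is not None:
                out.append(msg)
    return out
```

The surviving messages can then be dumped with `json.dump` into one `.json` file per input `.txt` file.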
- Read the `.json` files into dataframes;
- Propagate the 'type_and_cargo' messages across each MMSI;
- Drop messages without positional coordinates and/or duplicates;
- Calculate the distance from the hydrophone to the vessel;
- Keep only the data that fits the chosen scenario;
- Save the corresponding information into a `.feather` file.
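The README does not spell out the distance formula; a haversine great-circle distance between the hydrophone and the vessel position is a common choice and can be sketched as (function name and kilometre units are illustrative):

```python
from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius
```

Note that this ignores depth; for a hydrophone hundreds of metres down, the slant range would add a small correction.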
- Read the cleaned AIS files;
- Remove the vessel entries that have just one message;
- Dump the AIS data to a monolithic `.feather` file;
- Generate new data with linearly interpolated values to obtain more granularity;
- Combine the raw and interpolated data frames;
- Dump the interpolated AIS data to a monolithic `.feather` file.
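The linear interpolation step can be sketched as resampling each vessel's track onto a finer time grid. This is a stdlib sketch, not the repository's implementation (which presumably works on pandas dataframes); timestamps are assumed sorted and in seconds:

```python
def interpolate_track(times, lats, lons, step):
    """Linearly interpolate a vessel track onto a grid of `step` seconds.

    times: sorted POSIX timestamps (at least two, since single-message
    vessels were removed in the previous step).
    Returns a list of (time, lat, lon) tuples.
    """
    out = []
    t, i = times[0], 0
    while t <= times[-1]:
        # Advance to the segment [times[i], times[i+1]] containing t.
        while times[i + 1] < t:
            i += 1
        f = (t - times[i]) / (times[i + 1] - times[i])
        out.append((t,
                    lats[i] + f * (lats[i + 1] - lats[i]),
                    lons[i] + f * (lons[i + 1] - lons[i])))
        t += step
    return out
```

Combining these interpolated points with the raw fixes gives the denser track the step describes.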
- Find all of the cleaned AIS files for each deployment;
- Find the time intervals where only one vessel is within range;
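Finding the single-vessel intervals amounts to a sweep over the in-range intervals of all vessels, keeping the sub-intervals where the count of vessels present is exactly one. A minimal sketch under that assumption:

```python
def single_vessel_intervals(ranges):
    """Given (start, end) in-range intervals (one per vessel sighting),
    return the sub-intervals during which exactly one vessel is present."""
    events = []
    for start, end in ranges:
        events.append((start, 1))   # vessel enters range
        events.append((end, -1))    # vessel leaves range
    events.sort()                   # at equal times, exits sort before entries
    out, count, prev = [], 0, None
    for t, delta in events:
        if count == 1 and prev is not None and t > prev:
            out.append((prev, t))   # exactly one vessel between prev and t
        count += delta
        prev = t
    return out
```

For example, with one vessel in range over (0, 10) and another over (5, 8), only (0, 5) and (8, 10) qualify.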
- Read the `.csv` file to extract the timestamp periods;
- Search the raw WAV file folder for the matching time period;
- Read the WAV files and split them into 1-minute normalized pieces of audio;
- Group the pieces of audio by the corresponding AIS time ranges;
- Save them into the correct folder.
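The splitting step can be sketched on a plain sample sequence. Peak normalization and dropping the trailing partial minute are assumptions here, not details stated by the README:

```python
def split_minutes(samples, sample_rate):
    """Split a 1-D sample sequence into 1-minute chunks, each peak-normalised
    to [-1, 1]. A trailing partial minute is dropped (an assumption)."""
    chunk = 60 * sample_rate
    pieces = []
    for i in range(0, len(samples) - chunk + 1, chunk):
        piece = samples[i:i + chunk]
        peak = max(abs(s) for s in piece) or 1.0  # avoid dividing by zero
        pieces.append([s / peak for s in piece])
    return pieces
```

In practice the samples would come from the `wave` module or `scipy.io.wavfile`, and each piece would be written back out into the folder matching its AIS time range.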
- Search for CTD data from the chosen date;
- Download from ONC.
- Select only information of salinity, conductivity, temperature, pressure, and sound speed;
- Save the corresponding information into a `.feather` file.
- Get the following information from each time period: label, duration, file path, sample rate, class code, date, MMSI;
- Also compute the average over the time period of the CTD data: salinity, conductivity, temperature, pressure, and sound speed;
- Normalize the CTD information;
- Save all the data into a
.csv
file.
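The normalization scheme is not specified; a min-max rescaling of each CTD column to [0, 1] is one plausible choice and can be sketched as (row/column layout here is illustrative):

```python
def normalize_columns(rows, columns):
    """Min-max normalise the given columns of a list of dicts to [0, 1]."""
    out = [dict(r) for r in rows]  # leave the input untouched
    for col in columns:
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0    # constant column: map everything to 0
        for r in out:
            r[col] = (r[col] - lo) / span
    return out
```

A z-score normalization (subtract mean, divide by standard deviation) would be an equally reasonable alternative.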
- Count the occurrences of each class;
- Apply an undersampling strategy to crop the larger classes to the size of the smallest one;
- Save all the data into a `.csv` file.
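Undersampling to the smallest class can be sketched as follows; the `label` key and the seeded random sampling are assumptions for illustration:

```python
import random

def undersample(rows, label_key="label", seed=0):
    """Randomly crop every class to the size of the smallest one."""
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_key], []).append(r)
    n = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # seeded so the cropped dataset is reproducible
    out = []
    for group in by_class.values():
        out.extend(rng.sample(group, n))
    return out
```

After this step every class contributes the same number of rows, which prevents the classifier from simply favouring the most frequent vessel type.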
- Read the original metadata generated in Step 10;
- Create a new column named sub_init to accommodate the time frame where each new entry starts;
- Split the `.csv` rows according to the chosen duration;
- Create a new row for each new entry;
- Save all the data into a `.csv` file.
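Splitting one metadata row into fixed-duration sub-entries can be sketched as below; treating rows as dicts with `duration` in seconds is an assumption about the metadata layout:

```python
def split_row(row, duration):
    """Split one metadata row into sub-entries of `duration` seconds each,
    adding a sub_init column holding the sub-entry's start offset."""
    pieces = []
    init = 0.0
    while init + duration <= row["duration"]:
        piece = dict(row)           # copy all original metadata columns
        piece["sub_init"] = init
        piece["duration"] = duration
        pieces.append(piece)
        init += duration
    return pieces                   # a leftover shorter than duration is dropped
```

Applying `split_row` to every row and concatenating the results yields the expanded `.csv`.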
- Read all the metadata;
- Apply a random shuffle to the data;
- Save the data into three `.csv` files: Train, Validation, and Test.
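The shuffle-and-split can be sketched as follows; the 70/15/15 ratios and the seeded shuffle are assumptions, not values stated by the README:

```python
import random

def split_dataset(rows, train=0.7, val=0.15, seed=0):
    """Shuffle rows and split them into train/validation/test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])
```

Each of the three partitions is then written to its own `.csv` file.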
The results from this work were published at IEEE Access, at the following reference:
@article{domingos2022investigation,
author={Domingos, Lucas C. F. and Santos, Paulo E. and Skelton, Phillip S. M. and Brinkworth, Russell S. A. and Sammut, Karl},
journal={IEEE Access},
title={An Investigation of Preprocessing Filters and Deep Learning Methods for Vessel Type Classification With Underwater Acoustic Data},
year={2022},
volume={10},
number={},
pages={117582-117596},
doi={10.1109/ACCESS.2022.3220265}}
A complete literature review containing the background knowledge of this work is available on the following reference:
@article{domingos2022survey,
author={Domingos, Lucas C. F. and Santos, Paulo E. and Skelton, Phillip S. M. and Brinkworth, Russell S. A. and Sammut, Karl},
journal={Sensors},
title={A Survey of Underwater Acoustic Data Classification Methods Using Deep Learning for Shoreline Surveillance},
year={2022},
month={Mar},
volume={22},
number={6},
pages={2181},
publisher={MDPI AG},
issn={1424-8220},
doi={10.3390/s22062181},
url={http://dx.doi.org/10.3390/s22062181}}
This code, as well as the pipeline formulation and the code used as its basis, was developed in collaboration with Phillip Skelton.
Thanks to Paulo Santos for the guidance and participation in this project.